Edited/reposted with correct sample output.
I have a dataframe that looks like the following:
data = {
    "ID": [1, 1, 1, 2, 2, 2],
  "Year": [2021, 2021, 2023, 2015, 2017, 2018],
  "Combined": ['started', 'finished', 'started', 'started', 'finished', 'started'],
  "bool": [True, False, False, True, False, False],
"Update": ['started', 'finished', 'started', 'started', 'finished', 'started']
}
df = pd.DataFrame(data)
print(df) 
ID  Year  Combined   bool  
1  2021   started   True   
1  2021  finished  False  
1  2023   started  False   
2  2015   started   True   
2  2017  finished  False  
2  2018   started  False   
This dataframe is split into groups by ID.
I would like to make an updated combined column based on if df['bool'] == True, but only if df['bool'] == True AND there is another ‘finished’ row in the same group with a LATER (not the same) year.
Sample output:
ID Year Combined bool Update 1 2021 started True started 1 2021 finished False finished 1 2023 started False started 2 2015 started True finished 2 2017 finished False finished 2 2018 started False started
We are not updating the first group because there is not a finished value in a LATER year, and we are updating the second group because there is a finished value in a later year. Thank you!
Advertisement
Answer
This uses temporary columns, and avoids the apply path which can be generally slow:
# identify the start rows that have a True value
start_true = df.Combined.eq('started') & df['bool']
# identify rows where Combined is finished
condition = df.Combined.eq('finished')
# create more temporary variables
year_shift = df.Year.where(condition).bfill()
id_shift = df.ID.where(condition).bfill()
condition = df.ID.eq(id_shift) & df.Year.lt(year_shift)
# if it matches, 'finished', else just return what is in the Combined column
update = np.where(condition, 'finished', df.Combined)
df.assign(Update = update)
   ID  Year  Combined   bool    Update
0   1  2021   started   True   started
1   1  2021  finished  False  finished
2   1  2023   started  False   started
3   2  2015   started   True  finished
4   2  2017  finished  False  finished
5   2  2018   started  False   started
This solution assumes that the data is sorted on ID and Year in ascending order