Edited/reposted with correct sample output.
I have a dataframe that looks like the following:
data = {
"ID": [1, 1, 1, 2, 2, 2],
"Year": [2021, 2021, 2023, 2015, 2017, 2018],
"Combined": ['started', 'finished', 'started', 'started', 'finished', 'started'],
"bool": [True, False, False, True, False, False],
"Update": ['started', 'finished', 'started', 'started', 'finished', 'started']
}
df = pd.DataFrame(data)
print(df)
ID Year Combined bool
1 2021 started True
1 2021 finished False
1 2023 started False
2 2015 started True
2 2017 finished False
2 2018 started False
This dataframe is split into groups by ID.
I would like to make an updated combined column based on if df['bool'] == True, but only if df['bool'] == True AND there is another ‘finished’ row in the same group with a LATER (not the same) year.
Sample output:
ID Year Combined bool Update 1 2021 started True started 1 2021 finished False finished 1 2023 started False started 2 2015 started True finished 2 2017 finished False finished 2 2018 started False started
We are not updating the first group because there is not a finished value in a LATER year, and we are updating the second group because there is a finished value in a later year. Thank you!
Advertisement
Answer
This uses temporary columns, and avoids the apply path which can be generally slow:
# identify the start rows that have a True value
start_true = df.Combined.eq('started') & df['bool']
# identify rows where Combined is finished
condition = df.Combined.eq('finished')
# create more temporary variables
year_shift = df.Year.where(condition).bfill()
id_shift = df.ID.where(condition).bfill()
condition = df.ID.eq(id_shift) & df.Year.lt(year_shift)
# if it matches, 'finished', else just return what is in the Combined column
update = np.where(condition, 'finished', df.Combined)
df.assign(Update = update)
ID Year Combined bool Update
0 1 2021 started True started
1 1 2021 finished False finished
2 1 2023 started False started
3 2 2015 started True finished
4 2 2017 finished False finished
5 2 2018 started False started
This solution assumes that the data is sorted on ID and Year in ascending order