Edited/reposted with correct sample output.
I have a dataframe that looks like the following:
data = { "ID": [1, 1, 1, 2, 2, 2], "Year": [2021, 2021, 2023, 2015, 2017, 2018], "Combined": ['started', 'finished', 'started', 'started', 'finished', 'started'], "bool": [True, False, False, True, False, False], "Update": ['started', 'finished', 'started', 'started', 'finished', 'started'] } df = pd.DataFrame(data) print(df) ID Year Combined bool 1 2021 started True 1 2021 finished False 1 2023 started False 2 2015 started True 2 2017 finished False 2 2018 started False
This dataframe is split into groups by ID
.
I would like to make an updated combined
column based on if df['bool'] == True
, but only if df['bool'] == True
AND there is another ‘finished’ row in the same group with a LATER (not the same) year.
Sample output:
ID Year Combined bool Update 1 2021 started True started 1 2021 finished False finished 1 2023 started False started 2 2015 started True finished 2 2017 finished False finished 2 2018 started False started
We are not updating the first group because there is not a finished
value in a LATER year, and we are updating the second group because there is a finished
value in a later year. Thank you!
Advertisement
Answer
This uses temporary columns, and avoids the apply path which can be generally slow:
# identify the start rows that have a True value start_true = df.Combined.eq('started') & df['bool'] # identify rows where Combined is finished condition = df.Combined.eq('finished') # create more temporary variables year_shift = df.Year.where(condition).bfill() id_shift = df.ID.where(condition).bfill() condition = df.ID.eq(id_shift) & df.Year.lt(year_shift) # if it matches, 'finished', else just return what is in the Combined column update = np.where(condition, 'finished', df.Combined) df.assign(Update = update) ID Year Combined bool Update 0 1 2021 started True started 1 1 2021 finished False finished 2 1 2023 started False started 3 2 2015 started True finished 4 2 2017 finished False finished 5 2 2018 started False started
This solution assumes that the data is sorted on ID and Year in ascending order