Skip to content
Advertisement

Update column based on grouped date values

Edited/reposted with correct sample output.

I have a dataframe that looks like the following:

data = {
    "ID": [1, 1, 1, 2, 2, 2],
  "Year": [2021, 2021, 2023, 2015, 2017, 2018],
  "Combined": ['started', 'finished', 'started', 'started', 'finished', 'started'],
  "bool": [True, False, False, True, False, False],
"Update": ['started', 'finished', 'started', 'started', 'finished', 'started']

}

df = pd.DataFrame(data)

print(df) 

ID  Year  Combined   bool  
1  2021   started   True   
1  2021  finished  False  
1  2023   started  False   
2  2015   started   True   
2  2017  finished  False  
2  2018   started  False   

This dataframe is split into groups by ID.

I would like to make an updated combined column based on if df['bool'] == True, but only if df['bool'] == True AND there is another ‘finished’ row in the same group with a LATER (not the same) year.

Sample output:

ID  Year  Combined   bool    Update
1  2021   started   True   started
1  2021  finished  False  finished
1  2023   started  False   started
2  2015   started   True   finished
2  2017  finished  False  finished
2  2018   started  False   started

We are not updating the first group because there is not a finished value in a LATER year, and we are updating the second group because there is a finished value in a later year. Thank you!

Advertisement

Answer

This uses temporary columns, and avoids the apply path which can be generally slow:

# identify the start rows that have a True value
start_true = df.Combined.eq('started') & df['bool']

# identify rows where Combined is finished
condition = df.Combined.eq('finished')

# create more temporary variables
year_shift = df.Year.where(condition).bfill()
id_shift = df.ID.where(condition).bfill()
condition = df.ID.eq(id_shift) & df.Year.lt(year_shift)

# if it matches, 'finished', else just return what is in the Combined column
update = np.where(condition, 'finished', df.Combined)
df.assign(Update = update)

   ID  Year  Combined   bool    Update
0   1  2021   started   True   started
1   1  2021  finished  False  finished
2   1  2023   started  False   started
3   2  2015   started   True  finished
4   2  2017  finished  False  finished
5   2  2018   started  False   started

This solution assumes that the data is sorted on ID and Year in ascending order

User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement