Edited/reposted with correct sample output.
I have a dataframe that looks like the following:
JavaScript
x
22
22
1
data = {
2
"ID": [1, 1, 1, 2, 2, 2],
3
"Year": [2021, 2021, 2023, 2015, 2017, 2018],
4
"Combined": ['started', 'finished', 'started', 'started', 'finished', 'started'],
5
"bool": [True, False, False, True, False, False],
6
"Update": ['started', 'finished', 'started', 'started', 'finished', 'started']
7
8
}
9
10
df = pd.DataFrame(data)
11
12
print(df)
13
14
ID Year Combined bool
15
1 2021 started True
16
1 2021 finished False
17
1 2023 started False
18
2 2015 started True
19
2 2017 finished False
20
2 2018 started False
21
22
This dataframe is split into groups by ID
.
I would like to make an updated combined
column based on if df['bool'] == True
, but only if df['bool'] == True
AND there is another ‘finished’ row in the same group with a LATER (not the same) year.
Sample output:
JavaScript
1
9
1
ID Year Combined bool Update
2
1 2021 started True started
3
1 2021 finished False finished
4
1 2023 started False started
5
2 2015 started True finished
6
2 2017 finished False finished
7
2 2018 started False started
8
9
We are not updating the first group because there is not a finished
value in a LATER year, and we are updating the second group because there is a finished
value in a later year. Thank you!
Advertisement
Answer
This uses temporary columns, and avoids the apply path which can be generally slow:
JavaScript
1
23
23
1
# identify the start rows that have a True value
2
start_true = df.Combined.eq('started') & df['bool']
3
4
# identify rows where Combined is finished
5
condition = df.Combined.eq('finished')
6
7
# create more temporary variables
8
year_shift = df.Year.where(condition).bfill()
9
id_shift = df.ID.where(condition).bfill()
10
condition = df.ID.eq(id_shift) & df.Year.lt(year_shift)
11
12
# if it matches, 'finished', else just return what is in the Combined column
13
update = np.where(condition, 'finished', df.Combined)
14
df.assign(Update = update)
15
16
ID Year Combined bool Update
17
0 1 2021 started True started
18
1 1 2021 finished False finished
19
2 1 2023 started False started
20
3 2 2015 started True finished
21
4 2 2017 finished False finished
22
5 2 2018 started False started
23
This solution assumes that the data is sorted on ID and Year in ascending order