I have data ordered by ID and Year, with a series of event flags indicating whether a thing did or did not happen for that ID in that year:
ID | Year | x | y | z |
---|---|---|---|---|
1 | 2015 | 0 | 1 | 0 |
1 | 2016 | 1 | 1 | 0 |
1 | 2017 | 0 | 1 | 1 |
2 | 2015 | 1 | 0 | 1 |
2 | 2016 | 1 | 1 | 0 |
2 | 2017 | 0 | 1 | 1 |
I’d like to group by ID and, within each ID (ordered by Year), apply a cumulative count to each “event” column, such that I’m left with something like the following:
ID | Year | x_total | y_total | z_total |
---|---|---|---|---|
1 | 2015 | 0 | 1 | 0 |
1 | 2016 | 1 | 2 | 0 |
1 | 2017 | 1 | 3 | 1 |
2 | 2015 | 1 | 0 | 1 |
2 | 2016 | 2 | 1 | 1 |
2 | 2017 | 2 | 2 | 2 |
I’ve looked at various options using `cumsum` and `cumcount`, but I can’t seem to figure this out.
Answer
You can use `.groupby()` + `.cumsum()` to get the cumulative count for each “event” column, append the `_total` suffix to the column names with `.add_suffix()`, and then join the result back onto the first two columns:
```python
df[['ID', 'Year']].join(df.groupby('ID')[['x', 'y', 'z']].cumsum().add_suffix('_total'))
```
Result:
```
   ID  Year  x_total  y_total  z_total
0   1  2015        0        1        0
1   1  2016        1        2        0
2   1  2017        1        3        1
3   2  2015        1        0        1
4   2  2016        2        1        1
5   2  2017        2        2        2
```
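If you want to reproduce this end to end, here is a minimal, self-contained sketch that builds the sample frame from the question and applies the one-liner above (the `df`/`result` variable names and the `print` call are just illustrative):

```python
import pandas as pd

# Sample data from the question (rows already ordered by ID, then Year)
df = pd.DataFrame({
    'ID':   [1, 1, 1, 2, 2, 2],
    'Year': [2015, 2016, 2017, 2015, 2016, 2017],
    'x':    [0, 1, 0, 1, 1, 0],
    'y':    [1, 1, 1, 0, 1, 1],
    'z':    [0, 0, 1, 1, 0, 1],
})

# Running total of each event flag within each ID, renamed with a _total
# suffix, then joined back onto the identifying columns.
result = df[['ID', 'Year']].join(
    df.groupby('ID')[['x', 'y', 'z']].cumsum().add_suffix('_total')
)
print(result)
```

Note that `.cumsum()` accumulates in row order, so this assumes the rows are already sorted by Year within each ID (as they are here); otherwise, sort first with `df.sort_values(['ID', 'Year'])`.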