
How to calculate a cumulative subtraction with a threshold, and reset it after the threshold is reached, within groups in a pandas DataFrame in Python?

I have a dataframe with 4 columns. The original dataframe contains two columns, trip and timestamps. To reach my goal, I calculated ‘TimeDistance’, which is the difference between consecutive rows of timestamps within each trip, and ‘cum’, which is the cumulative sum over the TimeDistance column, but this did not give me the result I want.

import pandas as pd
import numpy as np

data={'trip':[1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,4,4,5,5,5],
      'timestamps':[1235471761, 1235471763, 1235471765, 1235471767, 1235471778, 1235471780, 1235471782, 1235471784, 1235471786, 1235471788,1235471820,1235471826,1235471829,1235471890,1235471893,1235471894,1235471896,1235471900,1235471910,1235471912]}

df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
df['cum']=df.groupby('trip')['TimeDistance'].cumsum()
df

This is the output:

    trip  timestamps  TimeDistance   cum
0      1  1235471761           NaN   NaN
1      1  1235471763           2.0   2.0
2      1  1235471765           2.0   4.0
3      1  1235471767           2.0   6.0
4      1  1235471778          11.0  17.0
5      1  1235471780           2.0  19.0
6      1  1235471782           2.0  21.0
7      2  1235471784           NaN   NaN
8      2  1235471786           2.0   2.0
9      2  1235471788           2.0   4.0
10     2  1235471820          32.0  36.0
11     2  1235471826           6.0  42.0
12     2  1235471829           3.0  45.0
13     3  1235471890           NaN   NaN
14     3  1235471893           3.0   3.0
15     4  1235471894           NaN   NaN
16     4  1235471896           2.0   2.0
17     5  1235471900           NaN   NaN
18     5  1235471910          10.0  10.0
19     5  1235471912           2.0  12.0

This is not my desired output. For each trip, I want to subtract the timestamp of the first row from the timestamp of each following row, store the result in a new column (cum), and whenever it reaches 10, do the following for the next rows:

  • Reset the subtraction.
  • The row after the one in which the threshold is reached is treated as the new origin, and its value must be zero.
  • Continue the subtraction from this row (which is equal to zero) through the subsequent rows until we reach 10 again.
  • Whenever we reach the end of a trip, the subtraction also resets for the new trip.
  • Repeat this procedure for all trips.

For example, in row 4 we have reached the threshold because the value in the ‘cum’ column is 17. So the next row (row 5) in the ‘cum’ column must be 0 (but it is 19), and for row 6 we have to calculate the difference between the timestamps in rows 5 and 6, which should be 2, not 21!

For more clarity, I have attached a screenshot of my desired output.


Answer

You can use a mask to reset the cumsum:

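# time gap between consecutive rows within each trip
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)

# flag the row that follows a gap greater than 10 (the threshold) within each trip
m = df['TimeDistance'].gt(10).groupby(df['trip']).shift(fill_value=False)

# set the flagged rows to 0 so they become the new origin, then take the
# cumulative sum within each (trip, reset block) pair; m.cumsum() goes up by
# one at every flagged row, which starts a new block
df['cum'] = (df['TimeDistance']
             .mask(m, 0)
             .groupby([df['trip'], m.cumsum()])
             .cumsum()
            )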

output:

    trip  timestamps  TimeDistance   cum
0      1  1235471761           NaN   NaN
1      1  1235471763           2.0   2.0
2      1  1235471765           2.0   4.0
3      1  1235471767           2.0   6.0
4      1  1235471778          11.0  17.0
5      1  1235471780           2.0   0.0
6      1  1235471782           2.0   2.0
7      2  1235471784           NaN   NaN
8      2  1235471786           2.0   2.0
9      2  1235471788           2.0   4.0
10     2  1235471820          32.0  36.0
11     2  1235471826           6.0   0.0
12     2  1235471829           3.0   3.0
13     3  1235471890           NaN   NaN
14     3  1235471893           3.0   3.0
15     4  1235471894           NaN   NaN
16     4  1235471896           2.0   2.0
17     5  1235471900           NaN   NaN
18     5  1235471910          10.0  10.0
19     5  1235471912           2.0  12.0
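
Note that the question words the reset in terms of the running total ("whenever it reaches 10"), while the mask above triggers on a single gap greater than 10. On the sample data the two readings give the same result, but they can differ when several small gaps add up past the threshold. Below is a minimal sketch of the running-total reading, written as a plain loop applied per trip. It assumes a strict `>` comparison to stay consistent with `.gt(10)` above; `cumsum_with_reset` and the `cum_loop` column are hypothetical names chosen for illustration, not part of pandas.

```python
import pandas as pd


def cumsum_with_reset(gaps, threshold=10):
    """Running total of `gaps` that restarts at 0 on the row after it exceeds `threshold`."""
    out = []
    total = 0.0
    reset_next = False
    for gap in gaps:
        if pd.isna(gap):
            # The first row of a trip has no previous timestamp, so no gap.
            out.append(float("nan"))
            continue
        if reset_next:
            # This row becomes the new origin: its own gap is discarded.
            total = 0.0
            reset_next = False
        else:
            total += gap
        out.append(total)
        # Assumption: strict comparison, matching `.gt(10)` in the answer.
        if total > threshold:
            reset_next = True
    return pd.Series(out, index=gaps.index)


data = {'trip': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5],
        'timestamps': [1235471761, 1235471763, 1235471765, 1235471767,
                       1235471778, 1235471780, 1235471782, 1235471784,
                       1235471786, 1235471788, 1235471820, 1235471826,
                       1235471829, 1235471890, 1235471893, 1235471894,
                       1235471896, 1235471900, 1235471910, 1235471912]}

df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff()

# Apply the loop separately to each trip so that totals never cross trips.
df['cum_loop'] = df.groupby('trip')['TimeDistance'].transform(cumsum_with_reset)
```

The loop is slower than the vectorized mask on large frames, but it follows the running total directly: for gaps of 6, 6 and 3 it resets on the third row (because 6 + 6 exceeds 10), whereas the mask would not, since no single gap is above 10. If "reaches 10" should include exactly 10, changing `>` to `>=` would make row 19 restart at 0.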
User contributions licensed under: CC BY-SA