This is a dataframe, with 4 columns. The primary dataframe contains two columns, trip and timestamps, and I calculated ‘TimeDistance’ which is the difference between rows of timestamps, and ‘cum’ which is the cumulative sum over TimeDistance column. in order to reach my goal, but I could not.
import pandas as pd import numpy as np data={'trip':[1,1,1,1,1,1,1,2,2,2,2,2,2,3,3,4,4,5,5,5], 'timestamps':[1235471761, 1235471763, 1235471765, 1235471767, 1235471778, 1235471780, 1235471782, 1235471784, 1235471786, 1235471788,1235471820,1235471826,1235471829,1235471890,1235471893,1235471894,1235471896,1235471900,1235471910,1235471912]} df = pd.DataFrame(data) df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1) df['cum']=df.groupby('trip')['TimeDistance'].cumsum() df
this is the output:
trip timestamps TimeDistance cum 0 1 1235471761 NaN NaN 1 1 1235471763 2.0 2.0 2 1 1235471765 2.0 4.0 3 1 1235471767 2.0 6.0 4 1 1235471778 11.0 17.0 5 1 1235471780 2.0 19.0 6 1 1235471782 2.0 21.0 7 2 1235471784 NaN NaN 8 2 1235471786 2.0 2.0 9 2 1235471788 2.0 4.0 10 2 1235471820 32.0 36.0 11 2 1235471826 6.0 42.0 12 2 1235471829 3.0 45.0 13 3 1235471890 NaN NaN 14 3 1235471893 3.0 3.0 15 4 1235471894 NaN NaN 16 4 1235471896 2.0 2.0 17 5 1235471900 NaN NaN 18 5 1235471910 10.0 10.0 19 5 1235471912 2.0 12.0
This output is not my desired output, I want to subtract each row of the timestamp column from the first row for each trip, store it in a new column (cum), and whenever it reaches 10, do these for the next rows:
- reset the subtraction,
- the next row after the row in which the threshold is reached will be considered as the origin and it must be equal to zero,
- continue subtraction from this row (which is equal to zero) and subsequent rows again until we reach 10.
- Whenever we reach the end of a trip, the subtraction will also reset for a new trip.
- Repeat this procedure for all trips.
for example, in row 4, we have reached to threshold because the value in ‘cum’ column is 17, so, the next row in the ‘cum’ column must be 0 (but it is 19) and for row 6, we have to calculate the difference between timestamps in row 5, 6 that should be 2, not 19!
Advertisement
Answer
You can use a mask to reset the cumsum:
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1) # get rows above threshold m = df['TimeDistance'].gt(10).groupby(df['trip']).shift(fill_value=False) df['cum'] = (df['TimeDistance'] .mask(m, 0) .groupby([df['trip'], m.cumsum()]) .cumsum() )
output:
trip timestamps TimeDistance cum 0 1 1235471761 NaN NaN 1 1 1235471763 2.0 2.0 2 1 1235471765 2.0 4.0 3 1 1235471767 2.0 6.0 4 1 1235471778 11.0 17.0 5 1 1235471780 2.0 0.0 6 1 1235471782 2.0 2.0 7 2 1235471784 NaN NaN 8 2 1235471786 2.0 2.0 9 2 1235471788 2.0 4.0 10 2 1235471820 32.0 36.0 11 2 1235471826 6.0 0.0 12 2 1235471829 3.0 3.0 13 3 1235471890 NaN NaN 14 3 1235471893 3.0 3.0 15 4 1235471894 NaN NaN 16 4 1235471896 2.0 2.0 17 5 1235471900 NaN NaN 18 5 1235471910 10.0 10.0 19 5 1235471912 2.0 12.0