Skip to content
Advertisement

How do I reverse a cumulative count from a specific point based on a condition and then resume the count in a pandas data frame?

I am trying to count the number of days between dates (cumulatively), (grouped by a column denoted as id), however, I want to reset the counter whenever a condition is satisfied.

I want to at the same time create a new column and add the values to that column for those particular rows. Additionally, I want to also count back the reset point, denoting negative days.

Currently, I have tried this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'reset':['N','N','Y','N','N','Y','Y','Y','Y','Y', 'Y'],
                   'category':['low','low','low','low','low','medium','high','high','medium','medium', 'medium'],
                   'date':['2019-09-04','2019-09-05','2019-09-06','2019-09-07','2019-09-08','2021-05-23','2021-05-23','2021-05-23','2021-05-23','2021-05-23', '2021-05-22'],
                   'id':[16860,16860,16860,16860,16860,17611,23409,21765,19480,9166, 9166]
                   })


df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')
df = df.sort_values(['id','date'])

#create extra grouping column based on reset day
df['group'] = df['reset'].replace({'N':False,'Y':True})
df['group'] = df.groupby('id')['group'].cumsum()

df['tdelta'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta'] = df.groupby(['id','group'])['tdelta'].cumsum().fillna(0)


df = df.sort_values(by='date', ascending=False)
df['tdelta reverse'] =  df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta reverse'] = df.groupby(['id','group'])['tdelta reverse'].cumsum().fillna(0)
df = df.sort_values(['id','date'])
print(df)

which produces this:

   reset category       date     id  group  tdelta  tdelta reverse
10     Y   medium 2021-05-22   9166    1.0     0.0             0.0
9      Y   medium 2021-05-23   9166    2.0     0.0             0.0
0      N      low 2019-09-04  16860    0.0     0.0            -1.0
1      N      low 2019-09-05  16860    0.0     1.0             0.0
2      Y      low 2019-09-06  16860    1.0     0.0            -2.0
3      N      low 2019-09-07  16860    1.0     1.0            -1.0
4      N      low 2019-09-08  16860    1.0     2.0             0.0
5      Y   medium 2021-05-23  17611    1.0     0.0             0.0
8      Y   medium 2021-05-23  19480    1.0     0.0             0.0
7      Y     high 2021-05-23  21765    1.0     0.0             0.0
6      Y     high 2021-05-23  23409    1.0     0.0             0.0

Now, that I have added “tdelta reverse”, here is a clearer example (with different data) of what I want the data frame to look like in the end result:

reset   category    date    id    tdelta1  tdelta2  tdelta3  tdelta# ...
    N   medium  22/05/2021  16860   -4      
    N   medium  23/05/2021  16860   -3      
    N   medium  24/05/2021  16860   -2      
    N   medium  25/05/2021  16860   -1      
    Y   medium  26/05/2021  16860    0      
    N   medium  27/05/2021  16860    1       -4 
    N   medium  28/05/2021  16860    2       -3 
    N   medium  29/05/2021  16860    3       -2 
    N   medium  30/05/2021  16860    4       -1 
    Y   medium  31/05/2021  16860             0 
    N   medium  01/06/2021  16860             1        -3
    N   medium  02/06/2021  16860             2        -2
    N   medium  03/06/2021  16860             3        -1
    Y   medium  04/06/2021  16860                       0
    N   medium  05/06/2021  16860                       1
    N   medium  06/06/2021  16860                       2
    N   medium  07/06/2021  16860                       3
    N   medium  08/06/2021  16860                       4

Essentially, a new ‘tdelta#’ column should be created for each group, where we get the ‘tdelta reverse’ values until a reset point and the ‘tdelta’ values afterwards (for each group).

As a side note, if an id does not have several groups (reset points), it is ok to not fill in these additional ‘tdelta#’ columns.

At the moment, I am creating new columns and filling them with the ‘tdelta’ values:

for group in df['group'].unique():
     df[f'tdelta{int(group)}'] = df[(df.group == group)]['tdelta']

However, I also need to add the ‘tdelta reverse’ values so it looks like my end example.

I’m thinking that I should perhaps use iloc with groupby and/or do some splicing?

Any suggestions on how I can tackle this?

Advertisement

Answer

So I have solved it (albeit with an ad hoc method in my opinion) by adding a pandas combine_first function that combines non nan values from both columns as seen in the try and except statement lower down in the code below:

# defined a new df for clearer output

df = pd.DataFrame({'reset':['N','Y','N','N','N','Y','N','N','Y','N','N'],
                   'category':['low','low','low','low','low','low','low','low','low','low', 'low'],
                   'date':['2019-09-04','2020-11-06','2020-11-06','2019-09-07','2019-11-08','2021-05-21','2021-06-23','2021-07-24','2021-08-25','2021-09-23', '2021-10-21'],
                   'id':[16860,16860,16860,16860,16860,16860,16860,16860,16860,16860, 16860]
                   })


df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')
df = df.sort_values(['id','date'])

#create extra grouping column based on reset day
df['group'] = df['reset'].replace({'N':False,'Y':True})
df['group'] = df.groupby('id')['group'].cumsum()
df['tdelta'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta'] = df.groupby(['id','group'])['tdelta'].cumsum().fillna(0)

df = df.sort_values(by='date', ascending=False)
df['tdelta reverse'] =  df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D') 
df['tdelta reverse'] = df.groupby(['id','group'])['tdelta reverse'].cumsum().fillna(0) 


# the problem solved via combine_first which combines the non nan values from both columns
df = df.sort_values(['id','date'])
    
for group in df['group'].unique():
    group_minus_1 = group - 1.0
    try:
        df[f'tdelta{int(group)}'] = df[(df['group'] == group)]['tdelta']
        df[f'tdelta{int(group)}'] = df[f'tdelta{int(group)}'].combine_first(df[(df['group'] == group_minus_1)]['tdelta reverse']) 
    except:
        continue

#print(df)


This is the output:

  reset category       date     id  group  tdelta  tdelta reverse  tdelta0  tdelta1  tdelta2  tdelta3
0      N      low 2019-09-04  16860    0.0     NaN           -65.0      0.0    -65.0      NaN      NaN
3      N      low 2019-09-07  16860    0.0     NaN           -62.0      3.0    -62.0      NaN      NaN
4      N      low 2019-11-08  16860    0.0     NaN             0.0     65.0      0.0      NaN      NaN
1      Y      low 2020-11-06  16860    1.0   250.0             0.0      NaN      0.0      0.0      NaN
2      N      low 2020-11-06  16860    1.0   250.0             0.0      NaN      0.0      0.0      NaN
5      Y      low 2021-05-21  16860    2.0   250.0           -64.0      NaN      NaN      0.0    -64.0
6      N      low 2021-06-23  16860    2.0     NaN           -31.0      NaN      NaN     33.0    -31.0
7      N      low 2021-07-24  16860    2.0     NaN             0.0      NaN      NaN     64.0      0.0
8      Y      low 2021-08-25  16860    3.0   250.0           -57.0      NaN      NaN      NaN      0.0
9      N      low 2021-09-23  16860    3.0     NaN           -28.0      NaN      NaN      NaN     29.0
10     N      low 2021-10-21  16860    3.0     NaN             0.0      NaN      NaN      NaN     57.0
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement