I am trying to cumulatively count the number of days between dates, grouped by an `id` column, but I want to reset the counter whenever a condition is satisfied.
At the same time, I want to create a new column and write those counts into it for the relevant rows. Additionally, I want to count backwards towards each reset point, denoting the days before it as negative.
Currently, I have tried this:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'reset': ['N','N','Y','N','N','Y','Y','Y','Y','Y','Y'],
    'category': ['low','low','low','low','low','medium','high','high','medium','medium','medium'],
    'date': ['2019-09-04','2019-09-05','2019-09-06','2019-09-07','2019-09-08',
             '2021-05-23','2021-05-23','2021-05-23','2021-05-23','2021-05-23','2021-05-22'],
    'id': [16860,16860,16860,16860,16860,17611,23409,21765,19480,9166,9166]
})
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')  # format must match the dashes in the strings
df = df.sort_values(['id','date'])

# create an extra grouping column that increments on each reset day
df['group'] = df['reset'].replace({'N': False, 'Y': True})
df['group'] = df.groupby('id')['group'].cumsum()

# forward count: cumulative days since the last reset, per (id, group)
df['tdelta'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta'] = df.groupby(['id','group'])['tdelta'].cumsum().fillna(0)

# reverse count: negative days until the next reset, per (id, group)
df = df.sort_values(by='date', ascending=False)
df['tdelta reverse'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta reverse'] = df.groupby(['id','group'])['tdelta reverse'].cumsum().fillna(0)

df = df.sort_values(['id','date'])
print(df)
which produces this:
   reset category       date     id  group  tdelta  tdelta reverse
10     Y   medium 2021-05-22   9166    1.0     0.0             0.0
9      Y   medium 2021-05-23   9166    2.0     0.0             0.0
0      N      low 2019-09-04  16860    0.0     0.0            -1.0
1      N      low 2019-09-05  16860    0.0     1.0             0.0
2      Y      low 2019-09-06  16860    1.0     0.0            -2.0
3      N      low 2019-09-07  16860    1.0     1.0            -1.0
4      N      low 2019-09-08  16860    1.0     2.0             0.0
5      Y   medium 2021-05-23  17611    1.0     0.0             0.0
8      Y   medium 2021-05-23  19480    1.0     0.0             0.0
7      Y     high 2021-05-23  21765    1.0     0.0             0.0
6      Y     high 2021-05-23  23409    1.0     0.0             0.0
Now that I have added “tdelta reverse”, here is a clearer example (with different data) of what I want the final data frame to look like:
reset category       date     id  tdelta1  tdelta2  tdelta3  tdelta# ...
N     medium   22/05/2021  16860       -4
N     medium   23/05/2021  16860       -3
N     medium   24/05/2021  16860       -2
N     medium   25/05/2021  16860       -1
Y     medium   26/05/2021  16860        0
N     medium   27/05/2021  16860        1       -4
N     medium   28/05/2021  16860        2       -3
N     medium   29/05/2021  16860        3       -2
N     medium   30/05/2021  16860        4       -1
Y     medium   31/05/2021  16860                 0
N     medium   01/06/2021  16860                 1       -3
N     medium   02/06/2021  16860                 2       -2
N     medium   03/06/2021  16860                 3       -1
Y     medium   04/06/2021  16860                          0
N     medium   05/06/2021  16860                          1
N     medium   06/06/2021  16860                          2
N     medium   07/06/2021  16860                          3
N     medium   08/06/2021  16860                          4
Essentially, a new ‘tdelta#’ column should be created for each group, where we get the ‘tdelta reverse’ values until a reset point and the ‘tdelta’ values afterwards (for each group).
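On a toy pair of series around one reset boundary, that stitching can be sketched like this (the values are illustrative, not taken from the data above):

```python
import numpy as np
import pandas as pd

# Forward count for group n: NaN before its reset, then 0, 1, 2 from the reset onward.
tdelta = pd.Series([np.nan, np.nan, 0.0, 1.0, 2.0])
# Reverse count for group n-1: negative days before the reset, NaN afterwards.
tdelta_reverse = pd.Series([-2.0, -1.0, np.nan, np.nan, np.nan])

# 'tdelta#' for group n: reverse values until the reset, forward values after it.
stitched = tdelta.fillna(tdelta_reverse)
print(stitched.tolist())  # [-2.0, -1.0, 0.0, 1.0, 2.0]
```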
As a side note, if an id does not have several groups (reset points), it is ok to not fill in these additional ‘tdelta#’ columns.
At the moment, I am creating new columns and filling them with the ‘tdelta’ values:
for group in df['group'].unique():
    df[f'tdelta{int(group)}'] = df[df.group == group]['tdelta']
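As a side note, the same spreading of ‘tdelta’ into one column per group can be done without a loop by pivoting on the group label; a sketch on a tiny frame (the column names are assumptions):

```python
import pandas as pd

# Sketch: spread 'tdelta' into one column per group without a loop.
df = pd.DataFrame({'group': [0, 0, 1, 1], 'tdelta': [0.0, 1.0, 0.0, 2.0]})
wide = df.pivot(columns='group', values='tdelta')          # one column per group value
wide.columns = [f'tdelta{int(c)}' for c in wide.columns]   # rename 0, 1 -> tdelta0, tdelta1
out = df.join(wide)                                        # index-aligned join back
# out now has tdelta0 = [0.0, 1.0, NaN, NaN] and tdelta1 = [NaN, NaN, 0.0, 2.0]
```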
However, I also need to add the ‘tdelta reverse’ values so it looks like my end example.
I’m thinking that I should perhaps use iloc with groupby and/or do some slicing?
Any suggestions on how I can tackle this?
Answer
So I have solved it (albeit with what I feel is an ad hoc method) by using pandas’ `combine_first` function, which combines the non-NaN values from both columns, inside the try/except block lower down in the code below:
# defined a new df for clearer output
df = pd.DataFrame({
    'reset': ['N','Y','N','N','N','Y','N','N','Y','N','N'],
    'category': ['low','low','low','low','low','low','low','low','low','low','low'],
    'date': ['2019-09-04','2020-11-06','2020-11-06','2019-09-07','2019-11-08',
             '2021-05-21','2021-06-23','2021-07-24','2021-08-25','2021-09-23','2021-10-21'],
    'id': [16860]*11
})
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')  # format must match the dashes in the strings
df = df.sort_values(['id','date'])

# create an extra grouping column based on reset day
df['group'] = df['reset'].replace({'N': False, 'Y': True})
df['group'] = df.groupby('id')['group'].cumsum()

df['tdelta'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta'] = df.groupby(['id','group'])['tdelta'].cumsum().fillna(0)

df = df.sort_values(by='date', ascending=False)
df['tdelta reverse'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta reverse'] = df.groupby(['id','group'])['tdelta reverse'].cumsum().fillna(0)

# the problem solved via combine_first, which combines the non-NaN values from both columns:
# group n's forward 'tdelta' is filled with group n-1's 'tdelta reverse'
df = df.sort_values(['id','date'])
for group in df['group'].unique():
    group_minus_1 = group - 1.0
    try:
        df[f'tdelta{int(group)}'] = df[df['group'] == group]['tdelta']
        df[f'tdelta{int(group)}'] = df[f'tdelta{int(group)}'].combine_first(
            df[df['group'] == group_minus_1]['tdelta reverse'])
    except Exception:  # narrowed from a bare except; skip groups that cannot be combined
        continue

print(df)
This is the output:
    reset category       date     id  group  tdelta  tdelta reverse  tdelta0  tdelta1  tdelta2  tdelta3
0       N      low 2019-09-04  16860    0.0     NaN           -65.0      0.0    -65.0      NaN      NaN
3       N      low 2019-09-07  16860    0.0     NaN           -62.0      3.0    -62.0      NaN      NaN
4       N      low 2019-11-08  16860    0.0     NaN             0.0     65.0      0.0      NaN      NaN
1       Y      low 2020-11-06  16860    1.0   250.0             0.0      NaN      0.0      0.0      NaN
2       N      low 2020-11-06  16860    1.0   250.0             0.0      NaN      0.0      0.0      NaN
5       Y      low 2021-05-21  16860    2.0   250.0           -64.0      NaN      NaN      0.0    -64.0
6       N      low 2021-06-23  16860    2.0     NaN           -31.0      NaN      NaN     33.0    -31.0
7       N      low 2021-07-24  16860    2.0     NaN             0.0      NaN      NaN     64.0      0.0
8       Y      low 2021-08-25  16860    3.0   250.0           -57.0      NaN      NaN      NaN      0.0
9       N      low 2021-09-23  16860    3.0     NaN           -28.0      NaN      NaN      NaN     29.0
10      N      low 2021-10-21  16860    3.0     NaN             0.0      NaN      NaN      NaN     57.0
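For reference, here is a minimal, standalone illustration of what `combine_first` does (the series here are made up):

```python
import numpy as np
import pandas as pd

# combine_first keeps the caller's non-NaN values and fills its NaN
# slots from the other series, aligning on the index.
a = pd.Series([np.nan, 0.0, 1.0])
b = pd.Series([-1.0, np.nan, 99.0])
print(a.combine_first(b).tolist())  # [-1.0, 0.0, 1.0]
```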