I am trying to cumulatively count the number of days between dates, grouped by a column denoted as id; however, I want to reset the counter whenever a condition is satisfied.
At the same time, I want to create a new column and write those values into it for the relevant rows. Additionally, I also want to count backwards from each reset point, denoting negative days.
Currently, I have tried this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'reset':['N','N','Y','N','N','Y','Y','Y','Y','Y','Y'],
                   'category':['low','low','low','low','low','medium','high','high','medium','medium','medium'],
                   'date':['2019-09-04','2019-09-05','2019-09-06','2019-09-07','2019-09-08','2021-05-23','2021-05-23','2021-05-23','2021-05-23','2021-05-23','2021-05-22'],
                   'id':[16860,16860,16860,16860,16860,17611,23409,21765,19480,9166,9166]
                   })
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.sort_values(['id','date'])
#create extra grouping column based on reset day
df['group'] = df['reset'].replace({'N':False,'Y':True})
df['group'] = df.groupby('id')['group'].cumsum()
df['tdelta'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta'] = df.groupby(['id','group'])['tdelta'].cumsum().fillna(0)
df = df.sort_values(by='date', ascending=False)
df['tdelta reverse'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta reverse'] = df.groupby(['id','group'])['tdelta reverse'].cumsum().fillna(0)
df = df.sort_values(['id','date'])
print(df)
which produces this:
reset category date id group tdelta tdelta reverse
10 Y medium 2021-05-22 9166 1.0 0.0 0.0
9 Y medium 2021-05-23 9166 2.0 0.0 0.0
0 N low 2019-09-04 16860 0.0 0.0 -1.0
1 N low 2019-09-05 16860 0.0 1.0 0.0
2 Y low 2019-09-06 16860 1.0 0.0 -2.0
3 N low 2019-09-07 16860 1.0 1.0 -1.0
4 N low 2019-09-08 16860 1.0 2.0 0.0
5 Y medium 2021-05-23 17611 1.0 0.0 0.0
8 Y medium 2021-05-23 19480 1.0 0.0 0.0
7 Y high 2021-05-23 21765 1.0 0.0 0.0
6 Y high 2021-05-23 23409 1.0 0.0 0.0
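As an aside, the 'group' column above comes from the boolean-cumsum trick: each 'Y' reset bumps a running group id. Seen in isolation on toy data (not the frame above):

```python
import pandas as pd

# A reset flag of 'Y' starts a new group; the cumulative sum of the
# boolean flag yields a running group id per row.
reset = pd.Series(['N', 'N', 'Y', 'N', 'N', 'Y', 'N'])
group = reset.eq('Y').cumsum()
print(group.tolist())  # [0, 0, 1, 1, 1, 2, 2]
```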
Now that I have added “tdelta reverse”, here is a clearer example (with different data) of what I want the data frame to look like in the end result:
reset category date id tdelta1 tdelta2 tdelta3 tdelta# ...
N medium 22/05/2021 16860 -4
N medium 23/05/2021 16860 -3
N medium 24/05/2021 16860 -2
N medium 25/05/2021 16860 -1
Y medium 26/05/2021 16860 0
N medium 27/05/2021 16860 1 -4
N medium 28/05/2021 16860 2 -3
N medium 29/05/2021 16860 3 -2
N medium 30/05/2021 16860 4 -1
Y medium 31/05/2021 16860 0
N medium 01/06/2021 16860 1 -3
N medium 02/06/2021 16860 2 -2
N medium 03/06/2021 16860 3 -1
Y medium 04/06/2021 16860 0
N medium 05/06/2021 16860 1
N medium 06/06/2021 16860 2
N medium 07/06/2021 16860 3
N medium 08/06/2021 16860 4
Essentially, a new ‘tdelta#’ column should be created for each group, taking the ‘tdelta reverse’ values up to a reset point and the ‘tdelta’ values from it onwards (for each group).
As a side note, if an id does not have several groups (reset points), it is fine to leave these additional ‘tdelta#’ columns unfilled.
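To make the intent concrete, for a column like ‘tdelta1’ above the stitching for one group pair can be sketched on a toy frame (illustrative data, not the frame in this question):

```python
import pandas as pd
import numpy as np

# Toy frame: group 0 precedes the reset, group 1 starts at it.
df = pd.DataFrame({
    'group':          [0, 0, 1, 1, 1],
    'tdelta':         [0.0, 1.0, 0.0, 1.0, 2.0],
    'tdelta reverse': [-2.0, -1.0, -2.0, -1.0, 0.0],
})

# tdelta1: forward counts on group-1 rows, backward counts on group-0 rows.
df['tdelta1'] = np.where(df['group'] == 1,
                         df['tdelta'], df['tdelta reverse'])
print(df['tdelta1'].tolist())  # [-2.0, -1.0, 0.0, 1.0, 2.0]
```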
At the moment, I am creating new columns and filling them with the ‘tdelta’ values:
for group in df['group'].unique():
    df[f'tdelta{int(group)}'] = df[df.group == group]['tdelta']
However, I also need to add the ‘tdelta reverse’ values so it looks like my end example.
I’m thinking that I should perhaps use iloc with groupby and/or do some slicing?
Any suggestions on how I can tackle this?
Answer
So I have solved it (albeit with what I consider an ad hoc method) by adding a pandas combine_first call, which combines the non-NaN values from both columns, inside the try/except statement lower down in the code below:
# defined a new df for clearer output
df = pd.DataFrame({'reset':['N','Y','N','N','N','Y','N','N','Y','N','N'],
                   'category':['low','low','low','low','low','low','low','low','low','low','low'],
                   'date':['2019-09-04','2020-11-06','2020-11-06','2019-09-07','2019-11-08','2021-05-21','2021-06-23','2021-07-24','2021-08-25','2021-09-23','2021-10-21'],
                   'id':[16860,16860,16860,16860,16860,16860,16860,16860,16860,16860,16860]
                   })
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.sort_values(['id','date'])
#create extra grouping column based on reset day
df['group'] = df['reset'].replace({'N':False,'Y':True})
df['group'] = df.groupby('id')['group'].cumsum()
df['tdelta'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta'] = df.groupby(['id','group'])['tdelta'].cumsum().fillna(0)
df = df.sort_values(by='date', ascending=False)
df['tdelta reverse'] = df.groupby(['id','group'])['date'].diff() / np.timedelta64(1, 'D')
df['tdelta reverse'] = df.groupby(['id','group'])['tdelta reverse'].cumsum().fillna(0)
# the problem solved via combine_first which combines the non nan values from both columns
df = df.sort_values(['id','date'])
for group in df['group'].unique():
    group_minus_1 = group - 1.0
    try:
        df[f'tdelta{int(group)}'] = df[df['group'] == group]['tdelta']
        df[f'tdelta{int(group)}'] = df[f'tdelta{int(group)}'].combine_first(
            df[df['group'] == group_minus_1]['tdelta reverse'])
    except Exception:
        # skip groups that cannot be combined
        continue
print(df)
This is the output:
reset category date id group tdelta tdelta reverse tdelta0 tdelta1 tdelta2 tdelta3
0 N low 2019-09-04 16860 0.0 NaN -65.0 0.0 -65.0 NaN NaN
3 N low 2019-09-07 16860 0.0 NaN -62.0 3.0 -62.0 NaN NaN
4 N low 2019-11-08 16860 0.0 NaN 0.0 65.0 0.0 NaN NaN
1 Y low 2020-11-06 16860 1.0 250.0 0.0 NaN 0.0 0.0 NaN
2 N low 2020-11-06 16860 1.0 250.0 0.0 NaN 0.0 0.0 NaN
5 Y low 2021-05-21 16860 2.0 250.0 -64.0 NaN NaN 0.0 -64.0
6 N low 2021-06-23 16860 2.0 NaN -31.0 NaN NaN 33.0 -31.0
7 N low 2021-07-24 16860 2.0 NaN 0.0 NaN NaN 64.0 0.0
8 Y low 2021-08-25 16860 3.0 250.0 -57.0 NaN NaN NaN 0.0
9 N low 2021-09-23 16860 3.0 NaN -28.0 NaN NaN NaN 29.0
10 N low 2021-10-21 16860 3.0 NaN 0.0 NaN NaN NaN 57.0
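For what it's worth, the same combine_first idea can be written without the try/except by masking with Series.where, so that a missing neighbour group simply leaves NaN instead of raising. A sketch on toy data (illustrative frame, assuming the 'tdelta' and 'tdelta reverse' columns have already been computed as above):

```python
import pandas as pd

# Toy frame with three groups and precomputed forward/backward counts.
df = pd.DataFrame({
    'group':          [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
    'tdelta':         [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    'tdelta reverse': [-2.0, -1.0, -2.0, -1.0, -1.0, 0.0],
})

# For each group k: take 'tdelta' on group-k rows, fall back to
# 'tdelta reverse' on group-(k-1) rows; all other rows stay NaN.
for k in sorted(df['group'].unique()):
    fwd = df['tdelta'].where(df['group'] == k)
    back = df['tdelta reverse'].where(df['group'] == k - 1)
    df[f'tdelta{int(k)}'] = fwd.combine_first(back)

print(df[['tdelta0', 'tdelta1', 'tdelta2']])
```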