Suppose I have the following df N03_zero (date_code is already datetime):
item_code       date_code
8028558104973   2022-01-01
8028558104973   2022-01-02
8028558104973   2022-01-03
8028558104973   2022-01-06
8028558104973   2022-01-07
7622300443269   2022-01-01
7622300443269   2022-01-10
7622300443269   2022-01-11
513082          2022-01-01
513082          2022-01-02
513082          2022-01-03
There are millions of rows, each assigning a date_code to some item_code.
I am trying to get the number of days in each continuous period for each item_code. None of the other similar questions helped me.
The expected df should be:
item_code       continuous_days
8028558104973   3
8028558104973   2
7622300443269   1
7622300443269   2
513082          3
Once the sequence of days breaks, the days in that sequence should be counted and the count should start again.
The aim is to then be able to get a dataframe with the count, min, max, and mean of these periods for each item_code.
Like this:
item_code       no. periods   min   max   mean
8028558104973   2             2     3     2.5
7622300443269   2             1     2     1.5
513082          1             3     3     3
Any suggestions?
Answer
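To find runs of consecutive days, take the difference between successive dates with Series.diff, convert it to days with Series.dt.days, flag the rows where it is not equal to 1 with Series.ne, and take the cumulative sum of those flags with Series.cumsum to get an identifier for each run. Then group by item_code and that identifier, count the rows with GroupBy.size, remove the second index level with Series.droplevel, and create a DataFrame with Series.reset_index. This relies on the rows being ordered by date_code within each item_code, as in the sample: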
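import pandas as pd

# Not needed if date_code is already datetime.
df['date_code'] = pd.to_datetime(df['date_code'])
# The second key increases by 1 each time the gap to the previous row is not exactly 1 day.
df1 = (df.groupby(['item_code', df['date_code'].diff().dt.days.ne(1).cumsum()], sort=False)
         .size()
         .droplevel(1)
         .reset_index(name='continuous_days'))
print(df1)
       item_code  continuous_days
0  8028558104973                3
1  8028558104973                2
2  7622300443269                1
3  7622300443269                2
4         513082                3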
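To make the diff/ne/cumsum chain easier to follow, here is a small sketch that rebuilds the sample from the question and shows each intermediate step as its own column. The names gap, new_run, run_id and steps are only illustrative and are not part of the answer's code.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "item_code": [8028558104973] * 5 + [7622300443269] * 3 + [513082] * 3,
    "date_code": pd.to_datetime([
        "2022-01-01", "2022-01-02", "2022-01-03", "2022-01-06", "2022-01-07",
        "2022-01-01", "2022-01-10", "2022-01-11",
        "2022-01-01", "2022-01-02", "2022-01-03",
    ]),
})

# Gap in days to the previous row (NaN for the very first row).
gap = df["date_code"].diff().dt.days
# True wherever a new run starts, i.e. the gap is anything other than 1 day.
new_run = gap.ne(1)
# Running count of run starts: all rows of one run share the same id.
run_id = new_run.cumsum()

# Illustrative view of the intermediate columns next to the data.
steps = df.assign(gap=gap, new_run=new_run, run_id=run_id)
print(steps)
```

The first row of each item_code also gets new_run=True here, because the gap to the last date of the previous item is not 1. Even if it happened to be exactly 1, grouping by item_code together with run_id would still keep the two items apart.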
Then aggregate the run lengths per item_code using named aggregations in GroupBy.agg:
df2 = (df1.groupby('item_code', sort=False, as_index=False)
          .agg(**{'no. periods': ('continuous_days', 'size'),
                  'min': ('continuous_days', 'min'),
                  'max': ('continuous_days', 'max'),
                  'mean': ('continuous_days', 'mean')}))
print(df2)
       item_code  no. periods  min  max  mean
0  8028558104973            2    2    3   2.5
1  7622300443269            2    1    2   1.5
2         513082            1    3    3   3.0
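If the real data is not guaranteed to be ordered, or may contain repeated dates for an item, the two steps could be wrapped roughly as follows. This is only a sketch under those assumptions: period_stats is a hypothetical helper name, and the sorting and de-duplication are additions that the code above does not perform.

```python
import pandas as pd

def period_stats(df):
    """Hypothetical helper: statistics of consecutive-day run lengths per item_code."""
    # Assumption: input may be unsorted and may repeat a date for an item,
    # so drop repeated (item_code, date_code) pairs and sort first.
    df = (df.drop_duplicates(["item_code", "date_code"])
            .sort_values(["item_code", "date_code"]))
    # Same idea as above: a new run id whenever the gap is not exactly 1 day.
    run_id = df["date_code"].diff().dt.days.ne(1).cumsum()
    runs = (df.groupby(["item_code", run_id], sort=False)
              .size()
              .droplevel(1)
              .reset_index(name="continuous_days"))
    # Named aggregations, one output column per statistic.
    return (runs.groupby("item_code", sort=False, as_index=False)
                .agg(**{"no. periods": ("continuous_days", "size"),
                        "min": ("continuous_days", "min"),
                        "max": ("continuous_days", "max"),
                        "mean": ("continuous_days", "mean")}))
```

Calling period_stats(N03_zero) should then give the same statistics as df2, except that the rows come out ordered by item_code rather than by first appearance. Sorting millions of rows has a cost, so it can be skipped when the data is already ordered.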