Suppose I have next df N03_zero
(date_code
is already datetime
):
item_code date_code 8028558104973 2022-01-01 8028558104973 2022-01-02 8028558104973 2022-01-03 8028558104973 2022-01-06 8028558104973 2022-01-07 7622300443269 2022-01-01 7622300443269 2022-01-10 7622300443269 2022-01-11 513082 2022-01-01 513082 2022-01-02 513082 2022-01-03
Millions of rows with date_code
assigned to some item_code
.
I am trying to get the number of days of each continuous period for each item_code
, all other similar questions doesn’t helped me.
The expected df should be:
item_code continuous_days 8028558104973 3 8028558104973 2 7622300443269 1 7622300443269 2 513082 3
Once days sequence breaks, it should count days in this sequence and then start to count again.
The aim is, to able to get then the dataframe with count
, min
, max
, and mean
for each item_code
.
Like this:
item_code no. periods min max mean 8028558104973 2 2 3 2.5 7622300443269 2 1 2 1.5 513082 1 3 3 3
Any suggestions?
Advertisement
Answer
For consecutive days compare difference by Series.diff
in days by Series.dt.days
for not equal 1
by Series.ne
with cumulative sum by Series.cumsum
and then use GroupBy.size
, remove second level by DataFrame.droplevel
and create DataFrame
:
df['date_code'] = pd.to_datetime(df['date_code']) df1= (df.groupby(['item_code',df['date_code'].diff().dt.days.ne(1).cumsum()], sort=False) .size() .droplevel(1) .reset_index(name='continuous_days')) print (df1) item_code continuous_days 0 8028558104973 3 1 8028558104973 2 2 7622300443269 1 3 7622300443269 2 4 513082 3
And then aggregate values by named aggregations by GroupBy.agg
:
df2 = (df1.groupby('item_code', sort=False, as_index=False) .agg(**{'no. periods': ('continuous_days','size'), 'min':('continuous_days','min'), 'max':('continuous_days','max'), 'mean':('continuous_days','mean')})) print (df2) item_code no. periods min max mean 0 8028558104973 2 2 3 2.5 1 7622300443269 2 1 2 1.5 2 513082 1 3 3 3.0