Skip to content
Advertisement

Count number of days in each continuous period pandas

Suppose I have next df N03_zero (date_code is already datetime):

item_code       date_code
8028558104973   2022-01-01
8028558104973   2022-01-02
8028558104973   2022-01-03
8028558104973   2022-01-06
8028558104973   2022-01-07
7622300443269   2022-01-01
7622300443269   2022-01-10
7622300443269   2022-01-11
513082          2022-01-01
513082          2022-01-02
513082          2022-01-03

Millions of rows with date_code assigned to some item_code. I am trying to get the number of days of each continuous period for each item_code, all other similar questions doesn’t helped me. The expected df should be:

item_code       continuous_days
8028558104973   3
8028558104973   2
7622300443269   1
7622300443269   2
513082          3

Once days sequence breaks, it should count days in this sequence and then start to count again. The aim is, to able to get then the dataframe with count, min, max, and mean for each item_code.

Like this:

item_code       no. periods   min   max   mean
8028558104973   2             2     3     2.5
7622300443269   2             1     2     1.5
513082          1             3     3     3

Any suggestions?

Advertisement

Answer

For consecutive days compare difference by Series.diff in days by Series.dt.days for not equal 1 by Series.ne with cumulative sum by Series.cumsum and then use GroupBy.size, remove second level by DataFrame.droplevel and create DataFrame:

df['date_code'] = pd.to_datetime(df['date_code'])

df1= (df.groupby(['item_code',df['date_code'].diff().dt.days.ne(1).cumsum()], sort=False)
        .size()
        .droplevel(1)
        .reset_index(name='continuous_days'))
print (df1)
       item_code  continuous_days
0  8028558104973                3
1  8028558104973                2
2  7622300443269                1
3  7622300443269                2
4         513082                3

And then aggregate values by named aggregations by GroupBy.agg:

df2 = (df1.groupby('item_code', sort=False, as_index=False)
          .agg(**{'no. periods': ('continuous_days','size'),
                 'min':('continuous_days','min'),
                 'max':('continuous_days','max'),
                 'mean':('continuous_days','mean')}))
print (df2)
       item_code  no. periods  min  max  mean
0  8028558104973            2    2    3   2.5
1  7622300443269            2    1    2   1.5
2         513082            1    3    3   3.0
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement