I have time series of y per store and product stored in the following dataframe:
ds store product y 0 2016-01-01 a salt 2 1 2016-01-02 a salt 5 2 2016-01-04 a salt 3 3 2016-01-05 a salt 3 4 2016-01-06 a salt 4 5 2016-01-07 a salt 3 6 2016-01-01 b pepper 2 7 2016-01-02 b pepper 2 8 2016-01-03 b pepper 1 9 2016-01-04 b pepper 2 10 2016-01-06 b pepper 4 11 2016-01-07 b pepper 2
I would like to output all the missing dates per store, product and return the following result:
ds store product 0 2016-01-03 a salt 1 2016-01-05 b pepper
Advertisement
Answer
Use groupby_resample:
# Assuming ds is datetime64 else use:
# df['ds'] = pd.to_datetime(df['ds'])
out = df.groupby(['store', 'product']).resample('D', on='ds')['y'] 
        .first().loc[lambda x: x.isna()].index.to_frame(index=False)
print(out)
# Output
  store product         ds
0     a    salt 2016-01-03
1     b  pepper 2016-01-05
Details:
>>> df.groupby(['store', 'product']).resample('D', on='ds')['y'].first()
store  product  ds        
a      salt     2016-01-01    2.0
                2016-01-02    5.0
                2016-01-03    NaN  # <- missing value == missing date
                2016-01-04    3.0
                2016-01-05    3.0
                2016-01-06    4.0
                2016-01-07    3.0
b      pepper   2016-01-01    2.0
                2016-01-02    2.0
                2016-01-03    1.0
                2016-01-04    2.0
                2016-01-05    NaN  # <- missing value == missing date
                2016-01-06    4.0
                2016-01-07    2.0
Name: y, dtype: float64
Update: If you have a date in the ds column without a value in the y column, just use fillna({'y': 0}) before groupby_resample