Skip to content
Advertisement

Pandas fill missing dates and values simultaneously for each group

I have a dataframe (mydf) with dates for each group in monthly frequency like below:

Dt          Id  Sales
2021-03-01  B   2
2021-04-01  B   42
2021-05-01  B   20
2021-06-01  B   4
2020-10-01  A   47
2020-11-01  A   67
2020-12-01  A   46

I want to fill the dt for each group till the Maximum date within the date column starting from the date of Id while simultaneously filling in 0 for the Sales column. So each group starts at their own start date but ends at the same end date.

So for e.g. ID=A will start from 2020-10-01 and go all the way to 2021-06-03 and the value for the filled dates will be 0.

So the output will be

Dt          Id  Sales
2021-03-01  B   2
2021-04-01  B   42
2021-05-01  B   20
2021-06-01  B   4
2020-10-01  A   46
2020-11-01  A   47
2020-12-01  A   67
2021-01-01  A   0
2021-02-01  A   0
2021-03-01  A   0
2021-04-01  A   0
2021-05-01  A   0
2021-06-01  A   0

I have tried reindex but instead of adding daterange manually I want to use the dates in the groups.

My code is :

f = lambda x: x.reindex(pd.date_range('2020-10-01', '2021-06-01', freq='MS', name='Dt'))
mydf = mydf.set_index('Dt').groupby('Id').apply(f).drop('Id', axis=1).fillna(0)
mydf = mydf.reset_index()

Advertisement

Answer

Let’s try:

  1. Getting the minimum value per group using groupby.min
  2. Add a new column to the aggregated mins called max which stores the maximum values from the frame using Series.max on Dt
  3. Create individual date_range per group based on the min and max values
  4. Series.explode into rows to have a DataFrame that represents the new index.
  5. Create a MultiIndex.from_frame to reindex the DataFrame with.
  6. reindex with midx and set the fillvalue=0
# Get Min Per Group
dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min')
# Get max from Frame
dates['max'] = mydf['Dt'].max()

# Create MultiIndex with separate Date ranges per Group
midx = pd.MultiIndex.from_frame(
    dates.apply(
        lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1
    ).explode().reset_index(name='Dt')[['Dt', 'Id']]
)

# Reindex
mydf = (
    mydf.set_index(['Dt', 'Id'])
        .reindex(midx, fill_value=0)
        .reset_index()
)

mydf:

           Dt Id  Sales
0  2020-10-01  A     47
1  2020-11-01  A     67
2  2020-12-01  A     46
3  2021-01-01  A      0
4  2021-02-01  A      0
5  2021-03-01  A      0
6  2021-04-01  A      0
7  2021-05-01  A      0
8  2021-06-01  A      0
9  2021-03-01  B      2
10 2021-04-01  B     42
11 2021-05-01  B     20
12 2021-06-01  B      4

DataFrame:

import pandas as pd

mydf = pd.DataFrame({
    'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01', '2020-10-01',
           '2020-11-01', '2020-12-01'],
    'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Sales': [2, 42, 20, 4, 47, 67, 46]
})
mydf['Dt'] = pd.to_datetime(mydf['Dt'])
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement