I have a dataframe (mydf) with dates for each group in monthly frequency like below:
Dt Id Sales 2021-03-01 B 2 2021-04-01 B 42 2021-05-01 B 20 2021-06-01 B 4 2020-10-01 A 47 2020-11-01 A 67 2020-12-01 A 46
I want to fill the dt for each group till the Maximum date within the date column starting from the date of Id while simultaneously filling in 0 for the Sales column. So each group starts at their own start date but ends at the same end date.
So for e.g. ID=A will start from 2020-10-01 and go all the way to 2021-06-03 and the value for the filled dates will be 0.
So the output will be
Dt Id Sales 2021-03-01 B 2 2021-04-01 B 42 2021-05-01 B 20 2021-06-01 B 4 2020-10-01 A 46 2020-11-01 A 47 2020-12-01 A 67 2021-01-01 A 0 2021-02-01 A 0 2021-03-01 A 0 2021-04-01 A 0 2021-05-01 A 0 2021-06-01 A 0
I have tried reindex but instead of adding daterange manually I want to use the dates in the groups.
My code is :
f = lambda x: x.reindex(pd.date_range('2020-10-01', '2021-06-01', freq='MS', name='Dt')) mydf = mydf.set_index('Dt').groupby('Id').apply(f).drop('Id', axis=1).fillna(0) mydf = mydf.reset_index()
Advertisement
Answer
Let’s try:
- Getting the minimum value per group using
groupby.min
- Add a new column to the aggregated mins called
max
which stores the maximum values from the frame usingSeries.max
onDt
- Create individual
date_range
per group based on themin
andmax
values Series.explode
into rows to have a DataFrame that represents the new index.- Create a
MultiIndex.from_frame
toreindex
the DataFrame with. reindex
withmidx
and set thefillvalue=0
# Get Min Per Group dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min') # Get max from Frame dates['max'] = mydf['Dt'].max() # Create MultiIndex with separate Date ranges per Group midx = pd.MultiIndex.from_frame( dates.apply( lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1 ).explode().reset_index(name='Dt')[['Dt', 'Id']] ) # Reindex mydf = ( mydf.set_index(['Dt', 'Id']) .reindex(midx, fill_value=0) .reset_index() )
mydf
:
Dt Id Sales 0 2020-10-01 A 47 1 2020-11-01 A 67 2 2020-12-01 A 46 3 2021-01-01 A 0 4 2021-02-01 A 0 5 2021-03-01 A 0 6 2021-04-01 A 0 7 2021-05-01 A 0 8 2021-06-01 A 0 9 2021-03-01 B 2 10 2021-04-01 B 42 11 2021-05-01 B 20 12 2021-06-01 B 4
DataFrame:
import pandas as pd mydf = pd.DataFrame({ 'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01', '2020-10-01', '2020-11-01', '2020-12-01'], 'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'], 'Sales': [2, 42, 20, 4, 47, 67, 46] }) mydf['Dt'] = pd.to_datetime(mydf['Dt'])