I have a pandas DataFrame that contains only date columns:
     year month day     date
0    2019    01  17  01-2019
1    2019    01  17  01-2019
2    2019    01  18  01-2019
3    2019    01  19  01-2019
4    2019    01  20  01-2019
...
336  2021    12  31  12-2021
337  2022    03  02  03-2022
338  2022    03  05  03-2022
339  2022    05  25  05-2022
340  2022    06  09  06-2022
Using groupby I get a count for the number of monthly occurrences as seen below:
gh = df.groupby([df['year'], df['month']]).agg({'count'}).reset_index()
print(gh)
>>>  year month   day  date
               count count
0   2019    01    10    10
1   2019    02     6     6
2   2019    03     8     8
3   2019    04     2     2
4   2019    05     7     7
5   2019    06     8     8
6   2019    07    10    10
7   2019    08     6     6
8   2019    09     6     6
9   2019    10     6     6
10  2019    11     4     4
11  2019    12     3     3
12  2020    01    12    12
13  2020    02     6     6
14  2020    03    22    22
15  2020    04    17    17
16  2020    05     4     4
17  2020    06     9     9
18  2020    07     6     6
19  2020    08     8     8
20  2020    09     4     4
21  2020    10     7     7
22  2020    11    15    15
23  2020    12    15    15
24  2021    01    18    18
25  2021    02    22    22
26  2021    03    15    15
27  2021    04    19    19
28  2021    05    16    16
29  2021    06    23    23
30  2021    07    19    19
31  2021    08     1     1
32  2021    12     3     3
33  2022    03     2     2
34  2022    05     1     1
35  2022    06     1     1
(The date column is only used for plotting.) My issue is that from 09-2021 onward some months have no rows at all, so they are missing from gh. I want gh to include those months with a count of 0, so that the missing rows look something like:
31  2021    08     1     1
32  2021    09     0     0
33  2021    10     0     0
34  2021    11     0     0
35  2021    12     0     0
...
All the way through 06-2022.
I get errors when I try gh.reindex(pd.period_range(gh.index[0], gh.index[-1], freq='M')) from this solution, as the monthly indexes repeat. I also suspect that having only date columns, with no separate value column, is part of the problem. I am trying to plot every month from 01/2019 through 06/2022, including the zero counts for months in late 2021 and early 2022. How do I make this work?
I know there are a lot of threads similar to this, but they all use actual data counts, and not just date counts.
Edit: Here is the output from df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 341 entries, 0 to 340
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   year    341 non-null    object
 1   month   341 non-null    object
 2   day     341 non-null    object
 3   date    341 non-null    object
dtypes: object(4)
memory usage: 10.8+ KB
Answer
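Build a MultiIndex covering every year/month combination, then reindex the grouped counts against it. Because year and month are stored as strings (see the df.info() output in the question), convert them to integers first so the labels match range(1, 13):
import pandas as pd

# year and month are object (string) columns; cast them so they match the integer MultiIndex
df = df.astype({'year': int, 'month': int})
mi = pd.MultiIndex.from_product([df['year'].unique(), range(1, 13)],
                                names=['year', 'month'])
gh = (df.groupby(['year', 'month']).agg(['count'])
        .reindex(mi, fill_value=0).reset_index()
        .droplevel(level=1, axis=1))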
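Output (apparently computed from only the ten rows shown in the question, which would explain why 2020 is absent and the counts are small):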
>>> gh
    year  month  day  date
0   2019      1    5     5
1   2019      2    0     0
2   2019      3    0     0
3   2019      4    0     0
4   2019      5    0     0
5   2019      6    0     0
6   2019      7    0     0
7   2019      8    0     0
8   2019      9    0     0
9   2019     10    0     0
10  2019     11    0     0
11  2019     12    0     0
12  2021      1    0     0
13  2021      2    0     0
14  2021      3    0     0
15  2021      4    0     0
16  2021      5    0     0
17  2021      6    0     0
18  2021      7    0     0
19  2021      8    0     0
20  2021      9    0     0
21  2021     10    0     0
22  2021     11    0     0
23  2021     12    1     1
24  2022      1    0     0
25  2022      2    0     0
26  2022      3    2     2
27  2022      4    0     0
28  2022      5    1     1
29  2022      6    1     1
30  2022      7    0     0
31  2022      8    0     0
32  2022      9    0     0
33  2022     10    0     0
34  2022     11    0     0
35  2022     12    0     0
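Note that the full year-by-month product also adds rows after the last observed month (07-2022 through 12-2022 above). If you want exactly the span from the first to the last observed month, one alternative is to count on monthly periods and reindex against pd.period_range. The following is a minimal sketch, not the method from the answer above; the small df is made-up sample data in the question's layout (string columns), and the count column name is my own choice:

```python
import pandas as pd

# Hypothetical sample in the question's layout: every column holds strings.
df = pd.DataFrame({
    "year":  ["2019", "2019", "2021", "2022", "2022", "2022"],
    "month": ["01", "01", "12", "03", "03", "06"],
    "day":   ["17", "18", "31", "02", "05", "09"],
})

# One monthly Period per row, built from the "YYYY-MM" strings.
periods = pd.PeriodIndex(df["year"] + "-" + df["month"], freq="M")

# Every month between the first and the last observation, inclusive.
full_range = pd.period_range(periods.min(), periods.max(), freq="M")

# Count rows per month; months with no rows get 0 instead of being dropped.
counts = periods.value_counts().reindex(full_range, fill_value=0)

gh = pd.DataFrame({
    "year": full_range.year,
    "month": full_range.month,
    "count": counts.to_numpy(),
    "date": full_range.strftime("%m-%Y"),  # same MM-YYYY label used for plotting
})
```

Because the index is built from actual periods rather than a year-by-month product, gh starts at the first observed month and stops at the last one, and there are no repeated monthly labels for reindex to complain about.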