I am trying to aggregate a dataframe using pandas.pivot_table and find it behaves differently when multiple lines are aggregated on a categorical series.
Code from this issue helps explain (though the issue is different from mine).
Setting up a dataframe with a categorical column:
import pandas as pd
stations = ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
'Town Hall Station', 'Central Station', 'Circular Quay Station',
'Martin Place Station', 'Museum Station', 'St James Station',
'Bondi Junction Station', 'North Sydney Station']
df1 = pd.DataFrame({'Station': ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
'Kings Cross Station', 'Newtown Station', 'Parramatta Station',
'Kings Cross Station', 'Newtown Station', 'Parramatta Station'],
'Date': pd.DatetimeIndex(['1/1/2017', '1/1/2017', '1/1/2017',
'2/1/2017', '2/1/2017', '2/1/2017',
'3/1/2017', '3/1/2017', '3/1/2017',]),
'Exit': range(0, 9)})
df1.Station = df1.Station.astype(pd.CategoricalDtype(stations, ordered=True))
If I pivot the dataframe with
df1.pivot_table(index = 'Date', columns= 'Station', values = 'Exit',
dropna=False, observed=False, aggfunc=len, fill_value=0)
I get a dataframe with all the categories with the stations that had no data in the dataframe as columns filled with 0s, which is what I want:
Station Kings Cross Station North Sydney Station
Date
2017-01-01 1 0
2017-02-01 1 0
2017-03-01 1 0
[3 rows x 11 columns]
However, if I add some rows with repeated values:
df2 = pd.DataFrame({'Station': ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
'Kings Cross Station', 'Newtown Station', 'Parramatta Station'],
'Date': pd.DatetimeIndex(['1/1/2017', '1/1/2017', '1/1/2017',
'2/1/2017', '2/1/2017', '2/1/2017']),
'Exit': range(0, 6)})
df3 = pd.concat([df1, df2])
… and pivot
df3.pivot_table(index = 'Date', columns= 'Station', values = 'Exit',
dropna=False, observed=False, aggfunc=len, fill_value=0)
Now the stations not represented in df3
do not apear in the pivot:
Station Kings Cross Station Newtown Station Parramatta Station
Date
2017-01-01 2 2 2
2017-02-01 2 2 2
2017-03-01 1 1 1
I can add the missing categories by iterating over the categories and add a column of 0s if not in the pivot table, but it should be done with pandas, surely?!
I hope that is clear, first question! Thank you
Advertisement
Answer
df1.dtypes
Station category
Date datetime64[ns]
Exit int64
dtype: object
df2.dtypes
Station object
Date datetime64[ns]
Exit int64
dtype: object
This is because df2.Station
is not a Category yet. You must apply the same transformation as you did for df1
to df2
for the pivot to work.
Adding this line before your concat should resolve the problem:
df2.Station = df2.Station.astype(pd.CategoricalDtype(stations, ordered=True))