Pandas pivot_table aggfunc ignores categories if more than one line of data is being aggregated

I am trying to aggregate a dataframe using pandas.pivot_table and find it behaves differently when multiple lines are aggregated on a categorical series.

Code from this issue helps explain (though the issue is different from mine).

Setting up a dataframe with a categorical column:

import pandas as pd

stations = ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
            'Town Hall Station', 'Central Station', 'Circular Quay Station', 
            'Martin Place Station', 'Museum Station', 'St James Station', 
            'Bondi Junction Station', 'North Sydney Station']

df1 = pd.DataFrame({'Station': ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                                'Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                                'Kings Cross Station', 'Newtown Station', 'Parramatta Station'],
                    'Date': pd.DatetimeIndex(['1/1/2017', '1/1/2017', '1/1/2017',
                                             '2/1/2017', '2/1/2017', '2/1/2017',
                                             '3/1/2017', '3/1/2017', '3/1/2017',]),
                    'Exit': range(0, 9)})

df1.Station = df1.Station.astype(pd.CategoricalDtype(stations, ordered=True))

JavaScript
​x
 
import pandas as pd
​
stations = ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
            'Town Hall Station', 'Central Station', 'Circular Quay Station', 
            'Martin Place Station', 'Museum Station', 'St James Station', 
            'Bondi Junction Station', 'North Sydney Station']
​
df1 = pd.DataFrame({'Station': ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                                'Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                                'Kings Cross Station', 'Newtown Station', 'Parramatta Station'],
                    'Date': pd.DatetimeIndex(['1/1/2017', '1/1/2017', '1/1/2017',
                                             '2/1/2017', '2/1/2017', '2/1/2017',
                                             '3/1/2017', '3/1/2017', '3/1/2017',]),
                    'Exit': range(0, 9)})
​
df1.Station = df1.Station.astype(pd.CategoricalDtype(stations, ordered=True))
​

If I pivot the dataframe with

df1.pivot_table(index = 'Date', columns= 'Station', values = 'Exit', 
                dropna=False, observed=False, aggfunc=len, fill_value=0)

JavaScript
 
df1.pivot_table(index = 'Date', columns= 'Station', values = 'Exit', 
                dropna=False, observed=False, aggfunc=len, fill_value=0)
​

I get a dataframe with all the categories with the stations that had no data in the dataframe as columns filled with 0s, which is what I want:

Station     Kings Cross Station  ...  North Sydney Station
Date                             ...                      
2017-01-01                    1  ...                     0
2017-02-01                    1  ...                     0
2017-03-01                    1  ...                     0
[3 rows x 11 columns]

JavaScript
 
Station     Kings Cross Station  ...  North Sydney Station
Date                             ...                      
2017-01-01                    1  ...                     0
2017-02-01                    1  ...                     0
2017-03-01                    1  ...                     0
[3 rows x 11 columns]
​

However, if I add some rows with repeated values:

df2 = pd.DataFrame({'Station': ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                                'Kings Cross Station', 'Newtown Station', 'Parramatta Station'],
                    'Date': pd.DatetimeIndex(['1/1/2017', '1/1/2017', '1/1/2017',
                                             '2/1/2017', '2/1/2017', '2/1/2017']),
                    'Exit': range(0, 6)})
df3 = pd.concat([df1, df2])

JavaScript
 
df2 = pd.DataFrame({'Station': ['Kings Cross Station', 'Newtown Station', 'Parramatta Station',
                                'Kings Cross Station', 'Newtown Station', 'Parramatta Station'],
                    'Date': pd.DatetimeIndex(['1/1/2017', '1/1/2017', '1/1/2017',
                                             '2/1/2017', '2/1/2017', '2/1/2017']),
                    'Exit': range(0, 6)})
df3 = pd.concat([df1, df2])
​

… and pivot

df3.pivot_table(index = 'Date', columns= 'Station', values = 'Exit', 
                dropna=False, observed=False, aggfunc=len, fill_value=0)

JavaScript
 
df3.pivot_table(index = 'Date', columns= 'Station', values = 'Exit', 
                dropna=False, observed=False, aggfunc=len, fill_value=0)
​

Now the stations not represented in df3 do not apear in the pivot:

Station     Kings Cross Station  Newtown Station  Parramatta Station
Date                                                                
2017-01-01                    2                2                   2
2017-02-01                    2                2                   2
2017-03-01                    1                1                   1

JavaScript
 
Station     Kings Cross Station  Newtown Station  Parramatta Station
Date                                                                
2017-01-01                    2                2                   2
2017-02-01                    2                2                   2
2017-03-01                    1                1                   1
​

I can add the missing categories by iterating over the categories and add a column of 0s if not in the pivot table, but it should be done with pandas, surely?!

I hope that is clear, first question! Thank you

Answer

df1.dtypes
Station    category      
Date       datetime64[ns]
Exit       int64         
dtype: object

df2.dtypes
Station    object        
Date       datetime64[ns]
Exit       int64         
dtype: object

JavaScript
 
df1.dtypes
Station    category      
Date       datetime64[ns]
Exit       int64         
dtype: object
​
df2.dtypes
Station    object        
Date       datetime64[ns]
Exit       int64         
dtype: object
​

This is because df2.Station is not a Category yet. You must apply the same transformation as you did for df1 to df2 for the pivot to work.

Adding this line before your concat should resolve the problem:

df2.Station = df2.Station.astype(pd.CategoricalDtype(stations, ordered=True))

JavaScript
 
df2.Station = df2.Station.astype(pd.CategoricalDtype(stations, ordered=True))
​

Advertisement

Answer