I have a dataframe of houses, one row per house, that looks like this:
data = [ ['Oxford', 2016, True], ['Oxford', 2016, True], ['Oxford', 2018, False], ['Cambridge', 2016, False], ['Cambridge', 2016, True] ] df = pd.DataFrame(data, columns=['town', 'year', 'is_detached'])
town year is_detached 0 Oxford 2016 True 1 Oxford 2016 True 2 Oxford 2018 False 3 Cambridge 2016 False 4 Cambridge 2016 True
And I want to end up with a table that looks like this:
town total_houses_2016 total_houses_2018 is_detached_2016 is_detached_2018 0 Oxford 2 1 2 0 1 Cambridge 2 0 1 0
Currently I’m doing two separate groupby calls, and then joining them together:
by_town_totals = df.groupby([df.town, df.year]) .size() .reset_index() .pivot(index=["town"], columns="year", values=0).fillna(0) .add_prefix('total_houses_') by_town_detached = df.groupby([df.town, df.year]) .is_detached.sum().reset_index() .pivot(index=["town"], columns="year", values="is_detached").fillna(0) .add_prefix('is_detached_') by_town = pd.concat([by_town_totals, by_town_detached], axis=1).reset_index()
Is there a way I could do this with a single groupby?
Advertisement
Answer
df.year = df.year.astype(str) df = df.pivot_table(index='town', columns='year', values='is_detached', aggfunc=['size', 'sum'], fill_value=0) df.columns = (df.columns.to_flat_index() .str.join('_') .str.replace('size','total_houses') .str.replace('sum', 'is_detached')) print(df.reset_index())
Output:
town total_houses_2016 total_houses_2018 is_detached_2016 is_detached_2018 0 Cambridge 2 0 1 0 1 Oxford 2 1 2 0