How can I pivot a dataframe?

Question

What is pivot? How do I pivot? Long format to wide format? I've seen a lot of questions that ask about pivot tables, even if they don't know it. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting... But I'm going to give it a go. …

Accepted Answer
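
The snippets in this answer operate on a DataFrame df with columns key, row, item, col, val0 and val1, and on a second frame df2 with columns A and B. For readers who want to run them, here is a minimal stand-in. The df below is an assumption: it only reuses the column names and label patterns, and its values are randomly generated, so the numbers will not match the output printed in this answer. Because the keys are random, approaches that need unique keys (pivot, set_index + unstack) may raise the duplicate-entries error on it. df2 is rebuilt from the table printed under Question 10, minus the count column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`: same column names as the answer uses,
# but randomly generated values (NOT the original data).
rng = np.random.default_rng(0)
n = 20
df = pd.DataFrame({
    'key': rng.choice(['key0', 'key1', 'key2'], size=n),
    'row': rng.choice(['row0', 'row2', 'row3', 'row4'], size=n),
    'item': rng.choice(['item0', 'item1', 'item2'], size=n),
    'col': rng.choice(['col0', 'col1', 'col2', 'col3', 'col4'], size=n),
    'val0': rng.random(n).round(2),
    'val1': rng.random(n).round(2),
})

# `df2` as printed under Question 10, before the `count` column is inserted.
df2 = pd.DataFrame({
    'A': list('aaaabbbc'),
    'B': [0, 11, 2, 11, 10, 10, 14, 7],
})
```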

Here is a list of idioms we can use to pivot:

pd.DataFrame.pivot_table
A glorified version of groupby with a more intuitive API. For many people, this is the preferred approach, and it is the approach the developers intended. Specify the row levels, column levels, values to be aggregated, and function(s) to perform the aggregations.

pd.DataFrame.groupby + pd.DataFrame.unstack
A good general approach for doing just about any type of pivot. You specify all columns that will constitute the pivoted row levels and column levels in one groupby. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.

pd.DataFrame.set_index + pd.DataFrame.unstack
Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys. Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If the combined index and column keys contain duplicates, this method will fail.

pd.DataFrame.pivot
Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. In older pandas versions it only takes scalar values for index, columns, and values; since pandas 1.1 it also accepts lists of column names. Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate, and if the (row, column) pairs are not unique, this method will fail.

pd.crosstab
This is a specialized version of pivot_table and, in its purest form, is the most intuitive way to perform several tasks.

pd.factorize + np.bincount
This is a highly advanced technique that is very obscure but is very fast.
It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.

pd.get_dummies + pd.DataFrame.dot
I use this for cleverly performing cross tabulation.

See also: Reshaping and pivot tables (pandas User Guide)

Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys on which it is being asked to pivot. For example, consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

    df.duplicated(['row', 'col']).any()

    True

So when I pivot using

    df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

    df.set_index(['row', 'col'])['val0'].unstack()

Examples

What I'm going to do for each subsequent question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

Questions 2 and 3

How do I pivot df such that the col values are columns, row values are the index, and mean of val0 are the values?

pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        aggfunc='mean')

    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605    NaN  0.860  0.65
    row2  0.13    NaN  0.395  0.500  0.25
    row3   NaN  0.310    NaN  0.545   NaN
    row4   NaN  0.100  0.395  0.760  0.24

aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

How do I make it so that missing values are 0?

pd.DataFrame.pivot_table

fill_value is not set by default. I tend to set it appropriately.
In this case I set it to 0.

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc='mean')

    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)

pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='mean').fillna(0)

Question 4

Can I get something other than mean, like maybe sum?

pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc='sum')

    col   col0  col1  col2  col3  col4
    row
    row0  0.77  1.21  0.00  0.86  0.65
    row2  0.13  0.00  0.79  0.50  0.50
    row3  0.00  0.31  0.00  1.09  0.00
    row4  0.00  0.10  0.79  1.52  0.24

pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)

pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc='sum').fillna(0)

Question 5

Can I do more than one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass a list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions.
groupby.agg would also have taken the same callables we passed to the others, but the string function names are often more efficient.

pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc=[np.size, np.mean])

         size                      mean
    col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
    row
    row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
    row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
    row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
    row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)

pd.crosstab

    pd.crosstab(
        index=df['row'], columns=df['col'],
        values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')

Question 6

Can I aggregate over multiple value columns?

pd.DataFrame.pivot_table

We pass values=['val0', 'val1'], but we could've left that off completely.

    df.pivot_table(
        values=['val0', 'val1'], index='row', columns='col',
        fill_value=0, aggfunc='mean')

          val0                             val1
    col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
    row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
    row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
    row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46

pd.DataFrame.groupby

    # select several value columns with a list, not a bare tuple
    df.groupby(['row', 'col'])[['val0', 'val1']].mean().unstack(fill_value=0)

Question 7

Can I subdivide by multiple columns?

pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index='row', columns=['item', 'col'],
        fill_value=0, aggfunc='mean')

    item item0             item1                         item2
    col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
    row
    row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
    row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
    row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
    row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00

pd.DataFrame.groupby

    df.groupby(
        ['row', 'item', 'col'])['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)

Question 8

Can I subdivide by multiple columns?

pd.DataFrame.pivot_table

    df.pivot_table(
        values='val0', index=['key', 'row'], columns=['item', 'col'],
        fill_value=0, aggfunc='mean')

    item      item0             item1                         item2
    col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
    key  row
    key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
         row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
         row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
         row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
    key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
         row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
         row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
         row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
    key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
         row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
         row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00

pd.DataFrame.groupby

    df.groupby(
        ['key', 'row', 'item', 'col'])['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(axis=1)

pd.DataFrame.set_index

This works here because the set of keys is unique for both rows and columns.

    df.set_index(
        ['key', 'row', 'item', 'col'])['val0'].unstack(['item', 'col']).fillna(0).sort_index(axis=1)

Question 9

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

pd.DataFrame.pivot_table

    df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')

    col   col0  col1  col2  col3  col4
    row
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1

pd.DataFrame.groupby

    df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)

pd.crosstab

    pd.crosstab(df['row'], df['col'])

pd.factorize + np.bincount

    # get integer factorization `i` and unique values `r`
    # for column `'row'`
    i, r = pd.factorize(df['row'].values)
    # get integer factorization `j` and unique values `c`
    # for column `'col'`
    j, c = pd.factorize(df['col'].values)
    # `n` will be the number of rows
    # `m` will be the number of columns
    n, m = r.size, c.size
    # `i * m + j` is a clever way of counting the
    # factorization bins assuming a flat array of length
    # `n * m`.  Which is why we subsequently reshape as `(n, m)`
    b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
    # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
    pd.DataFrame(b, r, c)

          col3  col2  col0  col1  col4
    row3     2     0     0     1     0
    row2     1     2     1     0     2
    row0     1     0     1     2     1
    row4     2     2     0     1     1

pd.get_dummies

Recent pandas versions return boolean dummies by default, so dtype=int is passed to keep the dot product a count.

    pd.get_dummies(df['row'], dtype=int).T.dot(pd.get_dummies(df['col'], dtype=int))

          col0  col1  col2  col3  col4
    row0     1     2     0     1     1
    row2     1     0     2     1     2
    row3     0     1     0     2     0
    row4     0     1     2     2     1

Question 10

How do I convert a DataFrame from long to wide by pivoting on ONLY two columns?

DataFrame.pivot

The first step is to assign a number to each row; this number will be the row index of that value in the pivoted result.
This is done using GroupBy.cumcount:

    df2.insert(0, 'count', df2.groupby('A').cumcount())
    df2

       count  A   B
    0      0  a   0
    1      1  a  11
    2      2  a   2
    3      3  a  11
    4      0  b  10
    5      1  b  10
    6      2  b  14
    7      0  c   7

The second step is to use the newly created column as the index to call DataFrame.pivot.

    df2.pivot(index='count', columns='A', values='B')
    # older pandas versions also accept the positional shorthand df2.pivot(*df2)

    A         a     b    c
    count
    0       0.0  10.0  7.0
    1      11.0  10.0  NaN
    2       2.0  14.0  NaN
    3      11.0   NaN  NaN

DataFrame.pivot_table

Whereas DataFrame.pivot only accepts columns, DataFrame.pivot_table also accepts arrays, so the GroupBy.cumcount can be passed directly as the index without creating an explicit column.

    df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')

    A         a     b    c
    0       0.0  10.0  7.0
    1      11.0  10.0  NaN
    2       2.0  14.0  NaN
    3      11.0   NaN  NaN

Question 11

How do I flatten the multiple index to a single index after pivot?

If the column labels are all strings (object dtype), join them:

    df.columns = df.columns.map('|'.join)

Otherwise, use format:

    df.columns = df.columns.map('{0[0]}|{0[1]}'.format)
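
To make the two flattening variants from Question 11 concrete, here is a small sketch on made-up data. The frame toy and its columns (row, item, col, num, val) are hypothetical and exist only for illustration; they are not the df used in the rest of this answer.

```python
import pandas as pd

# Hypothetical toy data (not the answer's `df`). Each (row, item, col)
# combination is unique, so 'mean' simply returns the single value.
toy = pd.DataFrame({
    'row':  ['r0', 'r0', 'r1', 'r1'],
    'item': ['i0', 'i1', 'i0', 'i1'],
    'col':  ['c0', 'c1', 'c0', 'c1'],
    'num':  [1, 2, 1, 2],
    'val':  [1.0, 2.0, 3.0, 4.0],
})

# Two string column levels: each (item, col) tuple can be joined directly.
wide = toy.pivot_table(values='val', index='row',
                       columns=['item', 'col'], aggfunc='mean')
wide.columns = wide.columns.map('|'.join)
print(wide.columns.tolist())   # ['i0|c0', 'i1|c1']

# One integer level: str.join would raise TypeError on the integers,
# so each tuple is formatted instead.
mixed = toy.pivot_table(values='val', index='row',
                        columns=['item', 'num'], aggfunc='mean')
mixed.columns = mixed.columns.map('{0[0]}|{0[1]}'.format)
print(mixed.columns.tolist())  # ['i0|1', 'i1|2']
```

The format string is written for exactly two levels; with more column levels it would need one placeholder per level.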
