I'm currently performing the following operations on a dataframe (A) made of two columns, each with several thousand unique values.
>>> pd.DataFrame({
...     'col1': ['foo', 'bar', 'bar', 'foo', 'baz', 'bar', 'baz'],
...     'col2': ['abc', 'def', 'abc', 'abc', 'def', 'abc', 'ghi']
... })
  col1 col2
0  foo  abc
1  bar  def
2  bar  abc
3  foo  abc
4  baz  def
5  bar  abc
6  baz  ghi
The operations performed on this dataframe are:
res = df.groupby(['col1', 'col2']).size().unstack().fillna(0)
The output is a table (B) with unique values of col1 in rows and unique values of col2 in columns, and each cell is the count of rows from the original dataframe matching this combination of unique values.
>>> res
col2  abc  def  ghi
col1
bar   2.0  1.0  0.0
baz   0.0  1.0  1.0
foo   2.0  0.0  0.0
The amount of time spent in each operation is approximately the following:
groupby().size() -> 5%
unstack()        -> 15%
fillna(0)        -> 80%
The whole sequence can take about 30 minutes on a real dataset (similar structure as above, just more rows and more unique values).
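For reference, here is a minimal sketch of how those per-step timings could be measured; it assumes df is the two-column dataframe described above and only adds the standard-library time module:

import time

t0 = time.perf_counter()
grouped = df.groupby(['col1', 'col2']).size()  # ~5% of the total time
t1 = time.perf_counter()
unstacked = grouped.unstack()                  # ~15%
t2 = time.perf_counter()
res = unstacked.fillna(0)                      # ~80%
t3 = time.perf_counter()

print(f"groupby().size(): {t1 - t0:.2f}s")
print(f"unstack():        {t2 - t1:.2f}s")
print(f"fillna(0):        {t3 - t2:.2f}s")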
Is there a better/faster alternative to get from (A) the original dataframe to (B) the end-result table? The most costly operation is by far the final fillna(0), so I'd be interested in an alternative for this bit in particular, but an entirely different approach would be great as well.
Note: converting the strings to integers in the original df speeds up the groupby().size() operation by about 5x; however, it doesn't really affect the following operations.
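For context, a minimal sketch of what that conversion could look like, assuming df is the dataframe above and using pd.factorize to map each string to an integer code (the names codes1, uniques1 and df_int are just illustrative):

import pandas as pd

# replace each string column with integer codes; keep the uniques so the
# original labels can be mapped back onto the result's index/columns later
codes1, uniques1 = pd.factorize(df['col1'])
codes2, uniques2 = pd.factorize(df['col2'])
df_int = pd.DataFrame({'col1': codes1, 'col2': codes2})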
Answer
Take advantage of filling the NAs in the same step as you unstack by setting a fill_value:
>>> df.groupby(['col1', 'col2']).size().unstack(fill_value=0)
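On the example data above this should produce the same counts as before, and because no NaNs are introduced the result keeps an integer dtype instead of being upcast to float:

col2  abc  def  ghi
col1
bar     2    1    0
baz     0    1    1
foo     2    0    0

If an entirely different approach is also worth benchmarking, pd.crosstab(df['col1'], df['col2']) builds the same frequency table directly.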