Pandas: groupby followed by aggregate – unexpected behaviour when joining strings

I have a pandas DataFrame containing two columns of type str (sc and wc), which is created as follows:

import pandas as pd

df = pd.DataFrame({"group": [1, 2, 2, 1], "sc": ["A", "B", "C", "D"], "wc": ["word1", "word2", "word3", "word4"]})

When grouping by group and joining the individual columns, I can use:

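A minimal sketch of this kind of grouped join, assuming the string columns are selected explicitly and a single space is used as the separator (the name joined is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "group": [1, 2, 2, 1],
    "sc": ["A", "B", "C", "D"],
    "wc": ["word1", "word2", "word3", "word4"],
})

# Select the string columns, then join the values of each column within each group.
joined = df.groupby("group")[["sc", "wc"]].agg(" ".join)
print(joined)
```

The result is indexed by group, with one joined string per column and group.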

However, when specifying a single column (wc) to perform this operation on, the result appears to be a join on the column names. But why is it handled this way?

A proper implementation would make use of apply:

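A sketch of what an apply-based version could look like, assuming a single space as the separator (the name wc_joined is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "group": [1, 2, 2, 1],
    "sc": ["A", "B", "C", "D"],
    "wc": ["word1", "word2", "word3", "word4"],
})

# apply passes each group's "wc" values as a Series to str.join,
# which iterates over the values and concatenates them.
wc_joined = df.groupby("group")["wc"].apply(" ".join)
print(wc_joined)
```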

I stumbled upon this because I wanted to avoid apply on larger DataFrames due to performance issues (in my case, agg is about four times faster than apply).

What I actually want to do is join each value of sc and wc, and then combine each group into a single string, like:

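One possible reading of this goal, sketched under assumptions: first concatenate sc and wc row by row, then join the combined strings within each group. The single-space separator, the ordering of the parts, and the names pairs and combined are illustrative choices, not taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "group": [1, 2, 2, 1],
    "sc": ["A", "B", "C", "D"],
    "wc": ["word1", "word2", "word3", "word4"],
})

# Step 1: vectorised row-wise concatenation of the two string columns.
pairs = df["sc"] + " " + df["wc"]

# Step 2: group the combined strings by the "group" column and join them.
combined = pairs.groupby(df["group"]).agg(" ".join)
print(combined)
```

Doing the row-wise concatenation first keeps that step vectorised, so only the final per-group join runs as a Python-level function.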

There is even more to it once integers are used: the result indicates that join is only run on the string columns.

The consecutive join and agg saves me a lot of computational time but does not feel right. Any insights are welcome!

Answer

I’m pretty sure this is a bug related to GroupBy.agg that manifests because of as_index=False – the entire subgroup DataFrame is passed to agg. Remove that and the output is as expected.

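A minimal sketch of the suggestion, assuming the call simply drops as_index=False and restores the group column afterwards with reset_index (the names result and result_frame are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "group": [1, 2, 2, 1],
    "sc": ["A", "B", "C", "D"],
    "wc": ["word1", "word2", "word3", "word4"],
})

# Without as_index=False, agg receives each group's "wc" values as a Series.
result = df.groupby("group")["wc"].agg(" ".join)

# If a flat DataFrame is wanted, turn the group index back into a column.
result_frame = result.reset_index()
print(result_frame)
```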

When the subgroup is passed with its columns, calling str.join will join the column names, like so

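The effect can be reproduced outside of groupby. Iterating over a DataFrame yields its column labels, whereas iterating over a Series yields its values, so str.join behaves differently on the two. A small sketch using the frame from the question (the names on_frame and on_series are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "group": [1, 2, 2, 1],
    "sc": ["A", "B", "C", "D"],
    "wc": ["word1", "word2", "word3", "word4"],
})

# Iterating a DataFrame yields column labels, so the labels get joined.
on_frame = " ".join(df)

# Iterating a Series yields its values, so the values get joined.
on_series = " ".join(df["wc"])

print(on_frame)
print(on_series)
```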

Do note that there is little difference between using agg and apply with a function that is non-cythonized (or at the very least, does not have fastpaths).

User contributions licensed under: CC BY-SA