Skip to content
Advertisement

Pandas: groupby followed by aggregate – unexpected behaviour when joining strings

Having a pandas data frame containing two columns of type str:

    group   sc  wc
0   1       A   word1
1   2       B   word2
2   2       C   word3
3   1       D   word4

which is created as follows:

df = pd.DataFrame({"group":[1,2,2,1],"sc":["A","B","C","D"],"wc":["word1", "word2", "word3","word4"]})

When grouping by group and joining the individual columns, I can use:

df.groupby("group",as_index=False).agg(lambda x: '|'.join(x))

    group   sc  wc
0   1   A|D word1|word4
1   2   B|C word2|word3

However, when specifying a single column (wc) to perform this operation on:

df.groupby("group",as_index=False)["wc"].agg(lambda x: '|'.join(x))

    group   wc
0   1   group|sc|wc
1   2   group|sc|wc

which appears to be a join on the column names. But why is it handled this way?

A proper implementation would make use of apply:

df.groupby("group",as_index=False)["wc"].apply(lambda x: '|'.join(x))

    group   wc
0   1   word1|word4
1   2   word2|word3

I stumbled upon this, as I wanted to avoid apply for larger dataframes to due performance issues (in my case it is 4-fold increase of speed using agg over apply).

What actually want to do is a join each value of sc and wc and then combine the groups into a single string like:

df["merged"] = df.sc + "|" + df.wc
df.groupby("group",as_index=False).agg('|'.join))

    group   sc  wc  merged
0   1   A|D word1|word4 A|word1|D|word4
1   2   B|C word2|word3 B|word2|C|word3

There is even more to it, once integers are used:

df = pd.DataFrame({"group":[1,2,2,1],"sc":[3,33,333,3333],"wc":[1,10,100,1000]})
df["merged"] = df.sc.astype(str) + "|" + df.wc.astype(str)
df.groupby(["group"],as_index=False).agg('|'.join)

    group   merged
0   1   3|1|3333|1000
1   2   33|10|333|100

This indicates that join is only run on the string columns.

The consecutive join and agg saves me a lot of computational time but does not feel right. Any insights are welcome!

Advertisement

Answer

I’m pretty sure this is a bug related to GroupBy.agg that manifests because of as_index=False – the entire subgroup DataFrame is passed to agg. Remove that and the output is as expected.

df.groupby("group")["wc"].agg('|'.join).reset_index()
 
   group           wc
0      1  word1|word4
1      2  word2|word3

When the subgroup is passed with its columns, calling str.join will join the column names, like so

'|'.join(df)
# 'group|sc|wc'  # this joins on the column names because iteration 
                 # on dataFrames devolves to iteration over headers

Do note that there is little difference between using agg and apply with a function that is non-cythonized (or at the very least, does not have fastpaths).

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement