Having a pandas
data frame containing two columns of type str
:
group sc wc 0 1 A word1 1 2 B word2 2 2 C word3 3 1 D word4
which is created as follows:
df = pd.DataFrame({"group":[1,2,2,1],"sc":["A","B","C","D"],"wc":["word1", "word2", "word3","word4"]})
When grouping by group
and joining the individual columns, I can use:
df.groupby("group",as_index=False).agg(lambda x: '|'.join(x)) group sc wc 0 1 A|D word1|word4 1 2 B|C word2|word3
However, when specifying a single column (wc
) to perform this operation on:
df.groupby("group",as_index=False)["wc"].agg(lambda x: '|'.join(x)) group wc 0 1 group|sc|wc 1 2 group|sc|wc
which appears to be a join
on the column names. But why is it handled this way?
A proper implementation would make use of apply
:
df.groupby("group",as_index=False)["wc"].apply(lambda x: '|'.join(x)) group wc 0 1 word1|word4 1 2 word2|word3
I stumbled upon this, as I wanted to avoid apply for larger dataframes to due performance issues (in my case it is 4-fold increase of speed using agg
over apply
).
What actually want to do is a join
each value of sc
and wc
and then combine the groups into a single string like:
df["merged"] = df.sc + "|" + df.wc df.groupby("group",as_index=False).agg('|'.join)) group sc wc merged 0 1 A|D word1|word4 A|word1|D|word4 1 2 B|C word2|word3 B|word2|C|word3
There is even more to it, once integers are used:
df = pd.DataFrame({"group":[1,2,2,1],"sc":[3,33,333,3333],"wc":[1,10,100,1000]}) df["merged"] = df.sc.astype(str) + "|" + df.wc.astype(str) df.groupby(["group"],as_index=False).agg('|'.join) group merged 0 1 3|1|3333|1000 1 2 33|10|333|100
This indicates that join
is only run on the string columns.
The consecutive join
and agg
saves me a lot of computational time but does not feel right. Any insights are welcome!
Advertisement
Answer
I’m pretty sure this is a bug related to GroupBy.agg
that manifests because of as_index=False
– the entire subgroup DataFrame is passed to agg
. Remove that and the output is as expected.
df.groupby("group")["wc"].agg('|'.join).reset_index() group wc 0 1 word1|word4 1 2 word2|word3
When the subgroup is passed with its columns, calling str.join
will join the column names, like so
'|'.join(df) # 'group|sc|wc' # this joins on the column names because iteration # on dataFrames devolves to iteration over headers
Do note that there is little difference between using agg
and apply
with a function that is non-cythonized (or at the very least, does not have fastpaths).