Given the following dataframe,
is it possible to calculate the sum of col2
and the sum of col2 + col3
,
in a single aggregating function?
import pandas as pd df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'], 'col2': [1, 2, 3, 4], 'col3': [10, 20, 30, 40]})
. | col1 | col2 | col3 |
---|---|---|---|
0 | a | 1 | 10 |
1 | a | 2 | 20 |
2 | b | 3 | 30 |
3 | b | 4 | 40 |
In R’s dplyr I would do it with a single line of summarize
,
and I was wondering what might be the equivalent in pandas:
df %>% group_by(col1) %>% summarize(col2_sum = sum(col2), col23_sum = sum(col2 + col3))
Desired result:
. | col1 | col2_sum | col23_sum |
---|---|---|---|
0 | a | 3 | 33 |
1 | b | 7 | 77 |
Advertisement
Answer
Let us try assign
the new column first
out = df.assign(col23 = df.col2+df.col3).groupby('col1',as_index=False).sum()
Out[81]:
col1 col2 col3 col23 0 a 3 30 33 1 b 7 70 77
From my understanding the apply
is more like the summarize
in R
out = df.groupby('col1'). apply(lambda x : pd.Series({'col2_sum':x['col2'].sum(), 'col23_sum':(x['col2'] + x['col3']).sum()})). reset_index() Out[83]: col1 col2_sum col23_sum 0 a 3 33 1 b 7 77