Given the following dataframe,
is it possible to calculate the sum of col2 and the sum of col2 + col3,
in a single aggregating function?
import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'], 'col2': [1, 2, 3, 4], 'col3': [10, 20, 30, 40]})
| . | col1 | col2 | col3 |
|---|---|---|---|
| 0 | a | 1 | 10 |
| 1 | a | 2 | 20 |
| 2 | b | 3 | 30 |
| 3 | b | 4 | 40 |
In R’s dplyr I would do it with a single line of summarize,
and I was wondering what might be the equivalent in pandas:
df %>% group_by(col1) %>% summarize(col2_sum = sum(col2), col23_sum = sum(col2 + col3))
Desired result:
| . | col1 | col2_sum | col23_sum |
|---|---|---|---|
| 0 | a | 3 | 33 |
| 1 | b | 7 | 77 |
Answer
Let us try assigning the new column first, then grouping and summing:
out = df.assign(col23=df.col2 + df.col3).groupby('col1', as_index=False).sum()
Out[81]:
  col1  col2  col3  col23
0    a     3    30     33
1    b     7    70     77
From my understanding, apply is closer to summarize in R:
# Parentheses let the method chain span several lines
out = (
    df.groupby('col1')
      .apply(lambda x: pd.Series({'col2_sum': x['col2'].sum(),
                                  'col23_sum': (x['col2'] + x['col3']).sum()}))
      .reset_index()
)
Out[83]:
col1 col2_sum col23_sum
0 a 3 33
1 b 7 77
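A third option, not part of the original answer, is named aggregation. This is a minimal sketch that assumes pandas 0.25 or later: it keeps the assign step from the first approach, but selects and renames the output columns in a single agg call, so col3 does not appear in the result. The helper column name col23 is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'],
                   'col2': [1, 2, 3, 4],
                   'col3': [10, 20, 30, 40]})

# Build the helper column first, then use named aggregation:
# each keyword is an output column, each tuple is (input column, function).
out = (
    df.assign(col23=df['col2'] + df['col3'])
      .groupby('col1', as_index=False)
      .agg(col2_sum=('col2', 'sum'),
           col23_sum=('col23', 'sum'))
)
print(out)
```

This should give the same two-row result as the apply version, and it usually avoids calling a Python lambda once per group.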