Skip to content
Advertisement

Can pandas perform an aggregating operation involving two columns?

Given the following dataframe,
is it possible to calculate the sum of col2 and the sum of col2 + col3,
in a single aggregating function?

import pandas as pd
df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b'], 'col2': [1, 2, 3, 4], 'col3': [10, 20, 30, 40]})
. col1 col2 col3
0 a 1 10
1 a 2 20
2 b 3 30
3 b 4 40

In R’s dplyr I would do it with a single line of summarize,
and I was wondering what might be the equivalent in pandas:

df %>% group_by(col1) %>% summarize(col2_sum = sum(col2), col23_sum = sum(col2 + col3))

Desired result:

. col1 col2_sum col23_sum
0 a 3 33
1 b 7 77

Advertisement

Answer

Let us try assign the new column first

out = df.assign(col23 = df.col2+df.col3).groupby('col1',as_index=False).sum()

Out[81]:

  col1  col2  col3  col23
0    a     3    30     33
1    b     7    70     77

From my understanding the apply is more like the summarize in R

out = df.groupby('col1').
           apply(lambda x : pd.Series({'col2_sum':x['col2'].sum(),
                                       'col23_sum':(x['col2'] + x['col3']).sum()})).
               reset_index()
Out[83]: 
  col1  col2_sum  col23_sum
0    a         3         33
1    b         7         77
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement