plyr or dplyr in Python

This is more of a conceptual question; I do not have a specific problem.

I am learning Python for data analysis, but I am very familiar with R. One of the great things about R is plyr (and of course ggplot2), and even better, dplyr. Pandas has split-apply as well, of course; however, in R I can do things like the following (in dplyr; it is a bit different in plyr, and I can now see how dplyr mimics the . notation from object-oriented programming):

   data %.% group_by(c(.....)) %.% summarise(new1 = ...., new2 = ...., ..... newn=....)

in which I create multiple summary calculations at the same time.

How do I do that in Python? As far as I can tell,

   df[...].groupby(.....).sum()

only sums columns, while in R I can have one mean, one sum, one special function, etc. in a single call.

I realize I can do all my operations separately and merge them, and that is fine if I am using Python. But when it comes to choosing a tool, every line of code you do not have to type, check, and validate saves time.

In addition, in dplyr you can add mutate statements to the same chain, so it seems to me that it is far more powerful. What am I missing about pandas or Python?

My goal is to learn. I have spent a lot of effort learning Python and it is a worthy investment, but the question remains.

Answer

I think you’re looking for the agg function, which is applied to groupby objects.

From the docs:

In [48]: grouped = df.groupby('A')

In [49]: grouped['C'].agg([np.sum, np.mean, np.std])
Out[49]: 
          sum      mean       std
A                                
bar  0.443469  0.147823  0.301765
foo  2.529056  0.505811  0.96
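
To get closer to the summarise(new1 = ..., new2 = ...) pattern from the question, where each output column has its own name and its own function, something like the following may work. This is only a sketch: the DataFrame and the names c_mean, d_sum, c_range and d_per_c are made up for illustration, and it assumes a pandas version with named aggregation (0.25 or later). The assign call at the end is a rough stand-in for a dplyr mutate step.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo"],
    "C": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D": [10, 20, 30, 40, 50],
})

# One groupby call, several named summaries, each with its own function.
# Each keyword is an output column: (input column, aggregation).
summary = df.groupby("A").agg(
    c_mean=("C", "mean"),                        # built-in, by name
    d_sum=("D", "sum"),                          # a different column and function
    c_range=("C", lambda s: s.max() - s.min()),  # a custom function
)

# mutate-like step: derive a new column from the summarised ones.
summary = summary.assign(d_per_c=lambda t: t["d_sum"] / t["c_mean"])

print(summary)
```

Whether this reads as compactly as the dplyr chain is a matter of taste, but it does keep all the summaries in a single call, so nothing has to be computed separately and merged.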


Answer