Skip to content
Advertisement

Group by the column in df python

I have a simple df. It has two columns. I want to groupby the values based on column a. Here is a simple example: Any input would be greatly appreciated!

import pandas as pd
import numpy as np
df = pd.DataFrame()
df['a'] = [1, 2, 3, 4, 1, 2]
df['b'] = [10, 20, 30,40, 50,60]

Desired output is:

 df = pd.DataFrame()
 df['a'] = [1, 2, 3, 4]
 df['b'] = [10, 20, 30,40 ]
 df['b1'] = [50, 60, np.nan, np.nan ]

df

Advertisement

Answer

Here’s a way to do what you want. First you want to group by column ‘a’. Normally groupby is used to calculate group aggregation functions:

df.groupby('a')['b'].mean()

but in this case we want to keep the values of b associated with each a. You can use

[(a,list(b)) for a,b in df.groupby('a')['b']]    

[(1, [10, 50]), (2, [20, 60]), (3, [30]), (4, [40])]

Conversion of this to a dataframe almost gets us there:

df2 = pd.DataFrame([(a,list(b)) for a,b in df.groupby('a')['b']],
                   columns=['a','temp'])

   a      temp
0  1  [10, 50]
1  2  [20, 60]
2  3      [30]
3  4      [40]

The column temp can be separated into different columns with to_list:

pd.DataFrame(df2['temp'].to_list())

    0     1
0  10  50.0
1  20  60.0
2  30   NaN
3  40   NaN

Rejoin the output dataframes:

df2.join(df3)

   a      temp   0     1
0  1  [10, 50]  10  50.0
1  2  [20, 60]  20  60.0
2  3      [30]  30   NaN
3  4      [40]  40   NaN

And clean up (remove temp column, rename columns and you probably want to do something about the integers cast to floats in the last column due to the NaNs)

I’m sure there’s a cleaner way to do this but hopefully this gets you started!

Advertisement