Skip to content
Advertisement

Pandas rolling up column values based upon max value in column when aggregating

I have a problem regarding my pandas data frame which contains row level for users, such as which group they belong to, their country, type, and total number of impressions from that user. An example slice of my df:

┌──────────┬───────────┬─────────────┬──────────┬──────────────────┐
│  userID  │ userGroup │ userCountry │ userType │ totalImpressions │
├──────────┼───────────┼─────────────┼──────────┼──────────────────┤
│ Xn2r5u8x │ one       │ USA         │ premium  │             7454 │
│ fTjWnZr4 │ two       │ USA         │ basic    │              657 │
│ MbQeThWm │ two       │ USA         │ standard │              578 │
│ Xp2s5v8y │ two       │ JP          │ core     │            34509 │
│ gUkXn2r5 │ three     │ JP          │ core     │               43 │
│ NcRfUjXn │ four      │ UK          │ premium  │            85656 │
│ WmZq4t7w │ four      │ USA         │ core     │             3456 │
│ eThVmYq3 │ four      │ JP          │ standard │             7689 │
│ KbPeShVk │ four      │ UK          │ standard │            92834 │
└──────────┴───────────┴─────────────┴──────────┴──────────────────┘

As you can see, the info is 1 row per user, where users in a group can belong to different country, type and different no of total impressions.

What I would like to do is roll this data up to the userGroup level, getting rid of the userID, keeping the userCountry and userType of the user with the highest number of totalImpressions , and summing up the totalImpression for all users in that group. This should result in a data frame like:

┌───────────┬──────────────┬───────────┬───────────────────────┐
│ userGroup │ groupCountry │ groupType │ groupTotalImpressions │
├───────────┼──────────────┼───────────┼───────────────────────┤
│ one       │ USA          │ premium   │                  7454 │
│ two       │ JP           │ core      │                 35744 │
│ three     │ JP           │ core      │                    43 │
│ four      │ UK           │ standard  │                189635 │
└───────────┴──────────────┴───────────┴───────────────────────┘

As you can see groupCountry and groupType are coming from the user within that group with the highest totalImpression rather than the first row value in a group.

Is this something possible in pandas, I know I could aggregate using pd.groupby, but from here I am not sure how to select the country/type of the the top user by totalImpressions. Any help would be greatly appreciated!

To generate my example data :

import pandas as pd 

lst = [['Xn2r5u8x', 'one', 'USA', 'premium', 7454],
['fTjWnZr4', 'two', 'USA', 'basic', 657],
['MbQeThWm', 'two', 'USA', 'standard', 578],
['Xp2s5v8y', 'two', 'JP', 'core', 34509],
['gUkXn2r5', 'three', 'JP', 'core', 43],
['NcRfUjXn', 'four', 'UK', 'premium', 85656],
['WmZq4t7w', 'four', 'USA', 'core', 3456],
['eThVmYq3', 'four', 'JP', 'standard', 7689],
['KbPeShVk', 'four', 'UK', 'standard', 92834]]

df = pd.DataFrame(lst, columns =['userID', 'userGroup', 'userCountry', 'userType', 'totalImpressions'])```

Advertisement

Answer

If you sort ascending your dataframe by the totalImpressions column, you just have to keep the last row for each group and sum impressions.

Use groupby.agg:

out = df.sort_values('totalImpressions').groupby('userGroup', as_index=False) 
        .agg({'userGroup': 'last', 'userCountry': 'last',
              'userType': 'last', 'totalImpressions': 'sum'})
print(out)

# Output
  userGroup userCountry  userType  totalImpressions
0      four          UK  standard            189635
1       one         USA   premium              7454
2     three          JP      core                43
3       two          JP      core             35744
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement