I have a problem regarding my pandas data frame which contains row level for users, such as which group they belong to, their country, type, and total number of impressions from that user. An example slice of my df:
┌──────────┬───────────┬─────────────┬──────────┬──────────────────┐ │ userID │ userGroup │ userCountry │ userType │ totalImpressions │ ├──────────┼───────────┼─────────────┼──────────┼──────────────────┤ │ Xn2r5u8x │ one │ USA │ premium │ 7454 │ │ fTjWnZr4 │ two │ USA │ basic │ 657 │ │ MbQeThWm │ two │ USA │ standard │ 578 │ │ Xp2s5v8y │ two │ JP │ core │ 34509 │ │ gUkXn2r5 │ three │ JP │ core │ 43 │ │ NcRfUjXn │ four │ UK │ premium │ 85656 │ │ WmZq4t7w │ four │ USA │ core │ 3456 │ │ eThVmYq3 │ four │ JP │ standard │ 7689 │ │ KbPeShVk │ four │ UK │ standard │ 92834 │ └──────────┴───────────┴─────────────┴──────────┴──────────────────┘
As you can see, the info is 1 row per user, where users in a group can belong to different country, type and different no of total impressions.
What I would like to do is roll this data up to the userGroup
level, getting rid of the userID
, keeping the userCountry
and userType
of the user with the highest number of totalImpressions
, and summing up the totalImpression
for all users in that group. This should result in a data frame like:
┌───────────┬──────────────┬───────────┬───────────────────────┐ │ userGroup │ groupCountry │ groupType │ groupTotalImpressions │ ├───────────┼──────────────┼───────────┼───────────────────────┤ │ one │ USA │ premium │ 7454 │ │ two │ JP │ core │ 35744 │ │ three │ JP │ core │ 43 │ │ four │ UK │ standard │ 189635 │ └───────────┴──────────────┴───────────┴───────────────────────┘
As you can see groupCountry
and groupType
are coming from the user within that group with the highest totalImpression
rather than the first row value in a group.
Is this something possible in pandas, I know I could aggregate using pd.groupby
, but from here I am not sure how to select the country/type of the the top user by totalImpressions. Any help would be greatly appreciated!
To generate my example data :
import pandas as pd lst = [['Xn2r5u8x', 'one', 'USA', 'premium', 7454], ['fTjWnZr4', 'two', 'USA', 'basic', 657], ['MbQeThWm', 'two', 'USA', 'standard', 578], ['Xp2s5v8y', 'two', 'JP', 'core', 34509], ['gUkXn2r5', 'three', 'JP', 'core', 43], ['NcRfUjXn', 'four', 'UK', 'premium', 85656], ['WmZq4t7w', 'four', 'USA', 'core', 3456], ['eThVmYq3', 'four', 'JP', 'standard', 7689], ['KbPeShVk', 'four', 'UK', 'standard', 92834]] df = pd.DataFrame(lst, columns =['userID', 'userGroup', 'userCountry', 'userType', 'totalImpressions'])```
Advertisement
Answer
If you sort ascending your dataframe by the totalImpressions
column, you just have to keep the last row for each group and sum impressions.
Use groupby.agg
:
out = df.sort_values('totalImpressions').groupby('userGroup', as_index=False) .agg({'userGroup': 'last', 'userCountry': 'last', 'userType': 'last', 'totalImpressions': 'sum'}) print(out) # Output userGroup userCountry userType totalImpressions 0 four UK standard 189635 1 one USA premium 7454 2 three JP core 43 3 two JP core 35744