The question is not so clear I guess, so here is an example: given a dataframe:
company_name | company_size | company_acitivity |
---|---|---|
7 eleven | 5 | restaurant |
7 eleven | 5 | supermarket |
7 eleven | 10 | supermarket |
goldman sachs | 100 | bank |
goldman sachs | 200 | restaurant |
goldman sachs | 200 | bank |
I want to group the dataframe by company name and then replace the values in the organization_size and organization_acitivity columns with the values that have the highest occurrence for the respective company and column.
So in the end the dataframe should look like this:
company_name | company_size | company_acitivity |
---|---|---|
7 eleven | 5 | supermarket |
goldman sachs | 200 | bank |
I tried this:
df.groupby("organization_name",group_keys=True)["organization_activity"].apply(lambda x: x.mode())
But it only gives me
“AttributeError: ‘SeriesGroupBy’ object has no attribute ‘mode'”.
Does someone have an idea for an easier way to do this?
Advertisement
Answer
You don’t want to select a column after the groupby, since you want to apply that to all the available columns.
Try this:
df.groupby('company_name').apply(lambda x: x.mode()).reset_index(drop=True)
Output:
company_name company_size company_acitivity 0 7 eleven 5 supermarket 1 goldman sachs 200 bank