I’m obviously dealing with slightly more complex and realistic data, but to showcase my trouble, let’s assume we have these data:
import pandas as pd
import numpy as np

purchases_df = pd.DataFrame({
    "user_id": [100, 101, 100, 101, 200],
    "date": ['2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01', '2022-01-01'],
    "purchase": ['cookies', 'jam', 'jam', 'jam', 'cashews'],
})
I want to find the modal values of purchases per date and user:
agg_mode = purchases_df.groupby(['date', 'user_id'])['purchase'].agg(pd.Series.mode)
agg_mode
agg_mode will show that for user_id 100 we have two modal values: [cookies, jam]. This is totally fine with me; for the real data we've come up with a set of rules for which mode to pick if there's a tie. The problem is that, to use that heuristic, I need to be able to check whether the returned set of modal values contains certain values (let's say, if cookies and jam are both returned, we'd always stick to jam only). I can't find a simple way to process the returned multimodal values:
agg_mode_df = purchases_df.groupby(['date', 'user_id'])['purchase'].agg(pd.Series.mode).to_frame()
agg_mode_df.reset_index(inplace=True)
agg_mode_df
agg_mode_df is a DataFrame, and the purchase column (which now holds the modal values) has object dtype, holding a numpy ndarray wherever a user_id has more than one mode. I couldn't find a working way to convert the modal value(s) of every single user to a list.
Am I overthinking this?
Answer
IIUC, try:
agg_mode = purchases_df.groupby(['date', 'user_id'])['purchase'].agg(lambda x: x.mode().tolist())
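Once every group holds a plain Python list, the tie-break rule from the question can be applied with an ordinary membership check. Below is a minimal sketch of that idea. The names PREFERRED and pick_mode are hypothetical, and the priority order (jam before cookies) is only an assumption based on the example in the question; the real rules would go in their place.

```python
import pandas as pd

purchases_df = pd.DataFrame({
    "user_id": [100, 101, 100, 101, 200],
    "date": ["2022-01-01"] * 5,
    "purchase": ["cookies", "jam", "jam", "jam", "cashews"],
})

# Hypothetical priority order: the earliest entry found among the tied modes wins.
PREFERRED = ["jam", "cookies"]


def pick_mode(modes):
    """Pick one value from a list of tied modes (hypothetical helper)."""
    for candidate in PREFERRED:
        if candidate in modes:
            return candidate
    # No preferred value among the modes: fall back to the first one.
    return modes[0]


# Same aggregation as in the answer: one list of modes per (date, user_id).
agg_mode = (
    purchases_df.groupby(["date", "user_id"])["purchase"]
    .agg(lambda x: x.mode().tolist())
)

agg_mode_df = agg_mode.to_frame().reset_index()
# Collapse each list of modes to a single value using the tie-break rule.
agg_mode_df["chosen"] = agg_mode_df["purchase"].map(pick_mode)
print(agg_mode_df)
```

With the sample data, user_id 100 has the tied modes cookies and jam, so the sketch should settle on jam for that user, while the other users keep their single mode.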