I’m a newbie, so please excuse me if I use incorrect terms. I have a df with customer purchasing info and customers are identified by a unique user_id. Each item a user_id buys in each transaction creates a new row (if a customer buys 5 products in 1 transaction, 5 different rows are created with that products info).
I have created customer profiles based on 4 variables (income, age, dept id & parental status) using the loc function. It has worked, however, the outcome isn’t what I want. There are 106,143 customers in the df and 30,964,564 rows. The profiles I created (young parent, young single adult, higher earner, over 60, other [‘other’ to catch anything not assigned one of the other profiles]) are being assigned to each row, rather than to each user_id e.g. user_id 1 buys 5 items, 1 of which matches the conditions of ‘young parent’, the rest are assigned ‘other’. This is my code:
# create flag for 'High earner' (age: 40-59, income: 400000+, department_id: 1, 4, 7, 19, 16, parental_status: Parent) final_df.loc[(final_df['age_range'].isin(['40-49', '50-59'])) & (final_df['income'] >= 400000) & (final_df['department_id'].isin([1, 4, 7, 19, 16])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Higher earner'
# create flag for 'Young single adult' (age: <39, income: <=199999, department_id: 1, 4, 7, 19, parental_status: Non-parent) final_df.loc[(final_df['age'] <= 39) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([1, 4, 7, 19])) & (final_df['parental_status'] == 'Non-parent'), 'customer_profile'] = 'Young single adult'
# create flag for 'Young parent' (age: 20-39, income: <=199999, department_id: 4, 13, 16, 17, 18 parental_status: Parent) final_df.loc[(final_df['age_range'].isin(['20-29', '30-39'])) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([4, 13, 16, 17, 18])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Young parent'
# create flag for 'Over 60' (age: 60+, income: <=199999, department_id: 1, 4, 11, 12, 15, 20 parental_status: Parent) final_df.loc[(final_df['age'] >= 60) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([1, 4, 11, 12, 15, 20])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Over 60'
# impute all NaN values as 'Other' final_df['customer_profile'].fillna('Other', inplace=True)
This is the result:
user_id customer_profile 0 1 Other 1 1 Other 2 1 Other 3 1 Other 4 1 Other 5 1 Other 6 1 Other 7 1 Other 8 1 Other 9 1 Other 10 1 Young parent 11 1 Other 12 1 Other 13 1 Other 14 1 Other 15 1 Other 16 1 Other 17 1 Other 18 1 Other 19 1 Other 20 1 Other 21 1 Other 22 1 Other 23 1 Young parent 24 1 Young parent
What I actually want is, “if ‘young parent’ (or any profile) is assigned even once to a user_id, then all ‘other’ for that user_id must be changed to ‘young parent’ too” (a customer cannot have 2 profiles!). So, the above results should show ‘young parent’ in each row.
Is this possible? Am I using the wrong function? My knowledge is limited and any advice would be appreciated!
Advertisement
Answer
Mask
the Other
values in the customer_profile
column, then group
the column by user_id
and transform
with first
to select the first non-nan value per user_id
m = df['customer_profile'].eq('Other') df['customer_profile'] = df['customer_profile'].mask(m) .groupby(df['user_id']).transform('first')
To further simplify this you can skip the final step in your code where you are using fillna
to fill the Other
values because to use groupby we have to mask this values back to NaN
. So fillna
is a redundant step.
df['customer_profile'] = df.groupby('user_id')['customer_profile'].transform('first')
user_id customer_profile 0 1 Young parent 1 1 Young parent 2 1 Young parent 3 1 Young parent 4 1 Young parent 5 1 Young parent 6 1 Young parent 7 1 Young parent ... 23 1 Young parent 24 1 Young parent