Newbie – customer profiling in Python (pandas) using loc()

I’m a newbie, so please excuse me if I use incorrect terms. I have a df with customer purchasing info, and customers are identified by a unique user_id. Each item a user_id buys in a transaction creates a new row (if a customer buys 5 products in 1 transaction, 5 different rows are created, each with that product’s info).

I have created customer profiles based on 4 variables (income, age, dept id & parental status) using loc. It runs, but the outcome isn’t what I want. There are 106,143 customers in the df and 30,964,564 rows. The profiles I created (young parent, young single adult, higher earner, over 60, other [‘other’ to catch anything not assigned one of the other profiles]) are being assigned to each row rather than to each user_id. For example, user_id 1 buys 5 items; 1 of those rows matches the conditions of ‘young parent’ and the rest are assigned ‘other’. This is my code:

[code not preserved in the source]

This is the result:

[output not preserved in the source]

What I actually want is, “if ‘young parent’ (or any profile) is assigned even once to a user_id, then all ‘other’ for that user_id must be changed to ‘young parent’ too” (a customer cannot have 2 profiles!). So, the above results should show ‘young parent’ in each row.

Is this possible? Am I using the wrong function? My knowledge is limited and any advice would be appreciated!

Answer

Mask the Other values in the customer_profile column (turning them into NaN), then group the column by user_id and transform with first, which selects the first non-NaN value per user_id and broadcasts it to every row of that user.

[code not preserved in the source]
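
A minimal sketch of the approach described above, not necessarily the exact code from the answer. The sample rows are made up for illustration; only the column names user_id and customer_profile and the label Other come from the question.

```python
import pandas as pd

# Hypothetical sample data: user 1 has one row tagged 'young parent',
# the rest of that user's rows were tagged 'Other'.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "customer_profile": [
        "Other", "young parent", "Other",  # user 1
        "Other", "Other",                  # user 2 (no real profile)
        "over 60",                         # user 3
    ],
})

profile = df["customer_profile"]

# Step 1: turn 'Other' into NaN so only real profiles remain.
masked = profile.mask(profile.eq("Other"))

# Step 2: per user_id, take the first non-NaN profile and
# broadcast it to all of that user's rows.
per_user = masked.groupby(df["user_id"]).transform("first")

# Step 3: users with no real profile are still NaN; label them 'Other'.
df["customer_profile"] = per_user.fillna("Other")

print(df)
```

Note that first keeps whichever real profile appears first for a user. If a single user_id can match more than one profile on different rows, you would need to decide on a priority order yourself; this sketch does not do that.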

To simplify this further, you can skip the final step in your code where you use fillna to fill in the Other values: to use groupby this way, those values have to be masked back to NaN anyway, so that fillna is redundant. Leave the unassigned rows as NaN, apply the groupby, and fill Other once at the end.

[code not preserved in the source]
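
A sketch of what that simplified flow could look like. The asker's real conditions are not shown, so the columns age and department_id, the thresholds, and the rules below are all assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data and column names (age, department_id are assumed).
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "age": [28, 28, 65, 65, 40],
    "department_id": [18, 4, 4, 7, 4],
})

# Start with an all-NaN object column instead of filling 'Other' up front.
df["customer_profile"] = pd.Series(np.nan, index=df.index, dtype="object")

# Assumed row-level rules, assigned with loc as in the question.
df.loc[(df["age"] < 30) & (df["department_id"] == 18), "customer_profile"] = "young parent"
df.loc[df["age"] >= 60, "customer_profile"] = "over 60"

# Unmatched rows are already NaN, so no masking is needed:
# spread the first non-NaN profile across each user's rows,
# then fill 'Other' once for users who matched nothing.
df["customer_profile"] = (
    df.groupby("user_id")["customer_profile"]
      .transform("first")
      .fillna("Other")
)

print(df[["user_id", "customer_profile"]])
```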
User contributions licensed under: CC BY-SA