Python

I’m a newbie, so please excuse me if I use incorrect terms. I have a df with customer purchasing info and customers are identified by a unique user_id. Each item a user_id buys in each transaction creates a new row (if a customer buys 5 products in 1 transaction, 5 different rows are created with that products info).

I have created customer profiles based on 4 variables (income, age, dept id & parental status) using the loc function. It has worked, however, the outcome isn’t what I want. There are 106,143 customers in the df and 30,964,564 rows. The profiles I created (young parent, young single adult, higher earner, over 60, other [‘other’ to catch anything not assigned one of the other profiles]) are being assigned to each row, rather than to each user_id e.g. user_id 1 buys 5 items, 1 of which matches the conditions of ‘young parent’, the rest are assigned ‘other’. This is my code:

# create flag for 'High earner' (age: 40-59, income: 400000+, department_id: 1, 4, 7, 19, 16, parental_status: Parent)

final_df.loc[(final_df['age_range'].isin(['40-49', '50-59'])) & (final_df['income'] >= 400000) & (final_df['department_id'].isin([1, 4, 7, 19, 16])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Higher earner'

JavaScript
​x
 
# create flag for 'High earner' (age: 40-59, income: 400000+, department_id: 1, 4, 7, 19, 16, parental_status: Parent)
​
final_df.loc[(final_df['age_range'].isin(['40-49', '50-59'])) & (final_df['income'] >= 400000) & (final_df['department_id'].isin([1, 4, 7, 19, 16])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Higher earner'
​

# create flag for 'Young single adult' (age: <39, income: <=199999, department_id: 1, 4, 7, 19, parental_status: Non-parent)

final_df.loc[(final_df['age'] <= 39) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([1, 4, 7, 19])) & (final_df['parental_status'] == 'Non-parent'), 'customer_profile'] = 'Young single adult'

JavaScript
 
# create flag for 'Young single adult' (age: <39, income: <=199999, department_id: 1, 4, 7, 19, parental_status: Non-parent)
​
final_df.loc[(final_df['age'] <= 39) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([1, 4, 7, 19])) & (final_df['parental_status'] == 'Non-parent'), 'customer_profile'] = 'Young single adult'
​

# create flag for 'Young parent' (age: 20-39, income: <=199999, department_id: 4, 13, 16, 17, 18 parental_status: Parent)

final_df.loc[(final_df['age_range'].isin(['20-29', '30-39'])) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([4, 13, 16, 17, 18])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Young parent'

JavaScript
 
# create flag for 'Young parent' (age: 20-39, income: <=199999, department_id: 4, 13, 16, 17, 18 parental_status: Parent)
​
final_df.loc[(final_df['age_range'].isin(['20-29', '30-39'])) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([4, 13, 16, 17, 18])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Young parent'
​

# create flag for 'Over 60' (age: 60+, income: <=199999, department_id: 1, 4, 11, 12, 15, 20 parental_status: Parent)

final_df.loc[(final_df['age'] >= 60) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([1, 4, 11, 12, 15, 20])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Over 60'

JavaScript
 
# create flag for 'Over 60' (age: 60+, income: <=199999, department_id: 1, 4, 11, 12, 15, 20 parental_status: Parent)
​
final_df.loc[(final_df['age'] >= 60) & (final_df['income'] <= 199999) & (final_df['department_id'].isin([1, 4, 11, 12, 15, 20])) & (final_df['parental_status'] == 'Parent'), 'customer_profile'] = 'Over 60'
​

# impute all NaN values as 'Other'

final_df['customer_profile'].fillna('Other', inplace=True)

JavaScript
 
# impute all NaN values as 'Other'
​
final_df['customer_profile'].fillna('Other', inplace=True)
​

This is the result:

    user_id customer_profile
0   1   Other
1   1   Other
2   1   Other
3   1   Other
4   1   Other
5   1   Other
6   1   Other
7   1   Other
8   1   Other
9   1   Other
10  1   Young parent
11  1   Other
12  1   Other
13  1   Other
14  1   Other
15  1   Other
16  1   Other
17  1   Other
18  1   Other
19  1   Other
20  1   Other
21  1   Other
22  1   Other
23  1   Young parent
24  1   Young parent

JavaScript
 
    user_id customer_profile
 1   Other
 1   Other
 1   Other
 1   Other
 1   Other
 1   Other
 1   Other
 1   Other
 1   Other
 1   Other
1   Young parent
1   Other
1   Other
1   Other
1   Other
1   Other
1   Other
1   Other
1   Other
1   Other
1   Other
1   Other
1   Other
1   Young parent
1   Young parent
​

What I actually want is, “if ‘young parent’ (or any profile) is assigned even once to a user_id, then all ‘other’ for that user_id must be changed to ‘young parent’ too” (a customer cannot have 2 profiles!). So, the above results should show ‘young parent’ in each row.

Is this possible? Am I using the wrong function? My knowledge is limited and any advice would be appreciated!

Answer

Mask the Other values in the customer_profile column, then group the column by user_id and transform with first to select the first non-nan value per user_id

m = df['customer_profile'].eq('Other')
df['customer_profile'] = df['customer_profile'].mask(m)
                            .groupby(df['user_id']).transform('first')

JavaScript
 
m = df['customer_profile'].eq('Other')
df['customer_profile'] = df['customer_profile'].mask(m)
                            .groupby(df['user_id']).transform('first')
​

To further simplify this you can skip the final step in your code where you are using fillna to fill the Other values because to use groupby we have to mask this values back to NaN. So fillna is a redundant step.

df['customer_profile'] = df.groupby('user_id')['customer_profile'].transform('first')

JavaScript
 
df['customer_profile'] = df.groupby('user_id')['customer_profile'].transform('first')
​

    user_id customer_profile
0         1     Young parent
1         1     Young parent
2         1     Young parent
3         1     Young parent
4         1     Young parent
5         1     Young parent
6         1     Young parent
7         1     Young parent
...
23        1     Young parent
24        1     Young parent

JavaScript
 
    user_id customer_profile
       1     Young parent
       1     Young parent
       1     Young parent
       1     Young parent
       1     Young parent
       1     Young parent
       1     Young parent
       1     Young parent
...
      1     Young parent
      1     Young parent
​

Newbie – customer profiling in Python (pandas) using loc()

Advertisement

Answer