Filter non-duplicated records in Python-pandas, based on group-by column and row-level comparison

Question

This is a complicated issue that I have not been able to figure out, so I would really appreciate your help. The dataframe below is generated with the pandas function DataFrame.duplicated(): based on 'Loc' (group-by) and 'Category', repeated records are marked as True/False accordingly. My Expect…

Accepted Answer

You can try creating two conditions: one that checks for duplicates, and another that counts the number of appearances of Category when grouped on Loc and Category. Then use np.where to assign the result of duplicated() where the count is greater than 1, and 'Not Applicable' otherwise:

    import numpy as np

    # c1: True for every repeat of a (Loc, Category) pair after its first occurrence
    c1 = df.duplicated(['Loc','Category'])
    # c2: True where the (Loc, Category) pair occurs more than once
    c2 = df.groupby(['Loc','Category'])['Category'].transform('count').gt(1)
    df['Only_Dupes'] = np.where(c2,c1,'Not Applicable')

Or use similar logic, chained inside transform:

    df['Only_Dupes'] = df.groupby(['Loc','Category'])['Category'].transform(lambda x:
                               np.where(x.count()>1,x.duplicated(),'Not Applicable'))

    print(df)

       Number Loc      Category  IsDuplicate      Only_Dupes
    0       1   A        jetski        False  Not Applicable
    1       2   A         kayak        False  Not Applicable
    2       3   A  jetski,kayak        False  Not Applicable
    3       4   B        jetski        False           False
    4       5   B        jetski         True            True
    5       6   C         kayak        False           False
    6       7   C         kayak         True            True
    7       8   C        jetski        False  Not Applicable
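For anyone who wants to run this end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the printed output above, so its exact construction is an assumption. It also swaps np.where for Series.where on an object-typed copy of the flags. This is a variation, not the code from the answer: np.where returns a single-dtype array, so mixing booleans with a string may turn the flags into text, whereas this version keeps True/False as real booleans.

```python
import pandas as pd

# Sample data reconstructed from the printed output above (an assumption,
# not the asker's original frame).
df = pd.DataFrame({
    "Number": [1, 2, 3, 4, 5, 6, 7, 8],
    "Loc": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "Category": ["jetski", "kayak", "jetski,kayak", "jetski",
                 "jetski", "kayak", "kayak", "jetski"],
})

keys = ["Loc", "Category"]

# The column the question starts from: repeats of a (Loc, Category) pair.
df["IsDuplicate"] = df.duplicated(keys)

# c1: duplicate flag; c2: True where the pair occurs more than once.
c1 = df.duplicated(keys)
c2 = df.groupby(keys)["Category"].transform("count").gt(1)

# Keep the flag where the pair repeats, otherwise use the placeholder.
# Casting to object first lets booleans and the string share one column.
df["Only_Dupes"] = c1.astype(object).where(c2, "Not Applicable")

print(df)
```

If you later need to filter on the flag, something like df[df["Only_Dupes"].eq(True)] should select only the repeated rows, since the placeholder string never compares equal to True.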

Answer