This is a complicated issue and I am not able to figure this out, and I really appreciate your help in this.
The below dataframe is generated from a pandas function DataFrame.duplicated()
, based on ‘Loc'(groupby) and ‘Category’ repeated records are marked as True/False accordingly.
Number Loc Category Date IsDuplicate ----------------------------------------------------- 1 A jetski 01/01/2020 False 2 A kayak 01/02/2020 False 3 A jetski,kayak 01/04/2020 False 4 B jetski 01/05/2020 False 5 B jetski 01/07/2020 True 6 C kayak 01/08/2020 False 7 C kayak 01/09/2020 True 8 C jetski 01/10/2020 False
My Expectation is to create another column based on ‘Loc'(groupby), ‘Category’ and ‘IsDuplicate’ to represent only those values that are actually repeated. Only False data should be marked as ‘Not Applicable’
Points:
Groupby Location
For any location:
a. if ‘IsDuplicate’ == True then match ‘category’ col and return only matching rows as True/False
b. if any other only False record found, return ‘Not Applicable’
For any only False value in the location return ‘Not Applicable’
Expected Output:
Number Loc Category Date IsDuplicate Only_Dupes --------------------------------------------------------------------- 1 A jetski 01/01/2020 False Not Applicable 2 A kayak 01/02/2020 False Not Applicable 3 A jetski,kayak 01/04/2020 False Not Applicable 4 B jetski 01/05/2020 False False 5 B jetski 01/07/2020 True True 6 C kayak 01/08/2020 False False 7 C kayak 01/09/2020 True True 8 C jetski 01/10/2020 False Not Applicable
Please let me know if any more clarification is required. And I thank you for all your assistance.
Advertisement
Answer
You can try creating 2 conditions 1 for checking duplicates and another for getting no of appearences of column Category grouped on Loc and Category, then using np.where
assign the result of duplicated()
where count is greater than 1 , else Not Applicable
c1 = df.duplicated(['Loc','Category']) c2 = df.groupby(['Loc','Category'])['Category'].transform('count').gt(1) df['Only_Dupes'] = np.where(c2,c1,'Not Applicable')
Or similar logic but chaining them in transform:
df['Only_Dupes'] = df.groupby(['Loc','Category'])['Category'].transform(lambda x: np.where(x.count()>1,x.duplicated(),'Not Applicable'))
print(df) Number Loc Category IsDuplicate Only_Dupes 0 1 A jetski False Not Applicable 1 2 A kayak False Not Applicable 2 3 A jetski,kayak False Not Applicable 3 4 B jetski False False 4 5 B jetski True True 5 6 C kayak False False 6 7 C kayak True True 7 8 C jetski False Not Applicable