
Filter non-duplicated records in Python-pandas, based on group-by column and row-level comparison

This is a complicated issue that I have not been able to figure out, and I would really appreciate your help with it.

The dataframe below was generated with the pandas function DataFrame.duplicated(): records that repeat on 'Loc' (the group-by column) and 'Category' are marked True/False accordingly.

JavaScript

My expectation is to create another column, based on 'Loc' (the group-by column), 'Category' and 'IsDuplicate', that shows True/False only for values that are actually repeated. Rows whose only value is False should be marked as 'Not Applicable'.

Points:

  1. Group by location ('Loc')

  2. For any location:

    a. if 'IsDuplicate' == True, match on the 'Category' column and return True/False only for the matching rows

    b. if any other record is found that is only False (no matching duplicate), return 'Not Applicable'

  3. For any value in the location that is only False, return 'Not Applicable'

Expected Output:

JavaScript

Please let me know if any more clarification is required. And I thank you for all your assistance.


Answer

You can try creating two conditions: one that checks for duplicates, and another that counts the number of appearances of the Category column grouped on Loc and Category. Then, using np.where, assign the result of duplicated() where the count is greater than 1, and 'Not Applicable' otherwise.
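
A minimal sketch of this two-condition idea follows. The sample frame, the `Result` column name and the helper names `keys` and `appears_more_than_once` are assumptions made for illustration; they are not taken from the original post. The duplicate flags are cast to `object` before `np.where` so that booleans and the string can share one column without being converted to text.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: the frame from the question is not available.
df = pd.DataFrame({
    "Loc": ["A", "A", "A", "B", "B", "B"],
    "Category": ["x", "x", "y", "p", "q", "q"],
})
keys = ["Loc", "Category"]

# Condition 1: flag repeats of a (Loc, Category) pair after its first occurrence.
df["IsDuplicate"] = df.duplicated(keys)

# Condition 2: True where the (Loc, Category) pair occurs more than once.
appears_more_than_once = df.groupby(keys)["Category"].transform("count").gt(1)

# Keep the duplicate flag for repeated pairs, otherwise 'Not Applicable'.
df["Result"] = np.where(
    appears_more_than_once,
    df["IsDuplicate"].astype(object),
    "Not Applicable",
)
print(df)
```

On this sample, the two A/x rows and the two B/q rows keep their False/True flags, while the single A/y and B/p rows become 'Not Applicable'.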

JavaScript

Or use similar logic, chaining the two checks inside transform:
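
One way the transform variant might look is sketched below, under the same assumptions as before: the sample data, the `Result` column name and the helper `flag_group` are hypothetical. Each (Loc, Category) group is handled in one pass, so no separate count column is needed.

```python
import pandas as pd

# Hypothetical sample data: the frame from the question is not available.
df = pd.DataFrame({
    "Loc": ["A", "A", "A", "B", "B", "B"],
    "Category": ["x", "x", "y", "p", "q", "q"],
})

def flag_group(categories: pd.Series) -> pd.Series:
    """Return duplicate flags for a repeated group, else 'Not Applicable'."""
    if len(categories) > 1:
        # All values in the group are equal, so every row after the first is True.
        return categories.duplicated().astype(object)
    return pd.Series("Not Applicable", index=categories.index, dtype=object)

# transform keeps the original row order and index of df.
df["Result"] = df.groupby(["Loc", "Category"])["Category"].transform(flag_group)
print(df)
```

The explicit function is slower than the vectorised np.where version on large frames, since it runs once per group, but it keeps both checks in a single step.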

JavaScript

User contributions licensed under: CC BY-SA