Skip to content
Advertisement

Filter non-duplicated records in Python-pandas, based on group-by column and row-level comparison

This is a complicated issue and I am not able to figure this out, and I really appreciate your help in this.

The below dataframe is generated from a pandas function DataFrame.duplicated(), based on ‘Loc'(groupby) and ‘Category’ repeated records are marked as True/False accordingly.

Number  Loc    Category        Date       IsDuplicate     
-----------------------------------------------------
1        A     jetski        01/01/2020       False
2        A     kayak         01/02/2020       False
3        A     jetski,kayak  01/04/2020       False
4        B     jetski        01/05/2020       False
5        B     jetski        01/07/2020       True
6        C     kayak         01/08/2020       False
7        C     kayak         01/09/2020       True
8        C     jetski        01/10/2020       False

My Expectation is to create another column based on ‘Loc'(groupby), ‘Category’ and ‘IsDuplicate’ to represent only those values that are actually repeated. Only False data should be marked as ‘Not Applicable’

Points:

  1. Groupby Location

  2. For any location:

    a. if ‘IsDuplicate’ == True then match ‘category’ col and return only matching rows as True/False

    b. if any other only False record found, return ‘Not Applicable’

  3. For any only False value in the location return ‘Not Applicable’

Expected Output:

Number  Loc    Category       Date       IsDuplicate     Only_Dupes   
---------------------------------------------------------------------
1        A     jetski       01/01/2020    False         Not Applicable       
2        A     kayak        01/02/2020    False         Not Applicable       
3        A     jetski,kayak 01/04/2020    False         Not Applicable
4        B     jetski       01/05/2020    False         False          
5        B     jetski       01/07/2020    True          True           
6        C     kayak        01/08/2020    False         False           
7        C     kayak        01/09/2020    True          True           
8        C     jetski       01/10/2020    False         Not Applicable     

Please let me know if any more clarification is required. And I thank you for all your assistance.

Advertisement

Answer

You can try creating 2 conditions 1 for checking duplicates and another for getting no of appearences of column Category grouped on Loc and Category, then using np.where assign the result of duplicated() where count is greater than 1 , else Not Applicable

c1 = df.duplicated(['Loc','Category'])
c2 = df.groupby(['Loc','Category'])['Category'].transform('count').gt(1)
df['Only_Dupes'] = np.where(c2,c1,'Not Applicable')

Or similar logic but chaining them in transform:

df['Only_Dupes'] = df.groupby(['Loc','Category'])['Category'].transform(lambda x: 
                          np.where(x.count()>1,x.duplicated(),'Not Applicable'))

print(df)

   Number Loc      Category  IsDuplicate      Only_Dupes
0       1   A        jetski        False  Not Applicable
1       2   A         kayak        False  Not Applicable
2       3   A  jetski,kayak        False  Not Applicable
3       4   B        jetski        False           False
4       5   B        jetski         True            True
5       6   C         kayak        False           False
6       7   C         kayak         True            True
7       8   C        jetski        False  Not Applicable
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement