Skip to content
Advertisement

Drop duplicated rows based on multiple columns if other column(s) is NaNs in Pandas

Given a test dataset as follows:

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   3   bj       hd       NaN    NaN
3   4   bj       cy       NaN    NaN
4   5   sh       hp      12.0    NaN
5   6   sh       hp       NaN    NaN
6   7   sh       pd       NaN    NaN

I would like to drop duplicated rows based on city and district, then drop rows if its quantity is NaN, but if city and district are not duplicated, then even if quantity is NaN, it’s not necessary to drop rows.

Code based on link from here:

m1 = df['quantity'].notna()
m2 = ~df[['city', 'district']].duplicated()

df1 = df[m1 & m2]
print(df1)

Out:

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN

But I want to keep the last row since it’s not duplicated with any rows. How could I do that?

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   5   sh       hp      12.0    NaN
3   7   sh       pd       NaN    NaN

Advertisement

Answer

According to your logic, it seems like you want to drop columns where:

  1. quantity is NaN => m1 = df['quantity'].isna() AND
  2. 'city', 'district' is duplicated => m2 = df[['city', 'district']].duplicated(keep=False)

And since you would like to take all columns except which meet the above conditions:

>>> m1 = df['quantity'].isna()
>>> m2 = df[['city', 'district']].duplicated(keep=False)
>>> df[~(m1 & m2)]

   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN

And your original code would work with keep=False and | (or) operator.

>>> m1 = df['quantity'].notna()
>>> m2 = ~df[['city', 'district']].duplicated(keep=False)
>>> df[m1 | m2]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN

EDIT

Based on your comments, if the df is:

>>> df
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   3   bj       hd       NaN    NaN
3   4   bj       cy       NaN    NaN
4   8   sh       hp      14.0   15.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN

# First drop duplicates with NaN items, with any of the above methods
>>> df[m1 | m2]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   8   sh       hp      14.0   15.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN

# then drop duplicates with default condition:
>>> df[m1 | m2].drop_duplicates(['city', 'district'])
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   8   sh       hp      14.0   15.0
6   7   sh       pd       NaN    NaN

You can change keep parameter of drop_duplicates to control default behavior, i.e. whether to keep the first duplicate or the last.

User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement