Given a test dataset as follows:
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   3   bj       hd       NaN    NaN
3   4   bj       cy       NaN    NaN
4   5   sh       hp      12.0    NaN
5   6   sh       hp       NaN    NaN
6   7   sh       pd       NaN    NaN
I would like to drop duplicated rows based on city and district, and then drop rows whose quantity is NaN; but if city and district are not duplicated, the row should be kept even if quantity is NaN.
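For reference, a minimal sketch that rebuilds the frame above (assuming the values are exactly as printed):

import numpy as np
import pandas as pd

# Reconstruct the test dataset shown above
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'city': ['bj', 'bj', 'bj', 'bj', 'sh', 'sh', 'sh'],
    'district': ['hd', 'cy', 'hd', 'cy', 'hp', 'hp', 'pd'],
    'quantity': [12.0, 23.0, np.nan, np.nan, 12.0, np.nan, np.nan],
    'price': [23.0, 45.0, np.nan, np.nan, np.nan, np.nan, np.nan],
})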
My code, based on the answer linked here:
m1 = df['quantity'].notna()
m2 = ~df[['city', 'district']].duplicated()
df1 = df[m1 & m2]
print(df1)
Out:
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
But I want to keep the last row, since it is not a duplicate of any other row. How could I do that? The desired output is:
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   5   sh       hp      12.0    NaN
3   7   sh       pd       NaN    NaN
Answer
According to your logic, it seems like you want to drop rows where:
quantity is NaN => m1 = df['quantity'].isna()
AND
'city', 'district' is duplicated => m2 = df[['city', 'district']].duplicated(keep=False)
And since you would like to keep all rows except those that meet both of the above conditions:
>>> m1 = df['quantity'].isna()
>>> m2 = df[['city', 'district']].duplicated(keep=False)
>>> df[~(m1 & m2)]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN
Your original code would also work with keep=False and the | (or) operator:
>>> m1 = df['quantity'].notna()
>>> m2 = ~df[['city', 'district']].duplicated(keep=False)
>>> df[m1 | m2]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN
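If you also want the sequential 0-based index from your desired output, resetting the index afterwards should give it:

>>> df[m1 | m2].reset_index(drop=True)
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   5   sh       hp      12.0    NaN
3   7   sh       pd       NaN    NaN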
EDIT
Based on your comments, if the df is:
>>> df
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   3   bj       hd       NaN    NaN
3   4   bj       cy       NaN    NaN
4   8   sh       hp      14.0   15.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN

# First drop duplicates with NaN items, with any of the above methods
>>> df[m1 | m2]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   8   sh       hp      14.0   15.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN

# then drop duplicates with the default condition:
>>> df[m1 | m2].drop_duplicates(['city', 'district'])
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   8   sh       hp      14.0   15.0
6   7   sh       pd       NaN    NaN
You can change the keep parameter of drop_duplicates to control which duplicate is kept, i.e. whether to keep the first occurrence or the last.
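For example, keep='last' would keep the last occurrence of each city/district pair instead of the first (a sketch using the edited df above, assuming m1 and m2 were recomputed on it):

>>> df[m1 | m2].drop_duplicates(['city', 'district'], keep='last')
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN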