Given a test dataset as follows:
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   3   bj       hd       NaN    NaN
3   4   bj       cy       NaN    NaN
4   5   sh       hp      12.0    NaN
5   6   sh       hp       NaN    NaN
6   7   sh       pd       NaN    NaN
I would like to drop duplicated rows based on city and district, and then drop rows whose quantity is NaN; but if city and district are not duplicated, the row should be kept even if quantity is NaN.
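For reference, a minimal sketch that rebuilds the frame above (assuming the values are exactly as printed):

import numpy as np
import pandas as pd

# Reconstruct the test dataset shown above
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7],
    'city': ['bj', 'bj', 'bj', 'bj', 'sh', 'sh', 'sh'],
    'district': ['hd', 'cy', 'hd', 'cy', 'hp', 'hp', 'pd'],
    'quantity': [12.0, 23.0, np.nan, np.nan, 12.0, np.nan, np.nan],
    'price': [23.0, 45.0, np.nan, np.nan, np.nan, np.nan, np.nan],
})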
My code, based on the answer linked here:
m1 = df['quantity'].notna()
m2 = ~df[['city', 'district']].duplicated()
df1 = df[m1 & m2]
print(df1)
Out:
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
But I want to keep the last row, since it is not a duplicate of any other row. How could I do that? The desired output is:
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   5   sh       hp      12.0    NaN
3   7   sh       pd       NaN    NaN
Answer
According to your logic, it seems like you want to drop rows where:
quantity is NaN => m1 = df['quantity'].isna()
AND
'city', 'district' is duplicated => m2 = df[['city', 'district']].duplicated(keep=False)
And since you would like to keep all rows except those that meet both of the above conditions:
>>> m1 = df['quantity'].isna()
>>> m2 = df[['city', 'district']].duplicated(keep=False)
>>> df[~(m1 & m2)]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN
Your original code would also work with keep=False and the | (or) operator:
>>> m1 = df['quantity'].notna()
>>> m2 = ~df[['city', 'district']].duplicated(keep=False)
>>> df[m1 | m2]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   5   sh       hp      12.0    NaN
6   7   sh       pd       NaN    NaN
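If you also want the sequential 0-based index from your desired output, resetting the index afterwards should give it:

>>> df[m1 | m2].reset_index(drop=True)
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   5   sh       hp      12.0    NaN
3   7   sh       pd       NaN    NaN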
EDIT
Based on your comments, if the df is:
>>> df
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
2   3   bj       hd       NaN    NaN
3   4   bj       cy       NaN    NaN
4   8   sh       hp      14.0   15.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN

# First drop duplicates with NaN items, with any of the above methods
>>> df[m1 | m2]
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   8   sh       hp      14.0   15.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN

# then drop duplicates with the default condition:
>>> df[m1 | m2].drop_duplicates(['city', 'district'])
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
4   8   sh       hp      14.0   15.0
6   7   sh       pd       NaN    NaN
You can change the keep parameter of drop_duplicates to control which duplicate is kept, i.e. whether to keep the first occurrence or the last.
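For example, keep='last' would keep the last occurrence of each city/district pair instead of the first (a sketch using the edited df above, assuming m1 and m2 were recomputed on it):

>>> df[m1 | m2].drop_duplicates(['city', 'district'], keep='last')
   id city district  quantity  price
0   1   bj       hd      12.0   23.0
1   2   bj       cy      23.0   45.0
5   8   sh       hp      14.0   16.0
6   7   sh       pd       NaN    NaN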