Given a test dataset as follows:
id city district quantity price 0 1 bj hd 12.0 23.0 1 2 bj cy 23.0 45.0 2 3 bj hd NaN NaN 3 4 bj cy NaN NaN 4 5 sh hp 12.0 NaN 5 6 sh hp NaN NaN 6 7 sh pd NaN NaN
I would like to drop duplicated rows based on city
and district
, then drop rows if its quantity
is NaN
, but if city
and district
are not duplicated, then even if quantity
is NaN
, it’s not necessary to drop rows.
Code based on link from here:
m1 = df['quantity'].notna() m2 = ~df[['city', 'district']].duplicated() df1 = df[m1 & m2] print(df1)
Out:
id city district quantity price 0 1 bj hd 12.0 23.0 1 2 bj cy 23.0 45.0 4 5 sh hp 12.0 NaN
But I want to keep the last row since it’s not duplicated with any rows. How could I do that?
id city district quantity price 0 1 bj hd 12.0 23.0 1 2 bj cy 23.0 45.0 2 5 sh hp 12.0 NaN 3 7 sh pd NaN NaN
Advertisement
Answer
According to your logic, it seems like you want to drop columns where:
quantity
isNaN
=>m1 = df['quantity'].isna()
AND'city', 'district'
isduplicated
=>m2 = df[['city', 'district']].duplicated(keep=False)
And since you would like to take all columns except which meet the above conditions:
>>> m1 = df['quantity'].isna() >>> m2 = df[['city', 'district']].duplicated(keep=False) >>> df[~(m1 & m2)] id city district quantity price 0 1 bj hd 12.0 23.0 1 2 bj cy 23.0 45.0 4 5 sh hp 12.0 NaN 6 7 sh pd NaN NaN
And your original code would work with keep=False
and |
(or) operator.
>>> m1 = df['quantity'].notna() >>> m2 = ~df[['city', 'district']].duplicated(keep=False) >>> df[m1 | m2] id city district quantity price 0 1 bj hd 12.0 23.0 1 2 bj cy 23.0 45.0 4 5 sh hp 12.0 NaN 6 7 sh pd NaN NaN
EDIT
Based on your comments, if the df is:
>>> df id city district quantity price 0 1 bj hd 12.0 23.0 1 2 bj cy 23.0 45.0 2 3 bj hd NaN NaN 3 4 bj cy NaN NaN 4 8 sh hp 14.0 15.0 5 8 sh hp 14.0 16.0 6 7 sh pd NaN NaN # First drop duplicates with NaN items, with any of the above methods >>> df[m1 | m2] id city district quantity price 0 1 bj hd 12.0 23.0 1 2 bj cy 23.0 45.0 4 8 sh hp 14.0 15.0 5 8 sh hp 14.0 16.0 6 7 sh pd NaN NaN # then drop duplicates with default condition: >>> df[m1 | m2].drop_duplicates(['city', 'district']) id city district quantity price 0 1 bj hd 12.0 23.0 1 2 bj cy 23.0 45.0 4 8 sh hp 14.0 15.0 6 7 sh pd NaN NaN
You can change keep
parameter of drop_duplicates
to control default behavior, i.e. whether to keep the first
duplicate or the last
.