I am trying to remove duplicates based on the column item_id
from a dataframe df
.
df
:
date code item_id 0 20210325 30893 001 002 003 003 1 20210325 10030 001 002 003 003
In this df
the item_id
is as follows:
These are all item_ids separated by one or more spaces. 0 -> "001 002 003 003" #here there is an extra space after 001, rest is same. 1 -> "001 002 003 003"
I am using the following function to remove the duplicates.
def create_data_file_removed_duplicate_item(packing_data): print('start removing duplicated item data') print('data count before removing duplication: ' + str(len(packing_data))) # check null packing_data = packing_data[~packing_data['item_id'].isnull()] # sorting item id packing_data['item_id_list'] = packing_data['item_id'].str.split(' ').apply(sorted) .apply(lambda item_id_list: ''.join([item_id.replace(' ', '') + ' ' for item_id in item_id_list])) # drop duplicate item_id packing_data.drop_duplicates(keep='last', inplace=True, subset=['item_id_list']) packing_data = packing_data.drop(columns=['item_id_list']) # create non duplicate item data file print('data count after removing duplication: ' + str(len(packing_data))) return packing_data
I am unable to remove the duplicates although there rows 0
and 1
have similar item_id
.
I have some other cases where this function removes duplicates where the item_id
is as follows:
0 -> "001 002 003 003". # there is no space after 001. These are all item_ids separated by one or more spaces. 1 -> "001 002 003 003"
Expected output
:
date code item_id 0 20210325 10030 001 002 003 003
Is there a way where I can remove the duplicates even if the item_id
is separated by multiple spaces?
Advertisement
Answer
You can apply a function to the column that will make the item_id “uniform”, then can drop_duplicates()
import pandas as pd df = pd.DataFrame({'date':['20210325','20210325'], 'code':['30893','10030'], 'item_id':['001 002 003 003','001 002 003 003']}) df['item_id'] = df['item_id'].apply(lambda x: ' '.join(sorted(x.split())).strip()) df = df.drop_duplicates(subset='item_id', keep="last")
Output:
print(df) date code item_id 1 20210325 10030 001 002 003 003