I am trying to remove duplicates based on the column item_id from a dataframe df.
df :
date code item_id 0 20210325 30893 001 002 003 003 1 20210325 10030 001 002 003 003
In this df the item_id is as follows:
These are all item_ids separated by one or more spaces.
0 -> "001 002 003 003" #here there is an extra space after 001, rest is same.
1 -> "001 002 003 003"
I am using the following function to remove the duplicates.
def create_data_file_removed_duplicate_item(packing_data):
print('start removing duplicated item data')
print('data count before removing duplication: ' + str(len(packing_data)))
# check null
packing_data = packing_data[~packing_data['item_id'].isnull()]
# sorting item id
packing_data['item_id_list'] = packing_data['item_id'].str.split(' ').apply(sorted)
.apply(lambda item_id_list: ''.join([item_id.replace(' ', '') + ' ' for item_id in item_id_list]))
# drop duplicate item_id
packing_data.drop_duplicates(keep='last', inplace=True, subset=['item_id_list'])
packing_data = packing_data.drop(columns=['item_id_list'])
# create non duplicate item data file
print('data count after removing duplication: ' + str(len(packing_data)))
return packing_data
I am unable to remove the duplicates although there rows 0 and 1 have similar item_id.
I have some other cases where this function removes duplicates where the item_id is as follows:
0 -> "001 002 003 003". # there is no space after 001. These are all item_ids separated by one or more spaces. 1 -> "001 002 003 003"
Expected output:
date code item_id 0 20210325 10030 001 002 003 003
Is there a way where I can remove the duplicates even if the item_id is separated by multiple spaces?
Advertisement
Answer
You can apply a function to the column that will make the item_id “uniform”, then can drop_duplicates()
import pandas as pd
df = pd.DataFrame({'date':['20210325','20210325'],
'code':['30893','10030'],
'item_id':['001 002 003 003','001 002 003 003']})
df['item_id'] = df['item_id'].apply(lambda x: ' '.join(sorted(x.split())).strip())
df = df.drop_duplicates(subset='item_id', keep="last")
Output:
print(df)
date code item_id
1 20210325 10030 001 002 003 003