How to remove duplicates from a dataframe based on a column with string values

Question

I am trying to remove duplicates from a dataframe df based on the column item_id. I am using a function to remove the duplicates, but it does not remove them even though rows 0 and 1 have similar item_id values. I have some other cases where this function removes

Accepted Answer

You can apply a function to the column that makes item_id "uniform", then call drop_duplicates():

    import pandas as pd

    df = pd.DataFrame({'date': ['20210325', '20210325'],
                       'code': ['30893', '10030'],
                       'item_id': ['001 002 003 003', '001    002 003 003']})

    # Split on whitespace, sort the tokens, and rejoin them with single spaces
    df['item_id'] = df['item_id'].apply(lambda x: ' '.join(sorted(x.split())).strip())
    df = df.drop_duplicates(subset='item_id', keep="last")
    print(df)

Output:

           date   code          item_id
    1  20210325  10030  001 002 003 003
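For comparison, here is a minimal sketch of why a plain drop_duplicates() call keeps both rows, plus a variant that leaves the original column untouched. It assumes the mismatch is only in whitespace, since the question's own function is not shown. The helper name normalize_item_id is hypothetical, and unlike the answer above it keeps the token order instead of sorting.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['20210325', '20210325'],
    'code': ['30893', '10030'],
    'item_id': ['001 002 003 003', '001    002 003 003'],
})

# drop_duplicates compares strings exactly, so the extra spaces
# in the second row make the two item_id values different.
print(len(df.drop_duplicates(subset='item_id')))  # 2


def normalize_item_id(value: str) -> str:
    """Hypothetical helper: collapse whitespace runs, keep token order."""
    return ' '.join(value.split())


# Build a normalized key without overwriting the item_id column,
# then keep the last row of each group of equal keys.
key = df['item_id'].map(normalize_item_id)
deduped = df[~key.duplicated(keep='last')]
print(deduped)
```

Whether to sort the tokens depends on whether '001 002' and '002 001' should count as the same item_id. Sorting treats them as equal; this sketch does not.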

Answer