
How can I write a Python function that loops through a list of dataframes and drops duplicate rows?

I have a “Main” dataframe that is continuously built up. Each month I merge new data into it; however, I want to remove rows that exist in both dataframes.

Example “Main”

Name   Date         Begin Time   End Time
Bob    03/10/2022   11:04        14:10
Dirk   05/12/2022   13:15        16:56
Steve  01/13/2022   11:11        13:13

“Other”

Name   Date         Begin Time   End Time
Rog    03/14/2022   11:44        14:30
Dirk   05/12/2022   13:15        16:56
Steve  01/13/2022   11:11        13:13

Expected Result

Name   Date         Begin Time   End Time
Bob    03/10/2022   11:04        14:10
Rog    03/14/2022   11:44        14:30

So far I have been able to accomplish this in a messy way, but there has to be a neater way to do this…

import pandas as pd

def merger(month, month2):
    # Keep only the rows of df1 that do not also appear in `month`
    a = pd.merge(df1, month, how='left', indicator=True)
    a = a[a['_merge'] != 'both'].drop('_merge', axis=1)
    # Repeat the same anti-join against the next month's dataframe
    b = pd.merge(a, month2, how='left', indicator=True)
    b = b[b['_merge'] != 'both'].drop('_merge', axis=1)
    return b

It’s hideous, and I have to keep extending the function as each month arrives, but I am fairly new to Python and coding in general. Ideally I would like to write a function that loops through a list of dataframes, since I can have up to 12 months at a time. It is important that rows where the indicator is ‘both’ are removed before the next dataframe is merged.
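For reference, the merge-with-indicator approach from the question can be generalized into a loop. This is a sketch, assuming all monthly frames share the same columns as the main frame; note that it swaps the question's how='left' for how='outer', since an outer merge also keeps rows unique to each monthly frame (which the expected result above includes), and it drops the 'both' rows before the next merge, as required:

```python
import pandas as pd

def merger(main, monthly_dfs):
    """Successively anti-join each monthly dataframe against the running result.

    After each merge, rows present in both frames (indicator == 'both')
    are dropped before the next dataframe is processed.
    """
    result = main
    for month in monthly_dfs:
        merged = pd.merge(result, month, how='outer', indicator=True)
        result = merged[merged['_merge'] != 'both'].drop('_merge', axis=1)
    return result
```

This accepts any number of monthly dataframes in one list, so nothing needs to be added to the function as new months arrive.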

Any help?


Answer

IIUC, this can be handled easily with df.drop_duplicates (thanks to @Mustafa Aydın for pointing out keep=False):

new_df = pd.concat([main, other])
new_df = new_df.drop_duplicates(keep=False)

In my opinion, pd.concat suits better here, as you can concat as many dataframes as you want:

larger_df = pd.concat(df_list)
User contributions licensed under: CC BY-SA