I have a "Main" dataframe which is being continuously built. Each month I merge new data into that one, however I want to remove rows that exist in both dataframes. Example "Main" Name Date Begin Time End Time Bob 03/10/2022 11:04 14:10 Dirk 05/12/2022 13:15 16:56 Steve 01/13/2022 11:11 13:13 "Other" Name Date Begin Time End Time Rog 03/14/2022 11:44

How can I write a Python function that loops through a list of dataframes and drops duplicate rows?

I have a “Main” dataframe which is being continuously built. Each month I merge new data into that one, however I want to remove rows that exist in both dataframes.

Example “Main”

Name	Date	Begin Time	End Time
Bob	03/10/2022	11:04	14:10
Dirk	05/12/2022	13:15	16:56
Steve	01/13/2022	11:11	13:13

“Other”

Name	Date	Begin Time	End Time
Rog	03/14/2022	11:44	14:30
Dirk	05/12/2022	13:15	16:56
Steve	01/13/2022	11:11	13:13

Expected Result

Name	Date	Begin Time	End Time
Bob	03/10/2022	11:04	14:10
Rog	03/14/2022	11:44	14:30

So far I have been able to accomplish this in a messy way, but there has to be a neater way to do this…

def merger(month, month2):
   a = pd.merge(df1, month, how = 'left', indicator = True)
   a = a[a['_merge'] != 'both'].drop('_merge', axis = 1)
   b = pd.merge(a, month2, how = 'left', indicator = True)
   b = b[b['_merge'] != 'both'].drop('_merge', axis = 1)
   return b

JavaScript
​x
 
def merger(month, month2):
   a = pd.merge(df1, month, how = 'left', indicator = True)
   a = a[a['_merge'] != 'both'].drop('_merge', axis = 1)
   b = pd.merge(a, month2, how = 'left', indicator = True)
   b = b[b['_merge'] != 'both'].drop('_merge', axis = 1)
   return b
​

It’s hideous and I have to keep adding to the function as each month arrives, but I am fairly new to Python and coding in general. Ideally I would like to write a function that loops through a list of dataframes as I can have up to 12 months at a time. It is important that it removes rows where the indicator = ‘both’ before merging the next dataframe.

Any help?

Answer

IIUC, this can be handled easily with df.drop_duplicates: Thanks to @Mustafa Aydın for reminding keep=False

new_df = pd.concat([main, other])
new_df = new_df.drop_duplicates(keep=False)

JavaScript
 
new_df = pd.concat([main, other])
new_df = new_df.drop_duplicates(keep=False)
​

In my opinion, pd.concat suits better here as you can concat as many dataframes as you want:

larger_df = pd.concat(df_list)

JavaScript
 
larger_df = pd.concat(df_list)
​

Advertisement

Answer