How to merge two dataframes and eliminate dupes

I am trying to merge two dataframes. One has 1.5M rows and the other has 15M rows. I was expecting the merged dataframe to have 15M rows, but it actually has 178M rows! I think my merge is doing some kind of Cartesian product, and this is not what I want.

This is what I tried, and got 178M rows.

df_merged = pd.merge(left=df_nat, right=df_stack, how='inner', left_on='eno', right_on='eno')

I tried the code below and got an out of memory error.

df_merged = pd.merge(df_nat, df_stack, how='inner', on='eno')

I’m guessing there are dupes in these dataframes, and that’s causing the final merge job to blow up. How can I do this so I have a final merged dataframe with 15M rows? Finally, the schemas are different, and only the ‘eno’ field is the same.

Thanks.

Answer

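Remove the duplicates from both dataframes before merging. When a key is duplicated on both sides, every matching pair of rows appears in the result, which is why the row count explodes. Deduplicating first also greatly reduces memory usage: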

df_nat = df_nat.drop_duplicates(subset=['eno'], keep='last')
df_stack = df_stack.drop_duplicates(subset=['eno'], keep='last')
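
As a rough illustration of the deduplication step, here is a minimal sketch on toy data that uses the question's `eno` key. The frames and the `nat_value` / `stack_value` columns are made up for the example. The `validate="one_to_one"` argument is optional; it asks pandas to raise an error if either side still contains duplicate keys.

```python
import pandas as pd

# Toy stand-ins for the question's dataframes; the non-key columns are hypothetical.
df_nat = pd.DataFrame({"eno": [1, 1, 2, 3], "nat_value": ["a", "b", "c", "d"]})
df_stack = pd.DataFrame({"eno": [1, 1, 2, 2, 4], "stack_value": [10, 11, 20, 21, 40]})

# Without deduplication, each key contributes (left count x right count) rows.
raw = pd.merge(df_nat, df_stack, how="inner", on="eno")

# Keep only the last row per 'eno' on each side.
nat_unique = df_nat.drop_duplicates(subset=["eno"], keep="last")
stack_unique = df_stack.drop_duplicates(subset=["eno"], keep="last")

# validate="one_to_one" raises pandas.errors.MergeError if duplicate keys remain.
merged = pd.merge(nat_unique, stack_unique, how="inner", on="eno", validate="one_to_one")

print(len(raw), len(merged))
```

Note that deduplicating both sides yields at most one row per distinct `eno`. If every row of the larger dataframe should survive, one option is to deduplicate only the smaller one and merge with `how="left"` and `validate="many_to_one"`, with the larger dataframe on the left.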

If the datasets are too large to fit in memory, it may be worth using dask or vaex for out-of-core processing.
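
Before reaching for an out-of-core tool, it may help to estimate how many rows the merge would produce without actually running it. The sketch below, on made-up toy data, multiplies the per-key counts from each side. The frame contents and the name `expected_rows` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the question's dataframes; the non-key columns are hypothetical.
df_nat = pd.DataFrame({"eno": [1, 1, 2, 3], "nat_value": ["a", "b", "c", "d"]})
df_stack = pd.DataFrame({"eno": [1, 1, 2, 2, 4], "stack_value": [10, 11, 20, 21, 40]})

# Count how often each key occurs on each side.
left_counts = df_nat["eno"].value_counts()
right_counts = df_stack["eno"].value_counts()

# An inner merge emits left_count * right_count rows per key.
# fill_value=0 makes keys that are missing on one side contribute nothing.
expected_rows = int(left_counts.mul(right_counts, fill_value=0).sum())

print(expected_rows)
```

If the estimate is far above the size of either input, duplicate keys are the likely cause, and deduplicating is probably a better first step than adding memory.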

User contributions licensed under: CC BY-SA