How to merge two dataframes and eliminate dupes

Question

I am trying to merge two dataframes together. One has 1.5M rows and one has 15M rows. I was expecting the merged dataframe to have 15M rows, but it actually has 178M rows! I think my merge is doing some kind of Cartesian product, which is not what I want. This is what I tried, and got 178M rows.

Accepted Answer

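Try removing the duplicates from both dataframes before merging them. When the same key value appears several times on both sides, the merge emits every pairing of the matching rows, which is why the row count explodes. Dropping the duplicates first will also greatly reduce memory usage:

    df_1 = df_1.drop_duplicates(subset=['enodeb'], keep='last')
    df_2 = df_2.drop_duplicates(subset=['enodeb'], keep='last')

If the datasets are too large to fit in memory, it may be a good idea to use dask or vaex for out-of-core processing.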
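To see the effect on a small scale, here is a minimal sketch. The toy data, the columns `a` and `b`, and the `on="enodeb"` / `how="inner"` arguments are assumptions for illustration, since the original merge call is not shown; only the `enodeb` key and the `drop_duplicates` calls come from the answer above. The `validate` argument is an optional extra that makes pandas raise an error if the keys are not unique.

```python
import pandas as pd

# Toy stand-ins for the two real dataframes (hypothetical data)
df_1 = pd.DataFrame({"enodeb": [1, 1, 2], "a": [10, 11, 20]})
df_2 = pd.DataFrame({"enodeb": [1, 1, 1, 2], "b": [100, 101, 102, 200]})

# Key 1 appears 2 times on the left and 3 times on the right, so it
# alone contributes 2 * 3 = 6 rows; key 2 contributes 1 * 1 = 1 row.
blown_up = df_1.merge(df_2, on="enodeb", how="inner")
print(len(blown_up))  # 7 rows from inputs of only 3 and 4 rows

# Keep one row per key on each side, as suggested in the answer
df_1_unique = df_1.drop_duplicates(subset=["enodeb"], keep="last")
df_2_unique = df_2.drop_duplicates(subset=["enodeb"], keep="last")

# validate="one_to_one" raises pandas.errors.MergeError if either
# side still contains duplicate keys
merged = df_1_unique.merge(
    df_2_unique, on="enodeb", how="inner", validate="one_to_one"
)
print(len(merged))  # 2 rows, one per distinct key
```

If the larger dataframe is supposed to repeat keys and only the smaller one should be unique, it may be enough to drop duplicates on the smaller side alone and merge with `validate="many_to_one"` instead.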

Answer