
Efficient way to merge large Pandas dataframes between two dates

I know there are many questions like this one, but I can't seem to find a relevant answer. Let's say I have two dataframes as follows:


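The original code blocks did not survive in this copy, so here is a hypothetical reconstruction of the setup; the column names (`id`, `start`, `end`, `timestamp`, `value`) and the sample values are assumed, not taken from the original:

```python
import pandas as pd

# Hypothetical reconstruction (column names and values assumed):
# df1 holds one [start, end] interval per id,
# df2 holds timestamped values for the same ids.
df1 = pd.DataFrame({
    "id": [1, 2],
    "start": pd.to_datetime(["2021-01-01", "2021-01-05"]),
    "end": pd.to_datetime(["2021-01-04", "2021-01-08"]),
})
df2 = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2021-01-02", "2021-01-06", "2021-01-06", "2021-01-10"]
    ),
    "value": [10, 20, 30, 40],
})
print(df1)
print(df2)
```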

The classic way to merge so that each timestamp in df2 falls between start and end in df1 is to merge on id (or on a dummy key) and then filter:


This gives the output I want.
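A minimal sketch of that merge-and-filter step, using hypothetical frames with assumed column names (`id`, `start`, `end`, `timestamp`, `value`):

```python
import pandas as pd

# Assumed toy frames; the original example was not preserved.
df1 = pd.DataFrame({
    "id": [1, 2],
    "start": pd.to_datetime(["2021-01-01", "2021-01-05"]),
    "end": pd.to_datetime(["2021-01-04", "2021-01-08"]),
})
df2 = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2021-01-02", "2021-01-06", "2021-01-06", "2021-01-10"]
    ),
    "value": [10, 20, 30, 40],
})

# Merge on id (cartesian product per id), then keep only the rows
# whose timestamp lies inside the [start, end] interval.
merged = df1.merge(df2, on="id", how="inner")
result = merged[merged["timestamp"].between(merged["start"], merged["end"])]
print(result)
```

The intermediate `merged` frame is the problem at scale: its size is the sum of the per-id row products, which is exactly what blows up memory on large inputs.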

My question: I need to do the same merge and get the same result, but df1 has 200K rows and df2 has 600K.

What I have tried so far:

  • The classic merge-and-filter approach above fails because the intermediate merge creates a huge dataframe that overloads memory.

  • I also tried the pandasql approach, which left my 16GB-RAM PC stuck.

  • I tried merge_asof in 3 steps (left join, right join, and outer join) as explained here, but in my tests it always returned at most 2 records from df2 for a single row in df1.

Any good advice will be appreciated!


Answer

I’ve been working with niv-dudovitch and david-arenburg on this one, and here are our findings, which I hope will be helpful to some of you out there. The core idea was to prevent growing objects in memory by creating a list of dataframes based on subsets of the data and concatenating them once at the end.

First version, without multiprocessing:

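The answer's code was not preserved in this copy, so here is a hedged sketch of the chunked idea under the same assumed column names: merge one slice of df1 at a time so the intermediate product stays small, keep only the filtered matches of each slice in a list, and concatenate once at the end.

```python
import pandas as pd

def merge_between(df1, df2, chunk_size=10_000):
    # Process df1 in slices: each slice's merge with df2 produces a
    # small intermediate frame, and only the filtered matches are
    # accumulated, so memory never holds the full cartesian product.
    pieces = []
    for i in range(0, len(df1), chunk_size):
        chunk = df1.iloc[i:i + chunk_size]
        m = chunk.merge(df2, on="id", how="inner")
        pieces.append(m[m["timestamp"].between(m["start"], m["end"])])
    return pd.concat(pieces, ignore_index=True)

# Assumed toy data for demonstration.
df1 = pd.DataFrame({
    "id": [1, 2],
    "start": pd.to_datetime(["2021-01-01", "2021-01-05"]),
    "end": pd.to_datetime(["2021-01-04", "2021-01-08"]),
})
df2 = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2021-01-02", "2021-01-06", "2021-01-06", "2021-01-10"]
    ),
    "value": [10, 20, 30, 40],
})
result = merge_between(df1, df2, chunk_size=1)
print(result)
```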

Using Multi-Process

In our real case we have two large dataframes: df2 has about 3 million rows and df1 slightly above 110K. The output is about 20 million rows.


The results are as expected.

As a benchmark, with an output of 20 million rows the multi-process approach is about 10x faster.

User contributions licensed under: CC BY-SA