I know there are many questions like this one, but I can’t seem to find the relevant answer. Let’s say I have two data frames as follows:
import pandas as pd

df1 = pd.DataFrame(
    {
        "end": [
            "2019-08-31",
            "2019-08-28",
            "2019-09-09",
            "2019-09-08",
            "2019-09-14",
            "2019-09-14",
        ],
        "start": [
            "2019-08-27",
            "2019-08-22",
            "2019-08-04",
            "2019-09-02",
            "2019-09-06",
            "2019-09-10",
        ],
        "id": [1234, 8679, 8679, 1234, 1234, 8679],
    }
)
df2 = pd.DataFrame(
    {
        "timestamp": [
            "2019-08-30 10:00",
            "2019-08-28 10:00",
            "2019-08-27 10:30",
            "2019-08-07 12:00",
            "2019-09-12 10:00",
            "2019-09-11 14:00",
            "2019-08-29 18:00",
        ],
        "id": [1234, 1234, 8679, 1234, 8679, 8679, 1234],
        "val": ["AAAB", "ABBA", "CXXC", "BBAA", "XCXC", "CCXX", "BAAB"],
    }
)
df1["end"] = pd.to_datetime(df1["end"])
df1["start"] = pd.to_datetime(df1["start"])
df2["timestamp"] = pd.to_datetime(df2["timestamp"])
df1.sort_values(by=["end"], inplace=True)
df2.sort_values(by="timestamp", inplace=True)
which results in (note both frames are now sorted):
         end      start    id
1 2019-08-28 2019-08-22  8679
0 2019-08-31 2019-08-27  1234
3 2019-09-08 2019-09-02  1234
2 2019-09-09 2019-08-04  8679
4 2019-09-14 2019-09-06  1234
5 2019-09-14 2019-09-10  8679
            timestamp    id   val
3 2019-08-07 12:00:00  1234  BBAA
2 2019-08-27 10:30:00  8679  CXXC
1 2019-08-28 10:00:00  1234  ABBA
6 2019-08-29 18:00:00  1234  BAAB
0 2019-08-30 10:00:00  1234  AAAB
5 2019-09-11 14:00:00  8679  CCXX
4 2019-09-12 10:00:00  8679  XCXC
The classic way to merge so that each timestamp falls between start and end for the matching id in df1 is to merge on id (or on a dummy variable) and then filter:
merged_df = pd.merge(df1, df2, how="left", on="id")
merged_df = merged_df.loc[
    (merged_df["timestamp"] >= merged_df["start"])
    & (merged_df["timestamp"] <= merged_df["end"])
]
which gives the output I want:
          end      start    id           timestamp   val
0  2019-08-28 2019-08-22  8679 2019-08-27 10:30:00  CXXC
4  2019-08-31 2019-08-27  1234 2019-08-28 10:00:00  ABBA
5  2019-08-31 2019-08-27  1234 2019-08-29 18:00:00  BAAB
6  2019-08-31 2019-08-27  1234 2019-08-30 10:00:00  AAAB
11 2019-09-09 2019-08-04  8679 2019-08-27 10:30:00  CXXC
19 2019-09-14 2019-09-10  8679 2019-09-11 14:00:00  CCXX
20 2019-09-14 2019-09-10  8679 2019-09-12 10:00:00  XCXC
My question: I need to do the same merge and get the same results, but df1 is 200K rows and df2 is 600K rows.
What I have tried so far:
The classic way of merge and filter, as above, will fail because the initial merge will create a huge data frame that will overload the memory.
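To see why, the size of that intermediate frame can be estimated up front: a merge on id produces, for each id, the product of its row counts in the two frames. A quick sketch using the toy ids from above:

```python
import pandas as pd

# id columns of the toy frames from the question
df1_ids = pd.Series([1234, 8679, 8679, 1234, 1234, 8679])
df2_ids = pd.Series([1234, 1234, 8679, 1234, 8679, 8679, 1234])

# rows produced by merging on "id" = sum over ids of count_in_df1 * count_in_df2
est = df1_ids.value_counts().mul(df2_ids.value_counts(), fill_value=0).sum()
print(int(est))  # 21 intermediate rows for just 6 + 7 input rows
```

With 200K and 600K rows spread over relatively few ids, this product grows very quickly, which is what exhausts the memory before the filter ever runs.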
I also tried the pandasql approach, which ended with my 16GB RAM PC getting stuck. I tried merge_asof in 3 steps of left join, right join, and outer join as explained here, but I ran some tests and it seems to always return at most 2 records from df2 for a single line in df1.
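For what it's worth, the limitation can be reproduced in isolation: merge_asof pairs each row with at most one nearest match, so when two intervals of the same id overlap, one of them silently loses the record. A minimal sketch (the direction and join columns here are my choices, not from the question):

```python
import pandas as pd

# two OVERLAPPING intervals for the same id, taken from the question's data
df1 = pd.DataFrame(
    {
        "end": pd.to_datetime(["2019-08-28", "2019-09-09"]),
        "start": pd.to_datetime(["2019-08-22", "2019-08-04"]),
        "id": [8679, 8679],
    }
)
# one timestamp that falls inside BOTH intervals
df2 = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2019-08-27 10:30"]),
        "id": [8679],
        "val": ["CXXC"],
    }
)

# merge_asof matches each left row to at most ONE right row: the nearest
# "start" at or before the timestamp. The 2019-08-04..2019-09-09 interval
# should also match, but never does.
out = pd.merge_asof(
    df2.sort_values("timestamp"),
    df1.sort_values("start"),
    left_on="timestamp",
    right_on="start",
    by="id",
)
print(out[["timestamp", "start", "end", "val"]])  # a single row
```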
Any good advice will be appreciated!
Answer
I’ve been working with niv-dudovitch and david-arenburg on this one, and here are our findings, which I hope will be helpful to some of you out there. The core idea was to prevent growing objects in memory by building a list of small dataframes, one per subset of the data, instead of one huge merged frame.
First version, without multiprocessing:
import pandas as pd

unk = df1.id.unique()
j = [None] * len(unk)
k = 0
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
for i in unk:
    # join only the rows of the current id, then filter to the date range
    tmp = df1.loc[df1.index.isin([i])].join(df2.loc[df2.index.isin([i])], how='left')
    j[k] = tmp.loc[tmp['timestamp'].between(tmp['start'], tmp['end'])]
    k += 1
res = pd.concat(j)
res
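Run against the toy frames from the question, this loop reproduces the seven rows that the classic merge-and-filter produces. A self-contained version for checking (the list-append here is equivalent to the preallocated list above):

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "end": pd.to_datetime(["2019-08-31", "2019-08-28", "2019-09-09",
                               "2019-09-08", "2019-09-14", "2019-09-14"]),
        "start": pd.to_datetime(["2019-08-27", "2019-08-22", "2019-08-04",
                                 "2019-09-02", "2019-09-06", "2019-09-10"]),
        "id": [1234, 8679, 8679, 1234, 1234, 8679],
    }
)
df2 = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2019-08-30 10:00", "2019-08-28 10:00",
                                     "2019-08-27 10:30", "2019-08-07 12:00",
                                     "2019-09-12 10:00", "2019-09-11 14:00",
                                     "2019-08-29 18:00"]),
        "id": [1234, 1234, 8679, 1234, 8679, 8679, 1234],
        "val": ["AAAB", "ABBA", "CXXC", "BBAA", "XCXC", "CCXX", "BAAB"],
    }
)

unk = df1["id"].unique()
df1 = df1.set_index("id")
df2 = df2.set_index("id")

parts = []
for i in unk:
    # per-id join: only this id's rows are ever materialized at once
    tmp = df1.loc[df1.index.isin([i])].join(df2.loc[df2.index.isin([i])], how="left")
    parts.append(tmp.loc[tmp["timestamp"].between(tmp["start"], tmp["end"])])

res = pd.concat(parts)
print(len(res))  # 7, same as the classic merge-and-filter
```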
Using multiprocessing
In our real case we have two large data frames: df2 is about 3 million rows and df1 is slightly above 110K rows, and the output is about 20 million rows.
import concurrent.futures
import itertools
from itertools import repeat

import pandas as pd


def get_val_between(ids, df1, df2):
    """
    Locate all values between 2 dates by id.

    Args:
        ids (list): list of ids (the last chunk may be padded with None)
        df1 (pd.DataFrame): intervals frame, indexed by id
        df2 (pd.DataFrame): values frame, indexed by id

    Returns:
        Concatenated dataframe of matches, or None if there were none.
    """
    j = [None] * len(ids)
    k = 0
    for i in ids:
        tmp = df1.loc[df1.index.isin([i])].join(
            df2.loc[df2.index.isin([i])], how="left"
        )
        tmp = tmp.loc[tmp["timestamp"].between(tmp["start"], tmp["end"])]
        # add to list in location k
        j[k] = tmp
        k += 1
    # keep only the non-None dfs in j
    j = [i for i in j if i is not None]
    if len(j) > 0:
        return pd.concat(j)
    else:
        return None


def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)


def main():
    df1.reset_index(inplace=True, drop=True)
    df2.reset_index(inplace=True, drop=True)
    id_lst = df1.id.unique()
    iter_ids = grouper(10, list(id_lst))
    df1.set_index("id", inplace=True)
    df2.set_index("id", inplace=True)
    # spread the id chunks across 20 worker processes; executor.map yields
    # the results in order, and pd.concat silently drops any None results
    with concurrent.futures.ProcessPoolExecutor(20) as executor:
        results = executor.map(get_val_between, iter_ids, repeat(df1), repeat(df2))
        result_concat = pd.concat(results)
    print(result_concat)


if __name__ == "__main__":
    main()
The results are as expected:
end start timestamp val
id
8679 2019-08-28 2019-08-22 2019-08-27 10:30:00 CXXC
8679 2019-09-09 2019-08-04 2019-08-27 10:30:00 CXXC
8679 2019-09-14 2019-09-10 2019-09-11 14:00:00 CCXX
8679 2019-09-14 2019-09-10 2019-09-12 10:00:00 XCXC
1234 2019-08-31 2019-08-27 2019-08-28 10:00:00 ABBA
1234 2019-08-31 2019-08-27 2019-08-29 18:00:00 BAAB
1234 2019-08-31 2019-08-27 2019-08-30 10:00:00 AAAB
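As an aside, the grouper helper above just chunks the id list into fixed-size tuples, padding the last chunk with None (which get_val_between then matches to nothing). A standalone check:

```python
import itertools

def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

chunks = list(grouper(3, [1, 2, 3, 4, 5, 6, 7]))
print(chunks)  # [(1, 2, 3), (4, 5, 6), (7, None, None)]
```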
As a benchmark, with an output of 20 million rows, the multiprocessing approach is about 10× faster than the single-process version.
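To reproduce that comparison on your own data, a minimal timing helper is enough (a sketch; timed and the stand-in workload below are hypothetical, not part of the original code):

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - t0

# e.g. wrap the single-process loop and the multiprocessing main() the same
# way and compare the elapsed times; a trivial stand-in workload shown here
_, elapsed = timed(sum, range(1_000_000))
print(f"{elapsed:.4f}s")
```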