I know there are many questions like this one, but I can’t seem to find the relevant answer. Let’s say I have 2 data frames as follows:
df1 = pd.DataFrame(
    {
        "end": ["2019-08-31", "2019-08-28", "2019-09-09", "2019-09-08", "2019-09-14", "2019-09-14"],
        "start": ["2019-08-27", "2019-08-22", "2019-08-04", "2019-09-02", "2019-09-06", "2019-09-10"],
        "id": [1234, 8679, 8679, 1234, 1234, 8679],
    }
)
df2 = pd.DataFrame(
    {
        "timestamp": [
            "2019-08-30 10:00", "2019-08-28 10:00", "2019-08-27 10:30", "2019-08-07 12:00",
            "2019-09-12 10:00", "2019-09-11 14:00", "2019-08-29 18:00",
        ],
        "id": [1234, 1234, 8679, 1234, 8679, 8679, 1234],
        "val": ["AAAB", "ABBA", "CXXC", "BBAA", "XCXC", "CCXX", "BAAB"],
    }
)
df1["end"] = pd.to_datetime(df1["end"])
df1["start"] = pd.to_datetime(df1["start"])
df2["timestamp"] = pd.to_datetime(df2["timestamp"])
df1.sort_values(by=["end"], inplace=True)
df2.sort_values(by="timestamp", inplace=True)
Which results in:
          end      start    id
0  2019-08-31 2019-08-27  1234
1  2019-08-28 2019-08-22  8679
2  2019-09-09 2019-08-04  8679
3  2019-09-08 2019-09-02  1234
4  2019-09-14 2019-09-06  1234
5  2019-09-14 2019-09-10  8679

          timestamp    id   val
0  2019-08-30 10:00  1234  AAAB
1  2019-08-28 10:00  1234  ABBA
2  2019-08-27 10:30  8679  CXXC
3  2019-08-07 12:00  1234  BBAA
4  2019-09-12 10:00  8679  XCXC
5  2019-09-11 14:00  8679  CCXX
6  2019-08-29 18:00  1234  BAAB
The classic way to merge by id, so that each timestamp from df2 falls between start and end in df1, is to merge on id (or on a dummy variable) and then filter:
merged_df = pd.merge(df1, df2, how="left", on="id")
merged_df = merged_df.loc[
    (merged_df["timestamp"] >= merged_df["start"])
    & (merged_df["timestamp"] <= merged_df["end"])
]
Which gives me the output I want:
           end      start    id         timestamp   val
0   2019-08-31 2019-08-27  1234  2019-08-30 10:00  AAAB
1   2019-08-31 2019-08-27  1234  2019-08-28 10:00  ABBA
3   2019-08-31 2019-08-27  1234  2019-08-29 18:00  BAAB
4   2019-08-28 2019-08-22  8679  2019-08-27 10:30  CXXC
7   2019-09-09 2019-08-04  8679  2019-08-27 10:30  CXXC
19  2019-09-14 2019-09-10  8679  2019-09-12 10:00  XCXC
20  2019-09-14 2019-09-10  8679  2019-09-11 14:00  CCXX
My question: how can I do the same merge and get the same results when df1 has 200K rows and df2 has 600K?
What I have tried so far:
The classic merge-and-filter approach above fails, because the initial merge creates a huge intermediate data frame that overloads the memory.
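The blow-up can be quantified before running the merge: a join on id produces, for each id, the product of that id's row counts in the two frames, and the merge materialises the sum of those products before any filtering can shrink it. A minimal sketch using only the id columns of the toy frames above:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1234, 8679, 8679, 1234, 1234, 8679]})
df2 = pd.DataFrame({"id": [1234, 1234, 8679, 1234, 8679, 8679, 1234]})

# rows produced per id = (count in df1) * (count in df2);
# value_counts aligns the two Series on the id index
counts = df1["id"].value_counts() * df2["id"].value_counts()
print(int(counts.sum()))  # 3*4 + 3*3 = 21 intermediate rows
```

With 200K x 600K rows and a skewed id distribution, this sum can easily run into billions of rows, which is why the merge overloads memory.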
I also tried the pandasql approach, which left my 16GB-RAM PC stuck. I tried merge_asof in 3 steps of left join, right join, and outer join as explained here, but when I ran some tests it always returned at most 2 records from df2 for a single row in df1.
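That behaviour is inherent to merge_asof: it is a nearest-key join, so each row of the left frame receives at most one match from the right frame, never every row that falls inside an interval. A minimal illustration (toy data, not the frames above):

```python
import pandas as pd

left = pd.DataFrame({"t": pd.to_datetime(["2019-08-31"]), "id": [1234]})
right = pd.DataFrame(
    {
        "t": pd.to_datetime(["2019-08-28 10:00", "2019-08-29 18:00", "2019-08-30 10:00"]),
        "id": [1234, 1234, 1234],
        "val": ["ABBA", "BAAB", "AAAB"],
    }
)
# both frames must be sorted on the key column for merge_asof
out = pd.merge_asof(left, right, on="t", by="id", direction="backward")
print(out)  # one row: only the nearest earlier value ("AAAB") survives
```

So even combining left, right, and outer asof joins can only recover a couple of the matching rows per interval, which matches what I observed.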
Any good advice will be appreciated!
Answer
I’ve been working with niv-dudovitch and david-arenburg on this one, and here are our findings, which I hope will be helpful to some of you out there… The core idea was to prevent objects from growing in memory by building a list of small data frames from subsets of the data and concatenating them only at the end.
First version, without multiprocessing:
import pandas as pd

unk = df1.id.unique()
j = [None] * len(unk)
k = 0
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
for i in unk:
    tmp = df1.loc[df1.index.isin([i])].join(df2.loc[df2.index.isin([i])], how='left')
    j[k] = tmp.loc[tmp['timestamp'].between(tmp['start'], tmp['end'])]
    k += 1
res = pd.concat(j)
res
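For reference, the same per-id subsetting can also be written with groupby, which walks each frame once instead of running an isin scan per id; merge_between is a hypothetical helper name, not part of our original code:

```python
import pandas as pd

def merge_between(df1, df2):
    """Hypothetical helper: merge df2 rows onto df1 per id, keeping
    rows whose timestamp falls between start and end (inclusive)."""
    groups2 = dict(tuple(df2.groupby("id")))  # id -> sub-frame of df2
    parts = []
    for i, left in df1.groupby("id"):
        right = groups2.get(i)
        if right is None:
            continue  # this id has no timestamps in df2
        tmp = left.merge(right, on="id")
        parts.append(tmp[tmp["timestamp"].between(tmp["start"], tmp["end"])])
    return pd.concat(parts, ignore_index=True)
```

Applied to the toy frames from the question it returns the same seven matching rows as the loop above, while still only ever materialising one id's cross product at a time.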
Using multiprocessing
In our real case we have 2 large data frames: df2 has about 3 million rows and df1 is slightly above 110K. The output is about 20M rows.
import concurrent.futures
import itertools
from itertools import repeat

import pandas as pd


def get_val_between(ids, df1, df2):
    """
    Locate all values between two dates by id.

    Args:
        ids (list): list of ids
    Returns:
        concatenation of the matching data frames, or None
    """
    j = [None] * len(ids)
    k = 0
    for i in ids:
        # ids padded with None by grouper simply match nothing here
        tmp = df1.loc[df1.index.isin([i])].join(
            df2.loc[df2.index.isin([i])], how="left"
        )
        tmp = tmp.loc[tmp["timestamp"].between(tmp["start"], tmp["end"])]
        # add to list in location k
        j[k] = tmp
        k += 1
    # keep only the non-None data frames in j
    j = [i for i in j if i is not None]
    if len(j) > 0:
        return pd.concat(j)
    else:
        return None


def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)


def main():
    df1.reset_index(inplace=True, drop=True)
    df2.reset_index(inplace=True, drop=True)
    id_lst = df1.id.unique()
    # process the ids in chunks of 10 per task
    iter_ids = grouper(10, list(id_lst))
    df1.set_index("id", inplace=True)
    df2.set_index("id", inplace=True)
    # set up the process pool; executor.map yields results in order
    executor = concurrent.futures.ProcessPoolExecutor(20)
    result_futures = executor.map(get_val_between, iter_ids, repeat(df1), repeat(df2))
    result_concat = pd.concat(result_futures)
    print(result_concat)


if __name__ == "__main__":
    main()
The results are as expected:
             end      start           timestamp   val
id
8679  2019-08-28 2019-08-22 2019-08-27 10:30:00  CXXC
8679  2019-09-09 2019-08-04 2019-08-27 10:30:00  CXXC
8679  2019-09-14 2019-09-10 2019-09-11 14:00:00  CCXX
8679  2019-09-14 2019-09-10 2019-09-12 10:00:00  XCXC
1234  2019-08-31 2019-08-27 2019-08-28 10:00:00  ABBA
1234  2019-08-31 2019-08-27 2019-08-29 18:00:00  BAAB
1234  2019-08-31 2019-08-27 2019-08-30 10:00:00  AAAB
As a benchmark, with an output of 20 million rows the multiprocessing approach is about 10x faster.
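If you want to reproduce such a comparison on your own data, a minimal sketch is to wall-clock both variants with time.perf_counter; timed is a hypothetical helper, and run_serial / run_multiprocess stand in for the two versions above:

```python
import time

def timed(func, *args, **kwargs):
    # hypothetical helper: run func once, return (result, elapsed seconds)
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# usage sketch, assuming run_serial / run_multiprocess wrap the versions above:
# _, serial_s = timed(run_serial)
# _, mp_s = timed(run_multiprocess)
# print(f"speed-up: {serial_s / mp_s:.1f}x")
```

perf_counter is preferable to time.time here because it is monotonic and has the highest available resolution for interval timing.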