I know there are many questions like this one, but I can’t seem to find the relevant answer. Let’s say I have two data frames as follows:
import pandas as pd

df1 = pd.DataFrame(
    {
        "end": [
            "2019-08-31",
            "2019-08-28",
            "2019-09-09",
            "2019-09-08",
            "2019-09-14",
            "2019-09-14",
        ],
        "start": [
            "2019-08-27",
            "2019-08-22",
            "2019-08-04",
            "2019-09-02",
            "2019-09-06",
            "2019-09-10",
        ],
        "id": [1234, 8679, 8679, 1234, 1234, 8679],
    }
)
df2 = pd.DataFrame(
    {
        "timestamp": [
            "2019-08-30 10:00",
            "2019-08-28 10:00",
            "2019-08-27 10:30",
            "2019-08-07 12:00",
            "2019-09-12 10:00",
            "2019-09-11 14:00",
            "2019-08-29 18:00",
        ],
        "id": [1234, 1234, 8679, 1234, 8679, 8679, 1234],
        "val": ["AAAB", "ABBA", "CXXC", "BBAA", "XCXC", "CCXX", "BAAB"],
    }
)
df1["end"] = pd.to_datetime(df1["end"])
df1["start"] = pd.to_datetime(df1["start"])
df2["timestamp"] = pd.to_datetime(df2["timestamp"])
df1.sort_values(by=["end"], inplace=True)
df2.sort_values(by="timestamp", inplace=True)
which results in (note both frames are now sorted):
         end      start    id
1 2019-08-28 2019-08-22  8679
0 2019-08-31 2019-08-27  1234
3 2019-09-08 2019-09-02  1234
2 2019-09-09 2019-08-04  8679
4 2019-09-14 2019-09-06  1234
5 2019-09-14 2019-09-10  8679
            timestamp    id   val
3 2019-08-07 12:00:00  1234  BBAA
2 2019-08-27 10:30:00  8679  CXXC
1 2019-08-28 10:00:00  1234  ABBA
6 2019-08-29 18:00:00  1234  BAAB
0 2019-08-30 10:00:00  1234  AAAB
5 2019-09-11 14:00:00  8679  CCXX
4 2019-09-12 10:00:00  8679  XCXC
The classic way to merge so that each timestamp falls between start and end for the matching id in df1 is to merge on id (or on a dummy variable) and then filter:
merged_df = pd.merge(df1, df2, how="left", on="id")
merged_df = merged_df.loc[
    (merged_df["timestamp"] >= merged_df["start"])
    & (merged_df["timestamp"] <= merged_df["end"])
]
which gives the output I want:
          end      start    id           timestamp   val
0  2019-08-28 2019-08-22  8679 2019-08-27 10:30:00  CXXC
4  2019-08-31 2019-08-27  1234 2019-08-28 10:00:00  ABBA
5  2019-08-31 2019-08-27  1234 2019-08-29 18:00:00  BAAB
6  2019-08-31 2019-08-27  1234 2019-08-30 10:00:00  AAAB
11 2019-09-09 2019-08-04  8679 2019-08-27 10:30:00  CXXC
19 2019-09-14 2019-09-10  8679 2019-09-11 14:00:00  CCXX
20 2019-09-14 2019-09-10  8679 2019-09-12 10:00:00  XCXC
My question: I need to do the same merge and get the same results, but df1 is 200K rows and df2 is 600K rows.
What I have tried so far:
The classic way of merge and filter, as above, will fail because the initial merge will create a huge data frame that will overload the memory.
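To see why, the size of that intermediate frame can be estimated up front: a merge on id produces, for each id, the product of its row counts in the two frames. A quick sketch using the toy ids from above:

```python
import pandas as pd

# id columns of the toy frames from the question
df1_ids = pd.Series([1234, 8679, 8679, 1234, 1234, 8679])
df2_ids = pd.Series([1234, 1234, 8679, 1234, 8679, 8679, 1234])

# rows produced by merging on "id" = sum over ids of count_in_df1 * count_in_df2
est = df1_ids.value_counts().mul(df2_ids.value_counts(), fill_value=0).sum()
print(int(est))  # 21 intermediate rows for just 6 + 7 input rows
```

With 200K and 600K rows spread over relatively few ids, this product grows very quickly, which is what exhausts the memory before the filter ever runs.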
I also tried the pandasql approach, which ended with my 16GB RAM PC getting stuck. I tried merge_asof in 3 steps of left join, right join, and outer join as explained here, but I ran some tests and it seems to always return at most 2 records from df2 for a single line in df1.
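For what it's worth, the limitation can be reproduced in isolation: merge_asof pairs each row with at most one nearest match, so when two intervals of the same id overlap, one of them silently loses the record. A minimal sketch (the direction and join columns here are my choices, not from the question):

```python
import pandas as pd

# two OVERLAPPING intervals for the same id, taken from the question's data
df1 = pd.DataFrame(
    {
        "end": pd.to_datetime(["2019-08-28", "2019-09-09"]),
        "start": pd.to_datetime(["2019-08-22", "2019-08-04"]),
        "id": [8679, 8679],
    }
)
# one timestamp that falls inside BOTH intervals
df2 = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2019-08-27 10:30"]),
        "id": [8679],
        "val": ["CXXC"],
    }
)

# merge_asof matches each left row to at most ONE right row: the nearest
# "start" at or before the timestamp. The 2019-08-04..2019-09-09 interval
# should also match, but never does.
out = pd.merge_asof(
    df2.sort_values("timestamp"),
    df1.sort_values("start"),
    left_on="timestamp",
    right_on="start",
    by="id",
)
print(out[["timestamp", "start", "end", "val"]])  # a single row
```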
Any good advice will be appreciated!
Answer
I’ve been working with niv-dudovitch and david-arenburg on this one, and here are our findings, which I hope will be helpful to some of you out there. The core idea was to prevent growing objects in memory by building a list of small dataframes, one per subset of the data, instead of one huge merged frame.
First version, without multiprocessing:
import pandas as pd

unk = df1.id.unique()
j = [None] * len(unk)
k = 0
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
for i in unk:
    # join only the rows of the current id, then filter to the date range
    tmp = df1.loc[df1.index.isin([i])].join(df2.loc[df2.index.isin([i])], how='left')
    j[k] = tmp.loc[tmp['timestamp'].between(tmp['start'], tmp['end'])]
    k += 1
res = pd.concat(j)
res
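Run against the toy frames from the question, this loop reproduces the seven rows that the classic merge-and-filter produces. A self-contained version for checking (the list-append here is equivalent to the preallocated list above):

```python
import pandas as pd

df1 = pd.DataFrame(
    {
        "end": pd.to_datetime(["2019-08-31", "2019-08-28", "2019-09-09",
                               "2019-09-08", "2019-09-14", "2019-09-14"]),
        "start": pd.to_datetime(["2019-08-27", "2019-08-22", "2019-08-04",
                                 "2019-09-02", "2019-09-06", "2019-09-10"]),
        "id": [1234, 8679, 8679, 1234, 1234, 8679],
    }
)
df2 = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2019-08-30 10:00", "2019-08-28 10:00",
                                     "2019-08-27 10:30", "2019-08-07 12:00",
                                     "2019-09-12 10:00", "2019-09-11 14:00",
                                     "2019-08-29 18:00"]),
        "id": [1234, 1234, 8679, 1234, 8679, 8679, 1234],
        "val": ["AAAB", "ABBA", "CXXC", "BBAA", "XCXC", "CCXX", "BAAB"],
    }
)

unk = df1["id"].unique()
df1 = df1.set_index("id")
df2 = df2.set_index("id")

parts = []
for i in unk:
    # per-id join: only this id's rows are ever materialized at once
    tmp = df1.loc[df1.index.isin([i])].join(df2.loc[df2.index.isin([i])], how="left")
    parts.append(tmp.loc[tmp["timestamp"].between(tmp["start"], tmp["end"])])

res = pd.concat(parts)
print(len(res))  # 7, same as the classic merge-and-filter
```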
Using multiprocessing
In our real case we have two large data frames: df2 is about 3 million rows and df1 is slightly above 110K rows, and the output is about 20 million rows.
import concurrent.futures
import itertools
from itertools import repeat

import pandas as pd


def get_val_between(ids, df1, df2):
    """
    Locate all values between 2 dates by id.

    Args:
        ids (list): list of ids (the last chunk may be padded with None)
        df1 (pd.DataFrame): intervals frame, indexed by id
        df2 (pd.DataFrame): values frame, indexed by id

    Returns:
        Concatenated dataframe of matches, or None if there were none.
    """
    j = [None] * len(ids)
    k = 0
    for i in ids:
        tmp = df1.loc[df1.index.isin([i])].join(
            df2.loc[df2.index.isin([i])], how="left"
        )
        tmp = tmp.loc[tmp["timestamp"].between(tmp["start"], tmp["end"])]
        # add to list in location k
        j[k] = tmp
        k += 1
    # keep only the non-None dfs in j
    j = [i for i in j if i is not None]
    if len(j) > 0:
        return pd.concat(j)
    else:
        return None


def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)


def main():
    df1.reset_index(inplace=True, drop=True)
    df2.reset_index(inplace=True, drop=True)
    id_lst = df1.id.unique()
    iter_ids = grouper(10, list(id_lst))
    df1.set_index("id", inplace=True)
    df2.set_index("id", inplace=True)
    # spread the id chunks across 20 worker processes; executor.map yields
    # the results in order, and pd.concat silently drops any None results
    with concurrent.futures.ProcessPoolExecutor(20) as executor:
        results = executor.map(get_val_between, iter_ids, repeat(df1), repeat(df2))
        result_concat = pd.concat(results)
    print(result_concat)


if __name__ == "__main__":
    main()
The results are as expected:
end start timestamp val
id
8679 2019-08-28 2019-08-22 2019-08-27 10:30:00 CXXC
8679 2019-09-09 2019-08-04 2019-08-27 10:30:00 CXXC
8679 2019-09-14 2019-09-10 2019-09-11 14:00:00 CCXX
8679 2019-09-14 2019-09-10 2019-09-12 10:00:00 XCXC
1234 2019-08-31 2019-08-27 2019-08-28 10:00:00 ABBA
1234 2019-08-31 2019-08-27 2019-08-29 18:00:00 BAAB
1234 2019-08-31 2019-08-27 2019-08-30 10:00:00 AAAB
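As an aside, the grouper helper above just chunks the id list into fixed-size tuples, padding the last chunk with None (which get_val_between then matches to nothing). A standalone check:

```python
import itertools

def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

chunks = list(grouper(3, [1, 2, 3, 4, 5, 6, 7]))
print(chunks)  # [(1, 2, 3), (4, 5, 6), (7, None, None)]
```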
As a benchmark, with an output of 20 million rows, the multiprocessing approach is about 10× faster than the single-process version.
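To reproduce that comparison on your own data, a minimal timing helper is enough (a sketch; timed and the stand-in workload below are hypothetical, not part of the original code):

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - t0

# e.g. wrap the single-process loop and the multiprocessing main() the same
# way and compare the elapsed times; a trivial stand-in workload shown here
_, elapsed = timed(sum, range(1_000_000))
print(f"{elapsed:.4f}s")
```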