I am looking to join two data frames using the pd.merge_asof function. This function allows me to match data on a unique id and/or a nearest key. In this example, I am matching on the id as well as the nearest date that is less than or equal to the date in df1.
Is there a way to prevent the data from df2 being recycled when joining?
This is the code that I currently have that recycles the values in df2.
JavaScript
x
17
17
1
import pandas as pd
2
import datetime as dt
3
4
df1 = pd.DataFrame({'date': [dt.datetime(2020, 1, 2), dt.datetime(2020, 2, 2), dt.datetime(2020, 3, 2)],
5
'id': ['a', 'a', 'a']})
6
7
df2 = pd.DataFrame({'date': [dt.datetime(2020, 1, 1)],
8
'id': ['a'],
9
'value': ['1']})
10
11
pd.merge_asof(df1,
12
df2,
13
on='date',
14
by='id',
15
direction='backward',
16
allow_exact_matches=True)
17
This is the output that I would like to see instead where only the first match is successful
Advertisement
Answer
Given your merge direction is backward
, you can do a mask on duplicated id
and df2’s date
after merge_asof
:
JavaScript
1
13
13
1
out = pd.merge_asof(df1,
2
df2.rename(columns={'date':'date1'}), # rename df2's date
3
left_on='date',
4
right_on='date1', # so we can work on it later
5
by='id',
6
direction='backward',
7
allow_exact_matches=True)
8
9
# mask the value
10
out['value'] = out['value'].mask(out.duplicated(['id','date1']))
11
# equivalently
12
# out.loc[out.duplicated(['id', 'date1']), 'value'] = np.nan
13
Output:
JavaScript
1
5
1
date id date1 value
2
0 2020-01-02 a 2020-01-01 1
3
1 2020-02-02 a 2020-01-01 NaN
4
2 2020-03-02 a 2020-01-01 NaN
5