I have two dataframes, df1 and df2, which I would like to merge on the column ‘id’ where the ‘triggerdate’ from df1 falls between the ‘startdate’ and ‘enddate’ of df2, however, keep the rows where there’s no match.
df1:
JavaScript
x
7
1
id triggerdate
2
a 09/01/2022
3
a 08/15/2022
4
b 06/25/2022
5
c 06/30/2022
6
c 07/01/2022
7
df2:
JavaScript
1
5
1
id startdate enddate value
2
a 08/30/2022 09/03/2022 30
3
b 07/10/2022 07/15/2022 5
4
c 06/28/2022 07/05/2022 10
5
Expected Output:
JavaScript
1
7
1
id triggerdate startdate enddate value
2
a 09/01/2022 08/30/2022 09/03/2022 30
3
a 08/15/2022 NaN NaN NaN
4
b 06/25/2022 NaN NaN NaN
5
c 06/30/2022 06/28/2022 07/05/2022 10
6
c 07/01/2022 06/28/2022 07/05/2022 10
7
The approach that I have taken so far is:
JavaScript
1
7
1
df_merged = df1.merge(df2, on = ['id'], how='outer')
2
3
output = df_merged.loc[
4
df_merged['triggerdate'].between(
5
df_merged['startdate'],
6
df_merged['enddate'], inclusive='both')]
7
However, this approach does the following 1) Matches the ‘id’ values in df1 with df2 regardless of whether the condition is met and 2) then drops all the rows where the condition isn’t met.
Unfortunately, I had no luck finding the solution online.
What would be the recommended approach to get the expected output?
Thank you in advance for your help!
Advertisement
Answer
Try pd.merge_asof()
modify df2 to merge on date column
JavaScript
1
4
1
df2 = (pd.concat([df2,
2
df2[['id','enddate']]
3
.rename({'enddate':'startdate'},axis=1)]))
4
Then merge
JavaScript
1
8
1
(pd.merge_asof(df.reset_index().sort_values('triggerdate'),
2
df2.sort_values('startdate'),
3
left_on = 'triggerdate',
4
right_on = 'startdate',
5
by = 'id')
6
.sort_values('index')
7
.drop('index',axis=1))
8
Output:
JavaScript
1
7
1
id triggerdate startdate enddate value
2
4 a 2022-09-01 2022-08-30 2022-09-03 30.0
3
3 a 2022-08-15 NaT NaT NaN
4
0 b 2022-06-25 NaT NaT NaN
5
1 c 2022-06-30 2022-06-28 2022-07-05 10.0
6
2 c 2022-07-01 2022-06-28 2022-07-05 10.0
7