I have two dataframes, df1 and df2, which I would like to merge on the column ‘id’ where the ‘triggerdate’ from df1 falls between the ‘startdate’ and ‘enddate’ of df2, however, keep the rows where there’s no match.
df1:
id triggerdate a 09/01/2022 a 08/15/2022 b 06/25/2022 c 06/30/2022 c 07/01/2022
df2:
id startdate enddate value a 08/30/2022 09/03/2022 30 b 07/10/2022 07/15/2022 5 c 06/28/2022 07/05/2022 10
Expected Output:
id triggerdate startdate enddate value a 09/01/2022 08/30/2022 09/03/2022 30 a 08/15/2022 NaN NaN NaN b 06/25/2022 NaN NaN NaN c 06/30/2022 06/28/2022 07/05/2022 10 c 07/01/2022 06/28/2022 07/05/2022 10
The approach that I have taken so far is:
df_merged = df1.merge(df2, on = ['id'], how='outer') output = df_merged.loc[ df_merged['triggerdate'].between( df_merged['startdate'], df_merged['enddate'], inclusive='both')]
However, this approach does the following 1) Matches the ‘id’ values in df1 with df2 regardless of whether the condition is met and 2) then drops all the rows where the condition isn’t met.
Unfortunately, I had no luck finding the solution online.
What would be the recommended approach to get the expected output?
Thank you in advance for your help!
Advertisement
Answer
Try pd.merge_asof()
modify df2 to merge on date column
df2 = (pd.concat([df2, df2[['id','enddate']] .rename({'enddate':'startdate'},axis=1)]))
Then merge
(pd.merge_asof(df.reset_index().sort_values('triggerdate'), df2.sort_values('startdate'), left_on = 'triggerdate', right_on = 'startdate', by = 'id') .sort_values('index') .drop('index',axis=1))
Output:
id triggerdate startdate enddate value 4 a 2022-09-01 2022-08-30 2022-09-03 30.0 3 a 2022-08-15 NaT NaT NaN 0 b 2022-06-25 NaT NaT NaN 1 c 2022-06-30 2022-06-28 2022-07-05 10.0 2 c 2022-07-01 2022-06-28 2022-07-05 10.0