I have two dataframes, both indexed by datetime. For example, dataframe 1 looks like this:
|date            | value  |
+----------------+--------+
|2021-11-11 09:00| 1      |
|2021-11-11 10:00| 1      |
|2021-11-12 11:00| 2      |
|2021-11-14 09:00| 2      |
|2021-11-15 09:30| 3      |
and dataframe 2 looks like this:
|date            | value  |
+----------------+--------+
|2021-11-10 11:00| 2      |
|2021-11-11 09:30| 3      |
|2021-11-12 12:00| 4      |
|2021-11-13 09:50| 2      |
|2021-11-15 10:30| 3      |
For each entry in dataframe 1, I want to find the most recent entry in dataframe 2 and add a new column to dataframe 1 that records the link between the two dataframes.
To make this clearer, the expected result should look like this:
|date            | value  | df2_index      |
+----------------+--------+----------------+
|2021-11-11 09:00| 1      |2021-11-10 11:00|
|2021-11-11 10:00| 1      |2021-11-11 09:30|
|2021-11-12 11:00| 2      |2021-11-11 09:30|
|2021-11-14 09:00| 2      |2021-11-13 09:50|
|2021-11-15 09:30| 3      |2021-11-13 09:50|
For the first entry in dataframe 1, 2021-11-11 09:00, the most recent entry in dataframe 2 is 2021-11-10 11:00. For the third entry, 2021-11-12 11:00, the most recent entry (that is, the largest timestamp in dataframe 2 that does not exceed 2021-11-12 11:00) is 2021-11-11 09:30.
Is there a pandas method that can do this efficiently?
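To pin down the rule, here is a rough sketch of what "most recent" means for a single timestamp, using a sorted lookup. It only illustrates the definition and assumes the timestamps of dataframe 2 are sorted in ascending order; the names `df2_times` and `most_recent` are made up for this example.

```python
import pandas as pd

# Sorted timestamps of dataframe 2, copied from the example table.
df2_times = pd.DatetimeIndex([
    "2021-11-10 11:00", "2021-11-11 09:30", "2021-11-12 12:00",
    "2021-11-13 09:50", "2021-11-15 10:30",
])

def most_recent(ts, times=df2_times):
    """Return the largest timestamp in `times` that does not exceed `ts`, or NaT."""
    # side="right" counts the entries <= ts; subtracting 1 gives the position
    # of the last such entry (or -1 if there is none).
    pos = times.searchsorted(pd.Timestamp(ts), side="right") - 1
    return times[pos] if pos >= 0 else pd.NaT

print(most_recent("2021-11-11 09:00"))  # 2021-11-10 11:00:00
print(most_recent("2021-11-12 11:00"))  # 2021-11-11 09:30:00
```

Calling this once per row would work, but it is a Python-level loop, which is why a vectorized method is preferable.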
Many thanks.
Answer
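pandas merge_asof should suffice. By default it matches each row of df1 with the last row of df2 whose date is less than or equal to it. The call below assumes date is a regular column in both dataframes (as df2.date requires) and that both are sorted by date:
pd.merge_asof(df1, df2.assign(df2_index=df2.date), on='date')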
                 date  value_x  value_y           df2_index
0 2021-11-11 09:00:00        1        2 2021-11-10 11:00:00
1 2021-11-11 10:00:00        1        3 2021-11-11 09:30:00
2 2021-11-12 11:00:00        2        3 2021-11-11 09:30:00
3 2021-11-14 09:00:00        2        2 2021-11-13 09:50:00
4 2021-11-15 09:30:00        3        2 2021-11-13 09:50:00
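Since the question says both dataframes are indexed by datetime rather than holding date as a column, here is a self-contained sketch of the same idea joined on the index. The data is copied from the question; selecting only the df2_index column from df2 is my own choice so that the result matches the expected table without value_x / value_y columns.

```python
import pandas as pd

# Rebuild the two example dataframes with a DatetimeIndex named "date".
df1 = pd.DataFrame(
    {"value": [1, 1, 2, 2, 3]},
    index=pd.to_datetime([
        "2021-11-11 09:00", "2021-11-11 10:00", "2021-11-12 11:00",
        "2021-11-14 09:00", "2021-11-15 09:30",
    ]).rename("date"),
)
df2 = pd.DataFrame(
    {"value": [2, 3, 4, 2, 3]},
    index=pd.to_datetime([
        "2021-11-10 11:00", "2021-11-11 09:30", "2021-11-12 12:00",
        "2021-11-13 09:50", "2021-11-15 10:30",
    ]).rename("date"),
)

# Copy df2's index into a column so it survives the merge, and keep only that
# column. merge_asof needs both indexes sorted in ascending order.
right = df2.assign(df2_index=df2.index)[["df2_index"]]

# Default direction is "backward": last df2 row at or before each df1 row.
result = pd.merge_asof(df1, right, left_index=True, right_index=True)
print(result)
```

If an exact timestamp match should not count as "most recent", merge_asof also accepts allow_exact_matches=False.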
