So I currently have 2 dataframes. These have different columns and what I have been trying to figure out is how to merge on an interval index as well as a unique ID value. Below are 2 different examples of the dataframes I have:
UniqueID,Start_Date,End_Date ID1,01-01-2020,01-08-2020 ID2,01-02-2020,01-04-2020 ID3,01-03-2020,01-05-2020 ID4,01-04-2020,01-09-2020 ID5,01-05-2020,01-10-2020 ID6,01-06-2020,01-11-2020
Creating the dataframe:
pd.DataFrame({ 'UniqueId': ['ID1','ID2','ID3','ID4','ID5','ID6'], 'Start_Date': ['01-01-2020','01-02-2020','01-03-2020','01-04-2020','01-05-2020','01-06-2020'], 'End_Date': ['01-08-2020','01-04-2020','01-05-2020','01-09-2020','01-10-2020','01-11-2020'] }) UniqueID,Trip_Date,Value ID1,10-02-2020,1 ID1,15-02-2020,207 ID2,06-03-2020,10 ID3,29-01-2022,15 ID9,15-02-2020,207 ID12,19-06-2021,189
Creating the dataframe:
pd.DataFrame({ 'UniqueId': ['ID1','ID1','ID2','ID3','ID9','ID12'], 'Trip_Date': ['10-02-2020','15-02-2020','06-03-2020','29-01-2022','15-02-2020','19-06-2021'], 'Value': ['1','207','10','15','207','189'] })
What I want to do is to be able to merge on the UniqueID as well as the interval of the start date and end date inclusively. The resultant dataframe would look like the one below:
UniqueID,Start_Date,End_Date,Trip_Date,Value ID1,01-01-2020,01-08-2020,10-02-2020,1 ID1,01-01-2020,01-08-2020,15-02-2020,207 ID2,01-02-2020,01-04-2020,06-03-2020,10 ID3,01-03-2020,01-05-2020,NA,NA ID4,01-04-2020,01-09-2020,NA,NA ID5,01-05-2020,01-10-2020,NA,NA ID6,01-06-2020,01-11-2020,NA,NA df2.merge(df1, how='left', on='UniqueID')
The first method I have thought of using is to use an IntervalIndex on df1 and then merge based off that but then I have the issue of not being able to merge on the UniqueID and vice versa with UniqueID as a merge column. I kept with a left join when I merged df2 with df1 in order to preserve the original dataframe while merging any records of df1 with potential “matches” on df2.
I thought of potentially using a MultiIndex with an IntervalIndex as one of the levels and then the UniqueID as another but wasn’t sure how to go about this? Any ideas would be greatly appreciated!
The code below, should allow you to get the dataframes into a pandas df. Just make sure to copy and reassign.
df = pd.read_clipboard(sep=',') df1 = df.copy()
Advertisement
Answer
Merge your dataframe on your UniqueID
column then check if Trip_Date
is between Start_Date
and End_date
. Finally, set to nan
all rows where the condition is not met:
# Only if this columns have not datetime64 dtype df1['Start_Date'] = pd.to_datetime(df1['Start_Date'], dayfirst=True) df1['End_Date'] = pd.to_datetime(df1['End_Date'], dayfirst=True) df2['Trip_Date'] = pd.to_datetime(df2['Trip_Date'], dayfirst=True) out = pd.merge(df1, df2, on='UniqueID', how='left') m = out['Trip_Date'].between(out['Start_Date'], out['End_Date']) out.loc[~m, ['Trip_Date', 'Value']] = np.NaN
Output:
>>> out UniqueID Start_Date End_Date Trip_Date Value 0 ID1 2020-01-01 2020-08-01 2020-02-10 1.0 1 ID1 2020-01-01 2020-08-01 2020-02-15 207.0 2 ID2 2020-02-01 2020-04-01 2020-03-06 10.0 3 ID3 2020-03-01 2020-05-01 NaT NaN 4 ID4 2020-04-01 2020-09-01 NaT NaN 5 ID5 2020-05-01 2020-10-01 NaT NaN 6 ID6 2020-06-01 2020-11-01 NaT NaN