Skip to content
Advertisement

Is there a way to merge on Interval Index and another Column Value in pandas?

So I currently have 2 dataframes. These have different columns and what I have been trying to figure out is how to merge on an interval index as well as a unique ID value. Below are 2 different examples of the dataframes I have:

UniqueID,Start_Date,End_Date
ID1,01-01-2020,01-08-2020
ID2,01-02-2020,01-04-2020
ID3,01-03-2020,01-05-2020
ID4,01-04-2020,01-09-2020
ID5,01-05-2020,01-10-2020
ID6,01-06-2020,01-11-2020

Creating the dataframe:

pd.DataFrame({
    'UniqueId': ['ID1','ID2','ID3','ID4','ID5','ID6'],
    'Start_Date': ['01-01-2020','01-02-2020','01-03-2020','01-04-2020','01-05-2020','01-06-2020'],
    'End_Date': ['01-08-2020','01-04-2020','01-05-2020','01-09-2020','01-10-2020','01-11-2020']
})

UniqueID,Trip_Date,Value
ID1,10-02-2020,1
ID1,15-02-2020,207
ID2,06-03-2020,10
ID3,29-01-2022,15
ID9,15-02-2020,207
ID12,19-06-2021,189

Creating the dataframe:

pd.DataFrame({
    'UniqueId': ['ID1','ID1','ID2','ID3','ID9','ID12'],
    'Trip_Date': ['10-02-2020','15-02-2020','06-03-2020','29-01-2022','15-02-2020','19-06-2021'],
    'Value': ['1','207','10','15','207','189']
})

What I want to do is to be able to merge on the UniqueID as well as the interval of the start date and end date inclusively. The resultant dataframe would look like the one below:

UniqueID,Start_Date,End_Date,Trip_Date,Value
ID1,01-01-2020,01-08-2020,10-02-2020,1
ID1,01-01-2020,01-08-2020,15-02-2020,207
ID2,01-02-2020,01-04-2020,06-03-2020,10
ID3,01-03-2020,01-05-2020,NA,NA
ID4,01-04-2020,01-09-2020,NA,NA
ID5,01-05-2020,01-10-2020,NA,NA
ID6,01-06-2020,01-11-2020,NA,NA

df2.merge(df1, how='left', on='UniqueID')

The first method I have thought of using is to use an IntervalIndex on df1 and then merge based off that but then I have the issue of not being able to merge on the UniqueID and vice versa with UniqueID as a merge column. I kept with a left join when I merged df2 with df1 in order to preserve the original dataframe while merging any records of df1 with potential “matches” on df2.

I thought of potentially using a MultiIndex with an IntervalIndex as one of the levels and then the UniqueID as another but wasn’t sure how to go about this? Any ideas would be greatly appreciated!

The code below, should allow you to get the dataframes into a pandas df. Just make sure to copy and reassign.

df = pd.read_clipboard(sep=',')
df1 = df.copy()

Advertisement

Answer

Merge your dataframe on your UniqueID column then check if Trip_Date is between Start_Date and End_date. Finally, set to nan all rows where the condition is not met:

# Only if this columns have not datetime64 dtype
df1['Start_Date'] = pd.to_datetime(df1['Start_Date'], dayfirst=True)
df1['End_Date'] = pd.to_datetime(df1['End_Date'], dayfirst=True)
df2['Trip_Date'] = pd.to_datetime(df2['Trip_Date'], dayfirst=True)

out = pd.merge(df1, df2, on='UniqueID', how='left')
m = out['Trip_Date'].between(out['Start_Date'], out['End_Date'])

out.loc[~m, ['Trip_Date', 'Value']] = np.NaN

Output:

>>> out
  UniqueID Start_Date   End_Date  Trip_Date  Value
0      ID1 2020-01-01 2020-08-01 2020-02-10    1.0
1      ID1 2020-01-01 2020-08-01 2020-02-15  207.0
2      ID2 2020-02-01 2020-04-01 2020-03-06   10.0
3      ID3 2020-03-01 2020-05-01        NaT    NaN
4      ID4 2020-04-01 2020-09-01        NaT    NaN
5      ID5 2020-05-01 2020-10-01        NaT    NaN
6      ID6 2020-06-01 2020-11-01        NaT    NaN
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement