Background: I have two Pandas DataFrames:
- DF1 represents known road segments with >= 7% truck traffic.
- DF2 represents all road segments in the study area.
Columns: SRI
is ‘standard route identifier’, MP_START
is ‘mile point start’, MP_END
is ‘mile point end’, and TRUCK_PCT
is ‘truck traffic percentage’.
Task: For each row in DF1
, the task is to check each record in DF2
to find two concurrent items:
- Find matching
SRI
records. - Within the matching
SRI
records, find the rows inDF2
whereDF1.MP_START
andDF1.MP_END
values are present within theDF2.MP_START
andDF2.MP_END
values. - Then assign the
DF1.TRUCK_PCT
value to the correct row inDF2
following the above condition. An example follows below.
Sample DF1:
DF1 = pd.DataFrame([['00000009__', 51.30, 52.26, '7%'], ['00000015__', 14.41, 14.71,'12%'], ['00000015__', 14.71, 15.45,'8%'], ['00000015__', 15.45, 15.724,'7%'], ['00000015__', 15.72, 16.25,'8%'], ['00000015__', 16.25, 16.31,'10%'], ['00000015__', 16.31, 16.70, '10%'], ['00000015__', 16.70, 16.89,'11%']], columns=['SRI', 'MP_START', 'MP_END', 'TRUCK_PCT'])
SRI MP_START MP_END TRUCK_PCT 0 00000009__ 51.30 52.260 7% 1 00000015__ 14.41 14.710 12% 2 00000015__ 14.71 15.450 8% 3 00000015__ 15.45 15.724 7% 4 00000015__ 15.72 16.250 8% 5 00000015__ 16.25 16.310 10% 6 00000015__ 16.31 16.700 10% 7 00000015__ 16.70 16.890 11%
Sample DF2:
DF2 = pd.DataFrame([['17031021__', 0.0, 0.10], ['03291236__', 0.0, 0.05], ['200006165__', 0.0, 0.12], ['00000009__', 52.36, 52.394], ['00000009__', 51.78, 52.36], ['00000009__', 49.09, 51.78], ['00000009__', 48.76, 48.76], ['00000015__', 15.45, 15.48]], columns=['SRI', 'MP_START', 'MP_END'])
SRI MP_START MP_END 0 17031021__ 0.00 0.100 1 03291236__ 0.00 0.050 2 200006165__ 0.00 0.120 3 00000009__ 52.36 52.394 4 00000009__ 51.78 52.360 5 00000009__ 49.09 51.780 6 00000009__ 48.76 48.760 7 00000015__ 15.45 15.480
Expected result:
For the row where
DF1.SRI == 00000009__
,DF1.MP_START == 51.30
andDF1.MP_END == 52.260
:DF1.MP_START
on row 0 (51.30
) falls within row 5 ofDF2
andDF1.MP_END
of row 0 (52.26
) falls within row 4. Therefore, bothDF2
rows 4 and 5 are assigned aTRUCK_PCT
value of 7%.Similarly, where
DF2.SRI == 00000015__
(row 7) this row would be assigned aTRUCK_PCT
of 7% because row 3 ofDF1
containsMP_START
andMP_END
values ofDF2
whereDF1.SRI == DF2.SRI (00000015__)
.
Sample DF3 Output:
SRI MP_START MP_END TRUCK_PCT 0 17031021__ 0.00 0.100 1 03291236__ 0.00 0.050 2 200006165__ 0.00 0.120 3 00000009__ 52.36 52.394 4 00000009__ 51.78 52.360 7% 5 00000009__ 49.09 51.780 7% 6 00000009__ 48.76 48.760 7 00000015__ 15.45 15.480 7%
Disclaimer: This is a confusing problem – please ask for more info where needed and I will try and clarify.
Advertisement
Answer
A possible solution would be to first merge the two dataframes on the SRI column (keeping all rows in DF2
) and then afterward find which rows fulfill the secondary condition of overlapping ranges.
First, merge the dataframes while keeping the index of DF2
for later usage:
DF2 = DF2.reset_index() df = pd.merge(DF2, DF1, on='SRI', how='left', suffixes=('', '_range'))
This results in the following dataframe:
index SRI MP_START MP_END MP_START_range MP_END_range TRUCK_PCT 0 0 17031021__ 0.00 0.100 NaN NaN NaN 1 1 03291236__ 0.00 0.050 NaN NaN NaN 2 2 200006165__ 0.00 0.120 NaN NaN NaN 3 3 00000009__ 52.36 52.394 51.30 52.260 7% 4 4 00000009__ 51.78 52.360 51.30 52.260 7% 5 5 00000009__ 49.09 51.780 51.30 52.260 7% 6 6 00000009__ 48.76 48.760 51.30 52.260 7% 7 7 00000015__ 15.45 15.480 14.41 14.710 12% 8 7 00000015__ 15.45 15.480 14.71 15.450 8% 9 7 00000015__ 15.45 15.480 15.45 15.724 7% 10 7 00000015__ 15.45 15.480 15.72 16.250 8% 11 7 00000015__ 15.45 15.480 16.25 16.310 10% 12 7 00000015__ 15.45 15.480 16.31 16.700 10% 13 7 00000015__ 15.45 15.480 16.70 16.890 11%
Now, we create a boolean index for the rows that have MP_START
or MP_END
within the two range columns. Any row that does not fulfill one of these conditions have their TRUCK_PCT
value set to nan:
cond_start = (df['MP_START'] >= df['MP_START_range']) & (df['MP_START'] < df['MP_END_range']) cond_end = (df['MP_END'] >= df['MP_START_range']) & (df['MP_END'] < df['MP_END_range']) df.loc[~(cond_start | cond_end), 'TRUCK_PCT'] = np.nan
Finally, we can drop the two range columns and group by the previously created index to get the final result. This will preserve any rows in DF2
that do not have a match in DF1
:
df = df.drop(columns=['MP_START_range', 'MP_END_range']).groupby('index').first().fillna('')
Result:
SRI MP_START MP_END TRUCK_PCT index 0 17031021__ 0.00 0.100 1 03291236__ 0.00 0.050 2 200006165__ 0.00 0.120 3 00000009__ 52.36 52.394 4 00000009__ 51.78 52.360 7% 5 00000009__ 49.09 51.780 7% 6 00000009__ 48.76 48.760 7 00000015__ 15.45 15.480 7%