Background: I have two Pandas DataFrames: DF1 represents known road segments with >= 7% truck traffic. DF2 represents all road segments in the study area. Columns: SRI is 'standard route identifier', MP_START is 'mile point start', MP_END is 'mile point end', and TRUCK_PCT is 'truck traffic percentage'. Task: For each row in DF1, the task is to check each record

Checking segment length in dataframe 1 against multiple segment instances in dataframe 2

Background: I have two Pandas DataFrames:

DF1 represents known road segments with >= 7% truck traffic.
DF2 represents all road segments in the study area.

Columns: SRI is ‘standard route identifier’, MP_START is ‘mile point start’, MP_END is ‘mile point end’, and TRUCK_PCT is ‘truck traffic percentage’.

Task: For each row in DF1, the task is to check each record in DF2 to find two concurrent items:

Find matching SRI records.
Within the matching SRI records, find the rows in DF2 where DF1.MP_START and DF1.MP_END values are present within the DF2.MP_START and DF2.MP_END values.
Then assign the DF1.TRUCK_PCT value to the correct row in DF2 following the above condition. An example follows below.

Sample DF1:

DF1 = pd.DataFrame([['00000009__', 51.30, 52.26, '7%'], 
                   ['00000015__', 14.41, 14.71,'12%'], 
                   ['00000015__', 14.71, 15.45,'8%'],
                   ['00000015__', 15.45, 15.724,'7%'],
                   ['00000015__', 15.72, 16.25,'8%'],
                   ['00000015__', 16.25, 16.31,'10%'],
                   ['00000015__', 16.31, 16.70, '10%'],
                   ['00000015__', 16.70, 16.89,'11%']], columns=['SRI', 'MP_START', 'MP_END', 'TRUCK_PCT'])

JavaScript
​x
 
DF1 = pd.DataFrame([['00000009__', 51.30, 52.26, '7%'], 
                   ['00000015__', 14.41, 14.71,'12%'], 
                   ['00000015__', 14.71, 15.45,'8%'],
                   ['00000015__', 15.45, 15.724,'7%'],
                   ['00000015__', 15.72, 16.25,'8%'],
                   ['00000015__', 16.25, 16.31,'10%'],
                   ['00000015__', 16.31, 16.70, '10%'],
                   ['00000015__', 16.70, 16.89,'11%']], columns=['SRI', 'MP_START', 'MP_END', 'TRUCK_PCT'])
​

         SRI   MP_START MP_END  TRUCK_PCT
0   00000009__  51.30   52.260  7%
1   00000015__  14.41   14.710  12%
2   00000015__  14.71   15.450  8%
3   00000015__  15.45   15.724  7%
4   00000015__  15.72   16.250  8%
5   00000015__  16.25   16.310  10%
6   00000015__  16.31   16.700  10%
7   00000015__  16.70   16.890  11%

JavaScript
 
         SRI   MP_START MP_END  TRUCK_PCT
 00000009__  51.30   52.260  7%
 00000015__  14.41   14.710  12%
 00000015__  14.71   15.450  8%
 00000015__  15.45   15.724  7%
 00000015__  15.72   16.250  8%
 00000015__  16.25   16.310  10%
 00000015__  16.31   16.700  10%
 00000015__  16.70   16.890  11%
​

Sample DF2:

DF2 = pd.DataFrame([['17031021__', 0.0, 0.10], 
                   ['03291236__', 0.0, 0.05], 
                   ['200006165__', 0.0, 0.12],
                   ['00000009__', 52.36, 52.394],
                   ['00000009__', 51.78, 52.36],
                   ['00000009__', 49.09, 51.78],
                   ['00000009__', 48.76, 48.76],
                   ['00000015__', 15.45, 15.48]], columns=['SRI', 'MP_START', 'MP_END'])

JavaScript
 
DF2 = pd.DataFrame([['17031021__', 0.0, 0.10], 
                   ['03291236__', 0.0, 0.05], 
                   ['200006165__', 0.0, 0.12],
                   ['00000009__', 52.36, 52.394],
                   ['00000009__', 51.78, 52.36],
                   ['00000009__', 49.09, 51.78],
                   ['00000009__', 48.76, 48.76],
                   ['00000015__', 15.45, 15.48]], columns=['SRI', 'MP_START', 'MP_END'])
​

    SRI       MP_START  MP_END
0   17031021__  0.00    0.100
1   03291236__  0.00    0.050
2   200006165__ 0.00    0.120
3   00000009__  52.36   52.394
4   00000009__  51.78   52.360
5   00000009__  49.09   51.780
6   00000009__  48.76   48.760
7   00000015__  15.45   15.480

JavaScript
 
    SRI       MP_START  MP_END
 17031021__  0.00    0.100
 03291236__  0.00    0.050
 200006165__ 0.00    0.120
 00000009__  52.36   52.394
 00000009__  51.78   52.360
 00000009__  49.09   51.780
 00000009__  48.76   48.760
 00000015__  15.45   15.480
​

Expected result:

For the row where DF1.SRI == 00000009__, DF1.MP_START == 51.30 and DF1.MP_END == 52.260:

DF1.MP_START on row 0 (51.30) falls within row 5 of DF2 and DF1.MP_END of row 0 (52.26) falls within row 4. Therefore, both DF2 rows 4 and 5 are assigned a TRUCK_PCT value of 7%.
Similarly, where DF2.SRI == 00000015__ (row 7) this row would be assigned a TRUCK_PCT of 7% because row 3 of DF1 contains MP_START and MP_END values of DF2 where DF1.SRI == DF2.SRI (00000015__).

Sample DF3 Output:

    SRI       MP_START  MP_END  TRUCK_PCT
0   17031021__  0.00    0.100   
1   03291236__  0.00    0.050
2   200006165__ 0.00    0.120
3   00000009__  52.36   52.394
4   00000009__  51.78   52.360   7%
5   00000009__  49.09   51.780   7%
6   00000009__  48.76   48.760
7   00000015__  15.45   15.480   7%

JavaScript
 
    SRI       MP_START  MP_END  TRUCK_PCT
 17031021__  0.00    0.100   
 03291236__  0.00    0.050
 200006165__ 0.00    0.120
 00000009__  52.36   52.394
 00000009__  51.78   52.360   7%
 00000009__  49.09   51.780   7%
 00000009__  48.76   48.760
 00000015__  15.45   15.480   7%
​

Disclaimer: This is a confusing problem – please ask for more info where needed and I will try and clarify.

Answer

A possible solution would be to first merge the two dataframes on the SRI column (keeping all rows in DF2) and then afterward find which rows fulfill the secondary condition of overlapping ranges.

First, merge the dataframes while keeping the index of DF2 for later usage:

DF2 = DF2.reset_index()
df = pd.merge(DF2, DF1, on='SRI', how='left', suffixes=('', '_range'))

JavaScript
 
DF2 = DF2.reset_index()
df = pd.merge(DF2, DF1, on='SRI', how='left', suffixes=('', '_range'))
​

This results in the following dataframe:

    index          SRI  MP_START  MP_END  MP_START_range  MP_END_range TRUCK_PCT
0       0   17031021__      0.00   0.100             NaN           NaN       NaN
1       1   03291236__      0.00   0.050             NaN           NaN       NaN
2       2  200006165__      0.00   0.120             NaN           NaN       NaN
3       3   00000009__     52.36  52.394           51.30        52.260        7%
4       4   00000009__     51.78  52.360           51.30        52.260        7%
5       5   00000009__     49.09  51.780           51.30        52.260        7%
6       6   00000009__     48.76  48.760           51.30        52.260        7%
7       7   00000015__     15.45  15.480           14.41        14.710       12%
8       7   00000015__     15.45  15.480           14.71        15.450        8%
9       7   00000015__     15.45  15.480           15.45        15.724        7%
10      7   00000015__     15.45  15.480           15.72        16.250        8%
11      7   00000015__     15.45  15.480           16.25        16.310       10%
12      7   00000015__     15.45  15.480           16.31        16.700       10%
13      7   00000015__     15.45  15.480           16.70        16.890       11%

JavaScript
 
    index          SRI  MP_START  MP_END  MP_START_range  MP_END_range TRUCK_PCT
     0   17031021__      0.00   0.100             NaN           NaN       NaN
     1   03291236__      0.00   0.050             NaN           NaN       NaN
     2  200006165__      0.00   0.120             NaN           NaN       NaN
     3   00000009__     52.36  52.394           51.30        52.260        7%
     4   00000009__     51.78  52.360           51.30        52.260        7%
     5   00000009__     49.09  51.780           51.30        52.260        7%
     6   00000009__     48.76  48.760           51.30        52.260        7%
     7   00000015__     15.45  15.480           14.41        14.710       12%
     7   00000015__     15.45  15.480           14.71        15.450        8%
     7   00000015__     15.45  15.480           15.45        15.724        7%
    7   00000015__     15.45  15.480           15.72        16.250        8%
    7   00000015__     15.45  15.480           16.25        16.310       10%
    7   00000015__     15.45  15.480           16.31        16.700       10%
    7   00000015__     15.45  15.480           16.70        16.890       11%
​

Now, we create a boolean index for the rows that have MP_START or MP_END within the two range columns. Any row that does not fulfill one of these conditions have their TRUCK_PCT value set to nan:

cond_start = (df['MP_START'] >= df['MP_START_range']) & (df['MP_START'] < df['MP_END_range'])
cond_end = (df['MP_END'] >= df['MP_START_range']) & (df['MP_END'] < df['MP_END_range'])
df.loc[~(cond_start | cond_end), 'TRUCK_PCT'] = np.nan

JavaScript
 
cond_start = (df['MP_START'] >= df['MP_START_range']) & (df['MP_START'] < df['MP_END_range'])
cond_end = (df['MP_END'] >= df['MP_START_range']) & (df['MP_END'] < df['MP_END_range'])
df.loc[~(cond_start | cond_end), 'TRUCK_PCT'] = np.nan
​

Finally, we can drop the two range columns and group by the previously created index to get the final result. This will preserve any rows in DF2 that do not have a match in DF1:

df = df.drop(columns=['MP_START_range', 'MP_END_range']).groupby('index').first().fillna('')

JavaScript
 
df = df.drop(columns=['MP_START_range', 'MP_END_range']).groupby('index').first().fillna('')
​

Result:

               SRI  MP_START  MP_END TRUCK_PCT
index                                         
0       17031021__      0.00   0.100          
1       03291236__      0.00   0.050          
2      200006165__      0.00   0.120          
3       00000009__     52.36  52.394          
4       00000009__     51.78  52.360        7%
5       00000009__     49.09  51.780        7%
6       00000009__     48.76  48.760          
7       00000015__     15.45  15.480        7%

JavaScript
 
               SRI  MP_START  MP_END TRUCK_PCT
index                                         
0       17031021__      0.00   0.100          
1       03291236__      0.00   0.050          
2      200006165__      0.00   0.120          
3       00000009__     52.36  52.394          
4       00000009__     51.78  52.360        7%
5       00000009__     49.09  51.780        7%
6       00000009__     48.76  48.760          
7       00000015__     15.45  15.480        7%
​

Advertisement

Answer