Skip to content
Advertisement

Checking segment length in dataframe 1 against multiple segment instances in dataframe 2

Background: I have two Pandas DataFrames:

  • DF1 represents known road segments with >= 7% truck traffic.
  • DF2 represents all road segments in the study area.

Columns: SRI is ‘standard route identifier’, MP_START is ‘mile point start’, MP_END is ‘mile point end’, and TRUCK_PCT is ‘truck traffic percentage’.

Task: For each row in DF1, the task is to check each record in DF2 to find two concurrent items:

  1. Find matching SRI records.
  2. Within the matching SRI records, find the rows in DF2 where DF1.MP_START and DF1.MP_END values are present within the DF2.MP_START and DF2.MP_END values.
  3. Then assign the DF1.TRUCK_PCT value to the correct row in DF2 following the above condition. An example follows below.

Sample DF1:

DF1 = pd.DataFrame([['00000009__', 51.30, 52.26, '7%'], 
                   ['00000015__', 14.41, 14.71,'12%'], 
                   ['00000015__', 14.71, 15.45,'8%'],
                   ['00000015__', 15.45, 15.724,'7%'],
                   ['00000015__', 15.72, 16.25,'8%'],
                   ['00000015__', 16.25, 16.31,'10%'],
                   ['00000015__', 16.31, 16.70, '10%'],
                   ['00000015__', 16.70, 16.89,'11%']], columns=['SRI', 'MP_START', 'MP_END', 'TRUCK_PCT'])

         SRI   MP_START MP_END  TRUCK_PCT
0   00000009__  51.30   52.260  7%
1   00000015__  14.41   14.710  12%
2   00000015__  14.71   15.450  8%
3   00000015__  15.45   15.724  7%
4   00000015__  15.72   16.250  8%
5   00000015__  16.25   16.310  10%
6   00000015__  16.31   16.700  10%
7   00000015__  16.70   16.890  11%

Sample DF2:

DF2 = pd.DataFrame([['17031021__', 0.0, 0.10], 
                   ['03291236__', 0.0, 0.05], 
                   ['200006165__', 0.0, 0.12],
                   ['00000009__', 52.36, 52.394],
                   ['00000009__', 51.78, 52.36],
                   ['00000009__', 49.09, 51.78],
                   ['00000009__', 48.76, 48.76],
                   ['00000015__', 15.45, 15.48]], columns=['SRI', 'MP_START', 'MP_END'])

    SRI       MP_START  MP_END
0   17031021__  0.00    0.100
1   03291236__  0.00    0.050
2   200006165__ 0.00    0.120
3   00000009__  52.36   52.394
4   00000009__  51.78   52.360
5   00000009__  49.09   51.780
6   00000009__  48.76   48.760
7   00000015__  15.45   15.480

Expected result:

  • For the row where DF1.SRI == 00000009__, DF1.MP_START == 51.30 and DF1.MP_END == 52.260:

    DF1.MP_START on row 0 (51.30) falls within row 5 of DF2 and DF1.MP_END of row 0 (52.26) falls within row 4. Therefore, both DF2 rows 4 and 5 are assigned a TRUCK_PCT value of 7%.

  • Similarly, where DF2.SRI == 00000015__ (row 7) this row would be assigned a TRUCK_PCT of 7% because row 3 of DF1 contains MP_START and MP_END values of DF2 where DF1.SRI == DF2.SRI (00000015__).

Sample DF3 Output:

    SRI       MP_START  MP_END  TRUCK_PCT
0   17031021__  0.00    0.100   
1   03291236__  0.00    0.050
2   200006165__ 0.00    0.120
3   00000009__  52.36   52.394
4   00000009__  51.78   52.360   7%
5   00000009__  49.09   51.780   7%
6   00000009__  48.76   48.760
7   00000015__  15.45   15.480   7%

Disclaimer: This is a confusing problem – please ask for more info where needed and I will try and clarify.

Advertisement

Answer

A possible solution would be to first merge the two dataframes on the SRI column (keeping all rows in DF2) and then afterward find which rows fulfill the secondary condition of overlapping ranges.

First, merge the dataframes while keeping the index of DF2 for later usage:

DF2 = DF2.reset_index()
df = pd.merge(DF2, DF1, on='SRI', how='left', suffixes=('', '_range'))

This results in the following dataframe:

    index          SRI  MP_START  MP_END  MP_START_range  MP_END_range TRUCK_PCT
0       0   17031021__      0.00   0.100             NaN           NaN       NaN
1       1   03291236__      0.00   0.050             NaN           NaN       NaN
2       2  200006165__      0.00   0.120             NaN           NaN       NaN
3       3   00000009__     52.36  52.394           51.30        52.260        7%
4       4   00000009__     51.78  52.360           51.30        52.260        7%
5       5   00000009__     49.09  51.780           51.30        52.260        7%
6       6   00000009__     48.76  48.760           51.30        52.260        7%
7       7   00000015__     15.45  15.480           14.41        14.710       12%
8       7   00000015__     15.45  15.480           14.71        15.450        8%
9       7   00000015__     15.45  15.480           15.45        15.724        7%
10      7   00000015__     15.45  15.480           15.72        16.250        8%
11      7   00000015__     15.45  15.480           16.25        16.310       10%
12      7   00000015__     15.45  15.480           16.31        16.700       10%
13      7   00000015__     15.45  15.480           16.70        16.890       11%

Now, we create a boolean index for the rows that have MP_START or MP_END within the two range columns. Any row that does not fulfill one of these conditions have their TRUCK_PCT value set to nan:

cond_start = (df['MP_START'] >= df['MP_START_range']) & (df['MP_START'] < df['MP_END_range'])
cond_end = (df['MP_END'] >= df['MP_START_range']) & (df['MP_END'] < df['MP_END_range'])
df.loc[~(cond_start | cond_end), 'TRUCK_PCT'] = np.nan

Finally, we can drop the two range columns and group by the previously created index to get the final result. This will preserve any rows in DF2 that do not have a match in DF1:

df = df.drop(columns=['MP_START_range', 'MP_END_range']).groupby('index').first().fillna('')

Result:

               SRI  MP_START  MP_END TRUCK_PCT
index                                         
0       17031021__      0.00   0.100          
1       03291236__      0.00   0.050          
2      200006165__      0.00   0.120          
3       00000009__     52.36  52.394          
4       00000009__     51.78  52.360        7%
5       00000009__     49.09  51.780        7%
6       00000009__     48.76  48.760          
7       00000015__     15.45  15.480        7%
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement