I need to improve the performance of the following dataframe slices matching. What I need to do is find the matching trips between 2 dataframes, according to the sequence column values with order conserved.
My 2 dataframes:
>>> df1
   trips sequence
0     11        a
1     11        d
2     21        d
3     21        a
4     31        a
5     31        b
6     31        c
>>> df2
   trips sequence
0     12        a
1     12        d
2     22        c
3     22        b
4     22        a
5     32        a
6     32        d
Expected output:
['11 match 12']
This is the code I'm using:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31],
                    'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})
df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32],
                    'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})

route_match = []
for trip1 in df1['trips'].drop_duplicates():
    for trip2 in df2['trips'].drop_duplicates():
        route1 = df1[df1['trips'] == trip1]['sequence']
        route2 = df2[df2['trips'] == trip2]['sequence']
        if np.array_equal(route1.values, route2.values):
            route_match.append(str(trip1) + ' match ' + str(trip2))
            break
        else:
            continue
This works, but it is very slow and inefficient because it compares every pair of trips, and my real dataframes are much longer. Any suggestions?
Answer
You can aggregate each trip's sequence into a tuple with groupby.agg, then merge the two outputs to identify the identical routes:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple),
               on='sequence')
output:
   trips_x sequence  trips_y
0       11   (a, d)       12
1       11   (a, d)       32
If you only want the first match, drop_duplicates the output of the df2 aggregation to prevent unnecessary merging:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
                  .drop_duplicates(subset='sequence'),
               on='sequence')
output:
   trips_x sequence  trips_y
0       11   (a, d)       12
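If you need the result in the same list-of-strings format the original loop produced (['11 match 12']), you can build it from the merged frame afterwards. A minimal end-to-end sketch, noting that the column names trips_x and trips_y come from merge's default suffixes:

```python
import pandas as pd

df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31],
                    'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})
df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32],
                    'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})

# aggregate each trip's sequence into an order-preserving tuple,
# keep only the first df2 trip per distinct route, then merge
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
                  .drop_duplicates(subset='sequence'),
               on='sequence')

# rebuild the same string format as the original loop
route_match = (out['trips_x'].astype(str) + ' match '
               + out['trips_y'].astype(str)).tolist()
print(route_match)  # ['11 match 12']
```

This keeps the whole computation vectorized: each dataframe is scanned once by groupby, and merge does a hash join on the tuple column instead of the original pairwise comparison.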