I need to improve the performance of the following dataframe slices matching. What I need to do is find the matching trips between 2 dataframes, according to the sequence column values with order conserved.
My 2 dataframes:
>>> df1
   trips sequence
0     11        a
1     11        d
2     21        d
3     21        a
4     31        a
5     31        b
6     31        c
>>> df2
   trips sequence
0     12        a
1     12        d
2     22        c
3     22        b
4     22        a
5     32        a
6     32        d
Expected output:
['11 match 12']
This is the code I'm using:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31], 'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})
df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32], 'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})
route_match = []
for trip1 in df1['trips'].drop_duplicates():
    for trip2 in df2['trips'].drop_duplicates():
        route1 = df1[df1['trips'] == trip1]['sequence']
        route2 = df2[df2['trips'] == trip2]['sequence']
        if np.array_equal(route1.values, route2.values):
            route_match.append(str(trip1) + ' match ' + str(trip2))
            break
Although it works, this is very slow and inefficient, as my real dataframes are much longer. Any suggestions?
Answer
You can aggregate each trip's sequence into a tuple with groupby.agg, then merge the two outputs to identify identical routes:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple),
               on='sequence'
              )
output:
   trips_x sequence  trips_y
0       11   (a, d)       12
1       11   (a, d)       32
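To see what the aggregation step alone produces, here is a sketch on the sample data from the question: each trip collapses to one row, with its route stored as an order-preserving tuple (the variable name agg1 is just for illustration).

```python
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31],
                    'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})

# One row per trip; tuple() keeps the within-trip order of 'sequence'
agg1 = df1.groupby('trips', as_index=False)['sequence'].agg(tuple)
#    trips   sequence
# 0     11     (a, d)
# 1     21     (d, a)
# 2     31  (a, b, c)
```

Because tuples are hashable, merging on this column compares whole routes at once instead of element by element.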
If you only want the first match, apply drop_duplicates to the output of the df2 aggregation to avoid unnecessary merges:
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
                  .drop_duplicates(subset='sequence'),
               on='sequence'
              )
output:
   trips_x sequence  trips_y
0       11   (a, d)       12
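If you want the exact list-of-strings format from the question, you can rebuild it from the merged frame; a sketch using the sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31],
                    'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})
df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32],
                    'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})

# Aggregate each trip's route into a tuple, keep df2's first trip per route
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
               df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
                  .drop_duplicates(subset='sequence'),
               on='sequence')

# Rebuild the original 'X match Y' strings from the merged rows
route_match = (out['trips_x'].astype(str) + ' match '
               + out['trips_y'].astype(str)).tolist()
# route_match == ['11 match 12']
```

This does a single vectorized merge rather than a Python-level double loop, so it scales far better on long dataframes.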
