Can I perform a left join/merge between two dataframes using regular expressions with pandas?

Question

I am trying to perform a left merge using regular expressions in Python that can handle many-to-many relationships.

Accepted Answer

You can create a custom function that finds every pair of matching row positions across the two data frames, extracts those rows, and combines them with pd.concat.

    import re
    import pandas as pd

    def merge_regex(df1, df2):
        # Collect (i, j) position pairs where the regex in df1 matches the value in df2.col2.
        # re.match anchors at the start of the string.
        idx = [(i, j) for i, r in enumerate(df1.regex) for j, v in enumerate(df2.col2) if re.match(r, v)]
        # Note: this unpacking raises ValueError if no pair matches at all.
        df1_idx, df2_idx = zip(*idx)
        # Take the first column of each frame at the matched positions.
        t = df1.iloc[list(df1_idx), 0].reset_index(drop=True)
        t1 = df2.iloc[list(df2_idx), 0].reset_index(drop=True)
        return pd.concat([t, t1], axis=1)

    merge_regex(df1, df2)

      col1 col2
    0    a   ab
    1    a    a
    2    b   ab
    3    c   cd
    4    d   cd

Timeit results

    # My solution
    In [292]: %timeit merge_regex(df1,df2)
    1.21 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    # Chris's solution
    In [293]: %%timeit
         ...: df1['matches'] = df1.apply(lambda r: [x for x in df2['col2'].values if re.findall(r['regex'], x)], axis=1)
         ...:
         ...: df1.set_index('col1').explode('matches').reset_index().drop(columns=['regex'])
         ...:
         ...:
    4.62 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
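merge_regex returns only the rows that have at least one match, so it behaves like an inner join. If rows of df1 without any match should be kept, as in a true left merge, one possible variation is sketched below. It is not part of the original answer: the name left_merge_regex, its parameters, and the sample frames are hypothetical, chosen only to resemble the output shown above, and it assumes the two frames do not share column names.

```python
import re
import pandas as pd

def left_merge_regex(left, right, regex_col="regex", text_col="col2"):
    """Sketch of a regex left join: keep every row of `left`,
    repeating it once per matching row of `right`."""
    left = left.reset_index(drop=True)
    right = right.reset_index(drop=True)
    left_pos, right_pos = [], []
    for i, pattern in enumerate(left[regex_col]):
        compiled = re.compile(pattern)
        # Positions in `right` whose text matches from the start of the string.
        hits = [j for j, text in enumerate(right[text_col]) if compiled.match(text)]
        if hits:
            left_pos.extend([i] * len(hits))
            right_pos.extend(hits)
        else:
            # No match: keep the left row and pair it with a label that is
            # absent from `right`, so reindex fills that row with NaN.
            left_pos.append(i)
            right_pos.append(-1)
    left_part = left.iloc[left_pos].reset_index(drop=True)
    right_part = right.reindex(index=right_pos).reset_index(drop=True)
    return pd.concat([left_part, right_part], axis=1)

# Hypothetical sample data (the question's original frames are not shown).
df1 = pd.DataFrame({"col1": ["a", "b", "c", "d", "e"],
                    "regex": ["a", "ab", "c", "cd", "z"]})
df2 = pd.DataFrame({"col2": ["ab", "a", "cd"]})

result = left_merge_regex(df1, df2).drop(columns=["regex"])
print(result)
```

With this sample data the row for e matches nothing, so it is kept with a missing col2 instead of being dropped. The nested loop still compares every pattern against every value, so the cost grows with len(df1) * len(df2), the same as the original function.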
