I am trying to perform a left merge using regular expressions in Python that can handle many-to-many relationships. Example:
    import pandas as pd

    df1 = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['col1'])
    df1['regex'] = '.*' + df1['col1'] + '.*'

    #   col1  regex
    # 0    a  .*a.*
    # 1    b  .*b.*
    # 2    c  .*c.*
    # 3    d  .*d.*

    df2 = pd.DataFrame(['ab', 'a', 'cd'], columns=['col2'])

    #   col2
    # 0   ab
    # 1    a
    # 2   cd

    # Merge on regex column to col2 (desired output)
    out = pd.DataFrame([['a', 'ab'], ['a', 'a'], ['b', 'ab'], ['c', 'cd'],
                        ['d', 'cd']], columns=['col1', 'col2'])

    #   col1 col2
    # 0    a   ab
    # 1    a    a
    # 2    b   ab
    # 3    c   cd
    # 4    d   cd
Answer
You can create a custom function that finds every matching pair of row positions in the two data frames, extracts those rows, and joins them side by side with pd.concat.
    import re
    import pandas as pd

    def merge_regex(df1, df2):
        # Collect (i, j) positions where the regex in row i of df1 matches row j of df2.
        idx = [(i, j) for i, r in enumerate(df1.regex) for j, v in enumerate(df2.col2) if re.match(r, v)]
        # Split the pairs into one tuple of row positions per frame.
        df1_idx, df2_idx = zip(*idx)
        t = df1.iloc[list(df1_idx), 0].reset_index(drop=True)
        t1 = df2.iloc[list(df2_idx), 0].reset_index(drop=True)
        # Place the matched col1 and col2 values side by side.
        return pd.concat([t, t1], axis=1)

    merge_regex(df1, df2)

    #   col1 col2
    # 0    a   ab
    # 1    a    a
    # 2    b   ab
    # 3    c   cd
    # 4    d   cd
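Note that merge_regex keeps only rows that match, and the zip(*idx) unpacking raises a ValueError when nothing matches at all. Since the question asks for a left merge, here is a minimal sketch of a variant that keeps unmatched rows of the left frame with a missing value in col2. The function name merge_regex_left, its keyword parameters, and the extra row 'e' in the sample data are assumptions made for illustration; they are not part of the original answer.

```python
import re
import pandas as pd

def merge_regex_left(left, right, regex_col="regex", left_col="col1", right_col="col2"):
    """Pair each left row with every right value its regex matches.

    Left rows without any match are kept with a missing right value,
    which is what a left merge would do. (Hypothetical helper.)
    """
    rows = []
    for key, pattern in zip(left[left_col], left[regex_col]):
        compiled = re.compile(pattern)
        matches = [value for value in right[right_col] if compiled.match(value)]
        if not matches:
            # No match: keep the left row and leave the right side empty.
            matches = [None]
        rows.extend((key, value) for value in matches)
    return pd.DataFrame(rows, columns=[left_col, right_col])

# Sample data from the question, plus an assumed row 'e' that matches nothing.
df1 = pd.DataFrame(['a', 'b', 'c', 'd', 'e'], columns=['col1'])
df1['regex'] = '.*' + df1['col1'] + '.*'
df2 = pd.DataFrame(['ab', 'a', 'cd'], columns=['col2'])

result = merge_regex_left(df1, df2)
print(result)
```

The first five rows are the same pairs as in the output above; the sixth row holds 'e' with a missing col2 value. Compiling each pattern once per left row also avoids repeated lookups in the re module's cache.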
Timeit results
    # My solution
    In [292]: %timeit merge_regex(df1,df2)
    1.21 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    # Chris's solution
    In [293]: %%timeit
         ...: df1['matches'] = df1.apply(lambda r: [x for x in df2['col2'].values if re.findall(r['regex'], x)], axis=1)
         ...: df1.set_index('col1').explode('matches').reset_index().drop(columns=['regex'])
         ...:
    4.62 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
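If you want to repeat the comparison outside IPython, the standard-library timeit module can be used instead of the %timeit magic. The sketch below is only one way to set it up: the wrapper name merge_apply_explode, the df1.copy() call (added so repeated runs do not keep modifying df1), and the loop count of 100 are assumptions, and absolute timings will differ by machine and pandas version.

```python
import re
import timeit
import pandas as pd

df1 = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['col1'])
df1['regex'] = '.*' + df1['col1'] + '.*'
df2 = pd.DataFrame(['ab', 'a', 'cd'], columns=['col2'])

def merge_regex(df1, df2):
    # Index-pair approach from the answer above.
    idx = [(i, j) for i, r in enumerate(df1.regex) for j, v in enumerate(df2.col2) if re.match(r, v)]
    df1_idx, df2_idx = zip(*idx)
    t = df1.iloc[list(df1_idx), 0].reset_index(drop=True)
    t1 = df2.iloc[list(df2_idx), 0].reset_index(drop=True)
    return pd.concat([t, t1], axis=1)

def merge_apply_explode(df1, df2):
    # apply/explode approach from the timing cell, wrapped in a function
    # (hypothetical name) and run on a copy so df1 is left unchanged.
    out = df1.copy()
    out['matches'] = out.apply(
        lambda r: [x for x in df2['col2'].values if re.findall(r['regex'], x)], axis=1)
    return out.set_index('col1').explode('matches').reset_index().drop(columns=['regex'])

n = 100  # assumed number of loops per measurement
t_pairs = timeit.timeit(lambda: merge_regex(df1, df2), number=n) / n
t_explode = timeit.timeit(lambda: merge_apply_explode(df1, df2), number=n) / n

print(f"merge_regex:         {t_pairs * 1e3:.3f} ms per loop")
print(f"merge_apply_explode: {t_explode * 1e3:.3f} ms per loop")
```

Both functions return the same five pairs for the sample data; the second one names its right-hand column matches rather than col2.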