I need to improve the performance of the following dataframe slices matching. What I need to do is find the matching trips between 2 dataframes, according to the sequence column values with order conserved.
My 2 dataframes:
JavaScript
x
20
20
1
>>>df1
2
trips sequence
3
0 11 a
4
1 11 d
5
2 21 d
6
3 21 a
7
4 31 a
8
5 31 b
9
6 31 c
10
11
>>>df2
12
trips sequence
13
0 12 a
14
1 12 d
15
2 22 c
16
3 22 b
17
4 22 a
18
5 32 a
19
6 32 d
20
Expected output:
JavaScript
1
2
1
['11 match 12']
2
This is the following code I’ m using:
JavaScript
1
18
18
1
import pandas as pd
2
import numpy as np
3
4
df1 = pd.DataFrame({'trips': [11, 11, 21, 21, 31, 31, 31], 'sequence': ['a', 'd', 'd', 'a', 'a', 'b', 'c']})
5
6
df2 = pd.DataFrame({'trips': [12, 12, 22, 22, 22, 32, 32], 'sequence': ['a', 'd', 'c', 'b', 'a', 'a', 'd']})
7
8
route_match = []
9
for trip1 in df1['trips'].drop_duplicates():
10
for trip2 in df2['trips'].drop_duplicates():
11
route1 = df1[df1['trips'] == trip1]['sequence']
12
route2 = df2[df2['trips'] == trip2]['sequence']
13
if np.array_equal(route1.values,route2.values):
14
route_match.append(str(trip1) + ' match ' + str(trip2))
15
break
16
else:
17
continue
18
Despite working, this is very time costly and unefficient as my real dataframes are longer. Any suggestions?
Advertisement
Answer
You can aggregate each trip as tuple with groupby.agg
, then merge
the two outputs to identify the identical routes:
JavaScript
1
5
1
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
2
df2.groupby('trips', as_index=False)['sequence'].agg(tuple),
3
on='sequence'
4
)
5
output:
JavaScript
1
4
1
trips_x sequence trips_y
2
0 11 (a, d) 12
3
1 11 (a, d) 32
4
If you only want the first match, drop_duplicates
the output of df2
aggregation to prevent unnecessary merging:
JavaScript
1
6
1
out = pd.merge(df1.groupby('trips', as_index=False)['sequence'].agg(tuple),
2
df2.groupby('trips', as_index=False)['sequence'].agg(tuple)
3
.drop_duplicates(subset='sequence'),
4
on='sequence'
5
)
6
output:
JavaScript
1
3
1
trips_x sequence trips_y
2
0 11 (a, d) 12
3