I need to find the `user_id` values of users standing close to each other. We have the following data:
    import pandas as pd

    d = {'user_id': [11, 24, 101, 214, 302, 335],
         'worker_latitude': [-34.6209, -2.7572, 55.6621,
                             55.114462, 55.6622, -34.6209],
         'worker_longitude': [-58.3742, 52.3879, 56.6621, 38.927156,
                              56.6622, 39.018]}
    df = pd.DataFrame(data=d)
    df
       user_id  worker_latitude  worker_longitude
    0       11       -34.620900        -58.374200
    1       24        -2.757200         52.387900
    2      101        55.662100         56.662100
    3      214        55.114462         38.927156
    4      302        55.662200         56.662200
    5      335       -34.620900         39.018000
In this dataset, that would be the users with IDs 101 and 302. Our real dataset has millions of rows, though. Are there any built-in functions in pandas or Python that solve this?
Answer
Assuming workers must share exactly the same location to count as standing close by, grouping by location matches them efficiently. Note that the sample data here gives user 302 exactly the same coordinates as user 101:
    from itertools import combinations

    import pandas as pd

    d = {'user_id': [11, 24, 101, 214, 302, 335],
         'worker_latitude': [-34.6209, -2.7572, 55.6621,
                             55.114462, 55.6621, -34.6209],
         'worker_longitude': [-58.3742, 52.3879, 56.6621, 38.927156,
                              56.6621, 39.018]}
    df = pd.DataFrame(data=d)

    # For each distinct (latitude, longitude), list every pair of user_ids found there.
    matched_workers = df.groupby(['worker_latitude', 'worker_longitude']).apply(
        lambda rows: list(combinations(rows['user_id'], r=2)))
    # Drop locations with a single user (their pair list is empty, hence falsy).
    matched_workers = matched_workers.loc[matched_workers.apply(bool)]
Which outputs:
    worker_latitude  worker_longitude
    55.6621          56.6621             [(101, 302)]
    dtype: object
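The grouping above only matches identical coordinates, while the question's original data places users 101 and 302 about 0.0001 degrees apart. One possible way to relax the match, shown here as a minimal sketch rather than a definitive solution, is to round the coordinates to a coarse grid before grouping. The `DECIMALS` tolerance and the `lat_cell`/`lon_cell` column names are assumptions made for illustration; they do not come from the question.

```python
from itertools import combinations

import pandas as pd

# The question's original data: users 101 and 302 differ by 0.0001 degrees.
d = {'user_id': [11, 24, 101, 214, 302, 335],
     'worker_latitude': [-34.6209, -2.7572, 55.6621,
                         55.114462, 55.6622, -34.6209],
     'worker_longitude': [-58.3742, 52.3879, 56.6621, 38.927156,
                          56.6622, 39.018]}
df = pd.DataFrame(data=d)

# Assumed tolerance: 3 decimal places is on the order of 100 m of latitude.
DECIMALS = 3

# Snap every point to a grid cell by rounding its coordinates.
cells = df.assign(
    lat_cell=df['worker_latitude'].round(DECIMALS),
    lon_cell=df['worker_longitude'].round(DECIMALS),
)

# Keep only rows whose grid cell contains more than one user.
shared = cells[cells.duplicated(['lat_cell', 'lon_cell'], keep=False)]

# Enumerate the user_id pairs inside each shared cell.
near_pairs = [
    pair
    for _, ids in shared.groupby(['lat_cell', 'lon_cell'])['user_id']
    for pair in combinations(ids.tolist(), r=2)
]
print(near_pairs)
```

Rounding is only an approximation: two users very close to each other but on opposite sides of a cell boundary land in different cells and are missed. If a true distance threshold is needed, a spatial index such as scikit-learn's `BallTree` with the haversine metric may be a better fit.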