I need to find the 'user_id' of users standing close to each other. We have the following data:
import pandas as pd

d = {'user_id': [11, 24, 101, 214, 302, 335],
     'worker_latitude': [-34.6209, -2.7572, 55.6621, 55.114462, 55.6622, -34.6209],
     'worker_longitude': [-58.3742, 52.3879, 56.6621, 38.927156, 56.6622, 39.018]}
df = pd.DataFrame(data=d)
df
   user_id  worker_latitude  worker_longitude
0       11       -34.620900        -58.374200
1       24        -2.757200         52.387900
2      101        55.662100         56.662100
3      214        55.114462         38.927156
4      302        55.662200         56.662200
5      335       -34.620900         39.018000
In this dataset, that would be the users with IDs 101 and 302. Our real dataset, however, has millions of rows. Are there any built-in functions in pandas or Python to solve this?
Answer
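Assuming workers must share exactly the same location to count as standing close by, a groupby on the location columns can match them efficiently. Note that the sample data here sets user 302's coordinates equal to user 101's, so the two form an exact match: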
from itertools import combinations

import pandas as pd

d = {'user_id': [11, 24, 101, 214, 302, 335],
     'worker_latitude': [-34.6209, -2.7572, 55.6621, 55.114462, 55.6621, -34.6209],
     'worker_longitude': [-58.3742, 52.3879, 56.6621, 38.927156, 56.6621, 39.018]}
df = pd.DataFrame(data=d)

# For each distinct (latitude, longitude), list every pair of user_ids found there.
matched_workers = df.groupby(['worker_latitude', 'worker_longitude']).apply(
    lambda rows: list(combinations(rows['user_id'], r=2)))
# Drop locations with a single user, where the list of pairs is empty.
matched_workers = matched_workers.loc[matched_workers.apply(bool)]
Which outputs:
worker_latitude  worker_longitude
55.6621          56.6621             [(101, 302)]
dtype: object
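The groupby above only matches identical coordinates. In the question's original data, users 101 and 302 differ by 0.0001 degrees, so they would not be grouped together. Below is a minimal sketch of one way to match users within a distance threshold, using only the standard library. It is not a pandas built-in: the names `haversine_m` and `close_pairs` and the 100 m threshold are assumptions made for illustration. The idea is to sort by latitude so that each user is only compared against users in a narrow latitude band, then confirm each candidate with the haversine great-circle distance.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def close_pairs(users, max_dist_m=100.0):
    """Return sorted (id_a, id_b) pairs that are at most max_dist_m apart.

    users: iterable of (user_id, latitude, longitude) tuples.
    """
    # Sort by latitude so candidates for each user form a contiguous run.
    ordered = sorted(users, key=lambda u: u[1])
    # Two points within max_dist_m cannot differ in latitude by more than this.
    max_dlat = math.degrees(max_dist_m / EARTH_RADIUS_M)
    pairs = []
    for i, (id_a, lat_a, lon_a) in enumerate(ordered):
        j = i + 1
        # Stop scanning as soon as the latitude gap alone exceeds the threshold.
        while j < len(ordered) and ordered[j][1] - lat_a <= max_dlat:
            id_b, lat_b, lon_b = ordered[j]
            if haversine_m(lat_a, lon_a, lat_b, lon_b) <= max_dist_m:
                pairs.append(tuple(sorted((id_a, id_b))))
            j += 1
    return sorted(pairs)


# The question's original sample data as (user_id, latitude, longitude).
users = [
    (11, -34.6209, -58.3742),
    (24, -2.7572, 52.3879),
    (101, 55.6621, 56.6621),
    (214, 55.114462, 38.927156),
    (302, 55.6622, 56.6622),
    (335, -34.6209, 39.018),
]

print(close_pairs(users, max_dist_m=100.0))  # [(101, 302)]
```

With the DataFrame from the question, the input tuples could be produced with `df.itertuples(index=False, name=None)`. The latitude sweep prunes most comparisons, but it can still be slow when many users sit at similar latitudes. For millions of rows, a spatial index such as scikit-learn's `BallTree` with the haversine metric or SciPy's `cKDTree` is likely to scale better.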