
How to find lines in pandas columns with close values?

I need to find the ‘user_id’ of users who are standing close to each other. Given this data:

import pandas as pd

d = {'user_id': [11, 24, 101, 214, 302, 335],
     'worker_latitude': [-34.6209, -2.7572, 55.6621,
                         55.114462, 55.6622, -34.6209],
     'worker_longitude': [-58.3742, 52.3879, 56.6621, 38.927156,
                          56.6622, 39.018]}
df = pd.DataFrame(data=d)
df
   user_id  worker_latitude  worker_longitude
0       11       -34.620900        -58.374200
1       24        -2.757200         52.387900
2      101        55.662100         56.662100
3      214        55.114462         38.927156
4      302        55.662200         56.662200
5      335       -34.620900         39.018000

In this dataset, that would be the users with ids ‘101’ and ‘302’. Our real dataset, however, has millions of rows. Are there any built-in functions in pandas or Python that solve this?


Answer

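Assuming workers must share exactly the same location to count as standing close by, a groupby on location matches them efficiently. Note that the sample data below gives user 302 the same coordinates as user 101 (55.6621, 56.6621), rather than the 55.6622, 56.6622 used in the question: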

from itertools import combinations

import pandas as pd

d = {'user_id': [11, 24, 101, 214, 302, 335],
     'worker_latitude': [-34.6209, -2.7572, 55.6621,
                         55.114462, 55.6621, -34.6209],
     'worker_longitude': [-58.3742, 52.3879, 56.6621, 38.927156,
                          56.6621, 39.018]}
df = pd.DataFrame(data=d)

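# For each distinct (latitude, longitude), list every pair of user ids found there.
matched_workers = df.groupby(['worker_latitude', 'worker_longitude']).apply(
    lambda rows: list(combinations(rows['user_id'], r=2)))
# Keep only locations with at least one pair (an empty list is falsy).
matched_workers = matched_workers.loc[matched_workers.apply(bool)]
print(matched_workers)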

Which outputs:

worker_latitude  worker_longitude
55.6621          56.6621             [(101, 302)]
dtype: object
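
The exact-match groupby above would not pair users 101 and 302 with the coordinates as posted in the question, since they differ in the fourth decimal place. One possible workaround, sketched here under assumptions that are not part of the original answer, is to round the coordinates to a coarser grid before grouping. The `DECIMALS` value and the `lat_cell` / `lon_cell` column names are hypothetical choices for illustration; three decimal places corresponds to very roughly 100 m in latitude.

```python
from itertools import combinations

import pandas as pd

# Sample data as posted in the question: users 101 and 302 are close,
# but their coordinates are not identical.
d = {'user_id': [11, 24, 101, 214, 302, 335],
     'worker_latitude': [-34.6209, -2.7572, 55.6621,
                         55.114462, 55.6622, -34.6209],
     'worker_longitude': [-58.3742, 52.3879, 56.6621, 38.927156,
                          56.6622, 39.018]}
df = pd.DataFrame(data=d)

# Assumption: rounding to 3 decimal places defines "close" for this sketch.
DECIMALS = 3

# Snap every coordinate to a grid cell (hypothetical helper columns).
cells = df.assign(lat_cell=df['worker_latitude'].round(DECIMALS),
                  lon_cell=df['worker_longitude'].round(DECIMALS))

# Collect every pair of user ids that falls into the same grid cell.
near_pairs = [pair
              for _, ids in cells.groupby(['lat_cell', 'lon_cell'])['user_id']
              for pair in combinations(ids.tolist(), r=2)]
print(near_pairs)  # [(101, 302)]
```

Rounding keeps the approach a single vectorized groupby, which matters with millions of rows, but it is only an approximation: two users just either side of a cell boundary end up in different groups and are missed. If that matters, a spatial index with a proper distance metric is probably the better tool.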