I have a very huge dataFrame with many datapoints on a map with outliers which are very close to each other on the dataset(Latitudes and longitudes). I would like to group all the rows as shown below for column A, calculate their zscores and replace every value within a group whose zscore is > 1.5 with the mean value for the group.
df = [data][1]
I have tried the zscore values table without success
<**zscore = lambda x : (x - x.mean()) / x.std() grouped_df = df.groupby("A") transformed_df = grouped_df.transform(zscore) transformed_df which gives me a table with zscores**>
Advertisement
Answer
You can use haversine_distances
from scikit-learn
to compute the distances between a point and the centroid of the point in the same group. Given that you should have very close points, you can approximate the latitude and longitude of the centroid with the mean of latitude and longitude of points in the group.
Here an example, based on data from UK towns (it is the free sample that you can download from here). In particular, the data contains for each city its coordinates and county (that you can think of as a group in your setting):
name county latitude longitude 0 Aaron's Hill Surrey 51.18291 -0.63098 1 Abbas Combe Somerset 51.00283 -2.41825 2 Abberley Worcestershire 52.30522 -2.37574 3 Abberton Essex 51.83440 0.91066 4 Abberton Worcestershire 52.17955 -2.00817 5 Abberwick Northumberland 55.41325 -1.79720 6 Abbess End Essex 51.78000 0.28172 7 Abbess Roding Essex 51.77815 0.27685 8 Abbey Devon 50.88896 -3.22276 9 Abbeycwmhir / Abaty Cwm-hir Powys 52.33104 -3.38988
And here the code to change to solve your problem:
from math import radians import numpy as np import pandas as pd from sklearn.metrics.pairwise import haversine_distances df = pd.read_csv('uk-towns-sample.csv', usecols=['name', 'county', 'latitude', 'longitude']) # Compute coordinates of the centroid for each county (group) dist_county = pd.DataFrame(df.groupby('county').agg({'latitude': np.mean, 'longitude': np.mean})) # Convert latitude and longitude to radians (it is needed by the function to compute haversine distance) df[['latitude_radians', 'longitude_radians']] = df[['latitude', 'longitude']].applymap(radians) dist_county[['latitude_radians', 'longitude_radians']] = dist_county[['latitude', 'longitude']].applymap(radians) # Compute the distance of each town w.r.t. the centroid of its conunty df['dist'] = df[['county', 'latitude_radians', 'longitude_radians']].apply( lambda x: haversine_distances( [x[['latitude_radians', 'longitude_radians']].values], [dist_county.loc[x['county']][['latitude_radians', 'longitude_radians']].values] )[0][0] * 6371000/1000, # multiply by Earth radius to get kilometers, axis=1 ) # Compute mean and std of distances by county county_stats = df.groupby('county').agg({'dist': [np.mean, np.std]}) # Compute the z-score using the distance of each town w.r.t. the centroid of its county, and the mean and std of distances for that county df['zscore'] = df.apply( lambda x: (x['dist'] - county_stats.loc[x['county']][('dist', 'mean')] ) / county_stats.loc[x['county']][('dist', 'std')], axis=1 ) # Change latitude and longitude of the outliers with those of the centroid of their counties df.loc[df.zscore > 1.5, ['latitude', 'longitude']] = df[df.zscore > 1.5].merge( dist_county, left_on='county', right_on=dist_county.index, how='left' )[['latitude_y', 'longitude_y']].values
The resulting DataFrame df
looks like:
name county latitude longitude latitude_radians longitude_radians dist zscore 0 Aaron's Hill Surrey 51.18291 -0.63098 0.893310 -0.011013 12.479147 -0.293419 1 Abbas Combe Somerset 51.00283 -2.41825 0.890167 -0.042206 35.205157 1.088695 2 Abberley Worcestershire 52.30522 -2.37574 0.912898 -0.041464 17.014249 0.266168 3 Abberton Essex 51.83440 0.91066 0.904681 0.015894 24.504285 -0.254400 4 Abberton Worcestershire 52.17955 -2.00817 0.910705 -0.035049 11.906150 -0.663460 ... ... ... ... ... ... ... ... ... 1795 Ayton Berwickshire 55.84232 -2.12285 0.974632 -0.037051 5.899085 0.007876 1796 Ayton Tyne and Wear 54.89416 -1.55643 0.958084 -0.027165 3.192591 -0.935937
If you look at outliers for Essex county, the new coordinates correspond to those of the centroid, i.e. (51.846594, 0.554532):
name county latitude longitude 414 Aimes Green Essex 51.846594 0.554532 1721 Aveley Essex 51.846594 0.554532