Replace outliers in grouped columns with the group mean based on a defined z-score

Question

I have a very large DataFrame with many data points on a map, including outliers that are very close to each other in the dataset (latitudes and longitudes). I would like to group all the rows as shown below for column A, calculate their z-scores, and replace every value within a group whose z-score is > 1.5 with the…

Accepted Answer

You can use haversine_distances from scikit-learn to compute the distance between each point and the centroid of the points in the same group. Given that your points should be very close to each other, you can approximate the latitude and longitude of the centroid with the mean latitude and mean longitude of the points in the group.

Here is an example based on data from UK towns (it is the free sample that you can download from here). In particular, the data contains, for each town, its coordinates and its county (which you can think of as a group in your setting):

```
                          name          county  latitude  longitude
0                 Aaron's Hill          Surrey  51.18291   -0.63098
1                  Abbas Combe        Somerset  51.00283   -2.41825
2                     Abberley  Worcestershire  52.30522   -2.37574
3                     Abberton           Essex  51.83440    0.91066
4                     Abberton  Worcestershire  52.17955   -2.00817
5                    Abberwick  Northumberland  55.41325   -1.79720
6                   Abbess End           Essex  51.78000    0.28172
7                Abbess Roding           Essex  51.77815    0.27685
8                        Abbey           Devon  50.88896   -3.22276
9  Abbeycwmhir / Abaty Cwm-hir           Powys  52.33104   -3.38988
```

And here is the code, which you can adapt to solve your problem:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

df = pd.read_csv('uk-towns-sample.csv', usecols=['name', 'county', 'latitude', 'longitude'])

# Compute the coordinates of the centroid for each county (group)
dist_county = df.groupby('county').agg({'latitude': 'mean', 'longitude': 'mean'})

# Convert latitude and longitude to radians (required by haversine_distances)
df[['latitude_radians', 'longitude_radians']] = np.radians(df[['latitude', 'longitude']].to_numpy())
dist_county[['latitude_radians', 'longitude_radians']] = np.radians(
    dist_county[['latitude', 'longitude']].to_numpy()
)

# Compute the distance of each town w.r.t. the centroid of its county
df['dist'] = df[['county', 'latitude_radians', 'longitude_radians']].apply(
    lambda x: haversine_distances(
        [x[['latitude_radians', 'longitude_radians']].values],
        [dist_county.loc[x['county']][['latitude_radians', 'longitude_radians']].values]
    )[0][0] * 6371000 / 1000,  # multiply by Earth radius (in meters), then convert to kilometers
    axis=1
)

# Compute mean and std of distances by county
county_stats = df.groupby('county').agg({'dist': ['mean', 'std']})

# Compute the z-score using the distance of each town w.r.t. the centroid of its county,
# and the mean and std of distances for that county
df['zscore'] = df.apply(
    lambda x: (x['dist'] - county_stats.loc[x['county']][('dist', 'mean')])
    / county_stats.loc[x['county']][('dist', 'std')],
    axis=1
)

# Replace latitude and longitude of the outliers with those of the centroid of their counties
df.loc[df.zscore > 1.5, ['latitude', 'longitude']] = df[df.zscore > 1.5].merge(
    dist_county, left_on='county', right_on=dist_county.index, how='left'
)[['latitude_y', 'longitude_y']].values
```

The resulting DataFrame df looks like:

```
              name           county  latitude  longitude  latitude_radians  longitude_radians       dist    zscore
0     Aaron's Hill           Surrey  51.18291   -0.63098          0.893310          -0.011013  12.479147 -0.293419
1      Abbas Combe         Somerset  51.00283   -2.41825          0.890167          -0.042206  35.205157  1.088695
2         Abberley   Worcestershire  52.30522   -2.37574          0.912898          -0.041464  17.014249  0.266168
3         Abberton            Essex  51.83440    0.91066          0.904681           0.015894  24.504285 -0.254400
4         Abberton   Worcestershire  52.17955   -2.00817          0.910705          -0.035049  11.906150 -0.663460
...            ...              ...       ...        ...               ...                ...        ...       ...
1795         Ayton     Berwickshire  55.84232   -2.12285          0.974632          -0.037051   5.899085  0.007876
1796         Ayton    Tyne and Wear  54.89416   -1.55643          0.958084          -0.027165   3.192591 -0.935937
```

If you look at the outliers for the county of Essex, the new coordinates correspond to those of the centroid, i.e. (51.846594, 0.554532):

```
             name county   latitude  longitude
414   Aimes Green  Essex  51.846594   0.554532
1721       Aveley  Essex  51.846594   0.554532
```
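Since the question mentions a very large DataFrame, the row-wise apply calls in the answer above may become slow. The following is a minimal sketch of a vectorized variant, not the answer's own method: it swaps scikit-learn's haversine_distances for a hand-written NumPy haversine formula and uses groupby(...).transform to broadcast per-group statistics back to the rows. The function names haversine_km and replace_outliers_with_centroid, the group_col and threshold parameters, and the Earth radius of 6371 km are assumptions made for illustration; the latitude and longitude column names follow the example data.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, same value as in the answer


def haversine_km(lat1, lon1, lat2, lon2):
    """Element-wise great-circle distance in km; inputs are in degrees."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against tiny floating-point overshoots outside [0, 1]
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def replace_outliers_with_centroid(df, group_col="county", threshold=1.5):
    """Hypothetical helper: returns a copy of df with outliers moved to the group centroid."""
    out = df.copy()

    # Per-row centroid of the row's group (mean latitude / longitude)
    grouped = out.groupby(group_col)
    centroid_lat = grouped["latitude"].transform("mean")
    centroid_lon = grouped["longitude"].transform("mean")

    # Distance of every row to its group centroid, computed in one vectorized call
    out["dist"] = haversine_km(out["latitude"], out["longitude"], centroid_lat, centroid_lon)

    # z-score of each distance within its group (sample std, as pandas' default)
    dist_by_group = out.groupby(group_col)["dist"]
    out["zscore"] = (out["dist"] - dist_by_group.transform("mean")) / dist_by_group.transform("std")

    # Rows whose z-score is NaN (e.g. single-row groups) compare as False and stay unchanged
    is_outlier = out["zscore"] > threshold
    out.loc[is_outlier, "latitude"] = centroid_lat[is_outlier]
    out.loc[is_outlier, "longitude"] = centroid_lon[is_outlier]
    return out
```

Calling replace_outliers_with_centroid(df) on the towns DataFrame should flag the same kind of outliers as the code above, although small numerical differences in dist are possible because the distance is computed by a different implementation.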

