Replace grouped columns’ outliers with mean of the group based on defined zscore

Tags: , ,



I have a very huge dataFrame with many datapoints on a map with outliers which are very close to each other on the dataset(Latitudes and longitudes). I would like to group all the rows as shown below for column A, calculate their zscores and replace every value within a group whose zscore is > 1.5 with the mean value for the group.

df =

[data][1]

I have tried the zscore values table without success

<**zscore = lambda x : (x - x.mean()) / x.std()
grouped_df = df.groupby("A")
transformed_df = grouped_df.transform(zscore)
transformed_df which gives me a table with zscores**>

Answer

You can use haversine_distances from scikit-learn to compute the distances between a point and the centroid of the point in the same group. Given that you should have very close points, you can approximate the latitude and longitude of the centroid with the mean of latitude and longitude of points in the group.

Here an example, based on data from UK towns (it is the free sample that you can download from here). In particular, the data contains for each city its coordinates and county (that you can think of as a group in your setting):

                          name          county  latitude  longitude
0                 Aaron's Hill          Surrey  51.18291   -0.63098
1                  Abbas Combe        Somerset  51.00283   -2.41825
2                     Abberley  Worcestershire  52.30522   -2.37574
3                     Abberton           Essex  51.83440    0.91066
4                     Abberton  Worcestershire  52.17955   -2.00817
5                    Abberwick  Northumberland  55.41325   -1.79720
6                   Abbess End           Essex  51.78000    0.28172
7                Abbess Roding           Essex  51.77815    0.27685
8                        Abbey           Devon  50.88896   -3.22276
9  Abbeycwmhir / Abaty Cwm-hir           Powys  52.33104   -3.38988

And here the code to change to solve your problem:

from math import radians

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

df = pd.read_csv('uk-towns-sample.csv', usecols=['name', 'county', 'latitude', 'longitude'])

# Compute coordinates of the centroid for each county (group)
dist_county = pd.DataFrame(df.groupby('county').agg({'latitude': np.mean, 'longitude': np.mean}))

# Convert latitude and longitude to radians (it is needed by the function to compute haversine distance)
df[['latitude_radians', 'longitude_radians']] = df[['latitude', 'longitude']].applymap(radians)
dist_county[['latitude_radians', 'longitude_radians']] = dist_county[['latitude', 'longitude']].applymap(radians)

# Compute the distance of each town w.r.t. the centroid of its conunty
df['dist'] = df[['county', 'latitude_radians', 'longitude_radians']].apply(
    lambda x: haversine_distances(
        [x[['latitude_radians', 'longitude_radians']].values],
        [dist_county.loc[x['county']][['latitude_radians', 'longitude_radians']].values]
    )[0][0] * 6371000/1000,  # multiply by Earth radius to get kilometers,
    axis=1
)

# Compute mean and std of distances by county
county_stats = df.groupby('county').agg({'dist': [np.mean, np.std]})

# Compute the z-score using the distance of each town w.r.t. the centroid of its county, and the mean and std of distances for that county
df['zscore'] = df.apply(
    lambda x: (x['dist'] - county_stats.loc[x['county']][('dist', 'mean')] ) / county_stats.loc[x['county']][('dist', 'std')],
    axis=1
)

# Change latitude and longitude of the outliers with those of the centroid of their counties
df.loc[df.zscore > 1.5, ['latitude', 'longitude']] = df[df.zscore > 1.5].merge(
    dist_county, left_on='county', right_on=dist_county.index, how='left'
)[['latitude_y', 'longitude_y']].values

The resulting DataFrame df looks like:

              name           county  latitude  longitude  latitude_radians  longitude_radians       dist    zscore
0     Aaron's Hill           Surrey  51.18291   -0.63098          0.893310          -0.011013  12.479147 -0.293419
1      Abbas Combe         Somerset  51.00283   -2.41825          0.890167          -0.042206  35.205157  1.088695
2         Abberley   Worcestershire  52.30522   -2.37574          0.912898          -0.041464  17.014249  0.266168
3         Abberton            Essex  51.83440    0.91066          0.904681           0.015894  24.504285 -0.254400
4         Abberton   Worcestershire  52.17955   -2.00817          0.910705          -0.035049  11.906150 -0.663460
...            ...              ...       ...        ...               ...                ...        ...       ...
1795         Ayton     Berwickshire  55.84232   -2.12285          0.974632          -0.037051   5.899085  0.007876
1796         Ayton    Tyne and Wear  54.89416   -1.55643          0.958084          -0.027165   3.192591 -0.935937

If you look at outliers for Essex county, the new coordinates correspond to those of the centroid, i.e. (51.846594, 0.554532):

             name county   latitude  longitude
414   Aimes Green  Essex  51.846594   0.554532
1721       Aveley  Essex  51.846594   0.554532


Source: stackoverflow