
Replace grouped columns’ outliers with mean of the group based on defined zscore

I have a very large DataFrame with many data points on a map (latitudes and longitudes). The points lie very close to each other, but the dataset also contains outliers. I would like to group the rows as shown below for column A, calculate the z-scores within each group, and replace every value whose z-score is greater than 1.5 with the mean value of its group.

JavaScript

I have tried computing a table of z-score values, without success:

JavaScript


Answer

You can use haversine_distances from scikit-learn to compute the distance between each point and the centroid of the points in the same group. Since the points in a group should be very close to each other, you can approximate the latitude and longitude of the centroid with the mean latitude and mean longitude of the points in the group.
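As a rough illustration of this idea, here is a minimal sketch. It is not the code from the original answer. The function name `replace_outliers_with_centroid` and the column names `county`, `latitude` and `longitude` are assumptions made for the example. It also assumes float coordinate columns, a unique row index, and that the z-score is computed on each point's distance to its group centroid, with only points farther than the threshold treated as outliers.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; scales unit-sphere distances to km


def replace_outliers_with_centroid(df, group_col="county", lat_col="latitude",
                                   lon_col="longitude", threshold=1.5):
    """Return a copy of df in which, within each group, every point whose
    distance-to-centroid z-score exceeds `threshold` is moved to the centroid.

    Column names are hypothetical defaults chosen for this sketch.
    """
    out = df.copy()
    for _, idx in out.groupby(group_col).groups.items():
        coords = out.loc[idx, [lat_col, lon_col]].to_numpy(dtype=float)
        # Approximate the centroid by the mean latitude and longitude.
        centroid = coords.mean(axis=0, keepdims=True)
        # haversine_distances expects [lat, lon] in radians.
        dist_km = haversine_distances(
            np.radians(coords), np.radians(centroid)
        ).ravel() * EARTH_RADIUS_KM
        std = dist_km.std()
        if std == 0:
            continue  # all points equally far from the centroid: nothing to flag
        z = (dist_km - dist_km.mean()) / std
        rows = idx[z > threshold]
        out.loc[rows, lat_col] = centroid[0, 0]
        out.loc[rows, lon_col] = centroid[0, 1]
    return out
```

Calling `replace_outliers_with_centroid(df, threshold=1.5)` leaves the input untouched and returns the adjusted copy. Note that the centroid is computed before any replacement, so it still includes the outliers; whether that is acceptable depends on how extreme they are.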

Here is an example based on data from UK towns (it is the free sample that you can download from here). The data contains, for each city, its coordinates and its county (which you can think of as a group in your setting):

JavaScript

And here is the code to solve your problem:

JavaScript

The resulting DataFrame df looks like:

JavaScript

If you look at outliers for Essex county, the new coordinates correspond to those of the centroid, i.e. (51.846594, 0.554532):

JavaScript
User contributions licensed under: CC BY-SA