Skip to content
Advertisement

How to calculate the outliers in a Pandas dataframe while excluding NaN values

I have a pandas dataframe that should look like this.

   X     Y     Z           Is Outlier
0  9.5 -2.3   4.13         False
1  17.5 3.3   0.22         False
2  NaN  NaN  -5.67         NaN
3  547.16  11.17  -288.67  True
4  -0.05  3.55  6.78       False
...

Some values in this dataframe are outliers. I came across this method of calculating the outliers in every colum using the z score:

df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

My goal is to create a column Is Outlier and put a True/False on each row that has/doesn’t have at least one outlier and NaN for rows with at least one NaN value and, at the same time, keep a count of all “True” values.

This is my code so far.

dt = np.fromfile(path, dtype='float')
df = pd.DataFrame(dt.reshape(-1, 3), column = ['X', 'Y', 'Z'])

How can I go about doing this?

Advertisement

Answer

If you consider NaN rows to be noise, you can compute the zscore dropping them, this will automatically give you NaNs when you assign the result:

from scipy.stats import zscore

thresh = 1

df['Is Outlier'] = zscore(df[['X', 'Y', 'Z']].dropna()).ge(thresh).any(1)

NB. I used at threshold of 1 for the example here.

Output:

        X      Y       Z Is Outlier
0    9.50  -2.30    4.13      False
1   17.50   3.30    0.22      False
2     NaN    NaN   -5.67        NaN
3  547.16  11.17 -288.67       True
4   -0.05   3.55    6.78      False

Alternatively, zscore has a nan_policy='omit' option, but this wouldn’t directly give you NaN in the output. The zscore computation however will use all values, including those from NaN rows. (This makes no difference in the final result here).

from scipy.stats import zscore

thresh = 1

df['Is Outlier'] = (zscore(df[['X', 'Y', 'Z']], nan_policy='omit')
                    .ge(thresh).any(1)
                    .mask(df[['X', 'Y', 'Z']].isna().any(1))
                    )

Output:

        X      Y       Z Is Outlier
0    9.50  -2.30    4.13      False
1   17.50   3.30    0.22      False
2     NaN    NaN   -5.67        NaN
3  547.16  11.17 -288.67       True
4   -0.05   3.55    6.78      False
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement