How to calculate the outliers in a Pandas dataframe while excluding NaN values

Question

I have a pandas dataframe that should look like this. Some values in this dataframe are outliers. I came across this method of calculating the outliers in every column using the z-score: My goal is to create a column Is Outlier and put True/False on each row that has/doesn't have at least one outlier, and NaN for rows that contain NaN values.

Accepted Answer

If you consider rows containing NaN to be noise, you can drop them before computing the z-score. Assignment aligns on the index, so the dropped rows automatically get NaN when you assign the result:

```python
from scipy.stats import zscore

thresh = 1
df['Is Outlier'] = zscore(df[['X', 'Y', 'Z']].dropna()).ge(thresh).any(axis=1)
```

NB. I used a threshold of 1 for the example here.

Output:

```
        X      Y       Z Is Outlier
0    9.50  -2.30    4.13      False
1   17.50   3.30    0.22      False
2     NaN    NaN   -5.67        NaN
3  547.16  11.17 -288.67       True
4   -0.05   3.55    6.78      False
```

Alternatively, zscore has a nan_policy='omit' option, but this wouldn't directly give you NaN in the output, so the rows containing NaN have to be masked explicitly. The z-score computation, however, will then use all values, including those from rows that contain NaN. (This makes no difference in the final result here.)

```python
from scipy.stats import zscore

thresh = 1
df['Is Outlier'] = (zscore(df[['X', 'Y', 'Z']], nan_policy='omit')
                    .ge(thresh).any(axis=1)
                    .mask(df[['X', 'Y', 'Z']].isna().any(axis=1))
                    )
```

Output:

```
        X      Y       Z Is Outlier
0    9.50  -2.30    4.13      False
1   17.50   3.30    0.22      False
2     NaN    NaN   -5.67        NaN
3  547.16  11.17 -288.67       True
4   -0.05   3.55    6.78      False
```
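For readers who want to check the numbers without SciPy, here is a minimal sketch of the same two approaches in plain pandas arithmetic. It is a substitute for `zscore`, not the answer's own code: it assumes the sample values shown in the output above, and it uses `std(ddof=0)` to match the population standard deviation that `scipy.stats.zscore` uses by default. The names `cols`, `complete`, `z_complete`, `z_all` and `alt` are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the output shown in the answer.
df = pd.DataFrame({
    "X": [9.50, 17.50, np.nan, 547.16, -0.05],
    "Y": [-2.30, 3.30, np.nan, 11.17, 3.55],
    "Z": [4.13, 0.22, -5.67, -288.67, 6.78],
})
cols = ["X", "Y", "Z"]
thresh = 1  # same demonstration threshold as the answer

# Approach 1: drop incomplete rows first, then standardise each column.
complete = df[cols].dropna()
z_complete = (complete - complete.mean()) / complete.std(ddof=0)
# Index alignment on assignment leaves NaN for the dropped rows.
df["Is Outlier"] = z_complete.ge(thresh).any(axis=1)

# Approach 2: keep every row; mean() and std() skip NaN by default,
# so the rows containing NaN are masked explicitly afterwards.
z_all = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
alt = z_all.ge(thresh).any(axis=1).mask(df[cols].isna().any(axis=1))

print(df)
print(alt)
```

Both variants flag only row 3 and leave row 2 as NaN, in line with the output above. As in the answer's snippets, the comparison is on the signed z-score, so only unusually large values are flagged; applying `.abs()` before `.ge(thresh)` would also flag unusually small ones, which changes the result at a threshold as low as 1.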


Answer