Elimination of outliers with z-score method in Python

I am cleaning a dataset by removing values whose z-score reaches a threshold of 3. Below is the code I am using. As you can see, I first calculate the mean and the standard deviation. The code then loops over every value, computes its z-score and, if the z-score is 3 or greater, treats the value as an outlier and adds it to the list “outlier”. Finally, the values in the outlier list are removed from the dataset.

# Standard deviation of MonthlyIncome
MonthlyIncome_std = df['MonthlyIncome'].std()
MonthlyIncome_std

# Mean of MonthlyIncome
MonthlyIncome_mean = df['MonthlyIncome'].mean()
MonthlyIncome_mean

threshold = 3
outlier = []
for i in df['MonthlyIncome']:
    z = (i - MonthlyIncome_mean) / MonthlyIncome_std
    if z >= threshold:
        outlier.append(i)

# Drop every row whose MonthlyIncome is in the outlier list
df = df[~df.MonthlyIncome.isin(outlier)]

The above code works fine, but I have to rewrite it for every numerical column. I tried to create a function that does the same thing and can be reused for every numerical column. Here is the function, followed by the error it raises:

def z_score_elimination(df):
    for col in df.columns:
        if df[col].dtypes == 'float64' or df[col].dtypes == 'int64':
            threshold = 3
            outlier = []
            col_mean = col.mean()
            col_std = col.std()
            z = (i-col_mean)/col_std
            if z >= threshold: 
                outlier.append(i) 
                df = df[~df.col.isin(outlier)]

AttributeError                            Traceback (most recent call last)
<ipython-input-62-4f8b1224061e> in <module>
----> 1 z_score_elimination(df)

<ipython-input-61-dc3c84b60dd1> in z_score_elimination(df)
      4             threshold = 3
      5             outlier = []
----> 6             col_mean = col.mean()
      7             col_std = col.std()
      8             z = (i-col_mean)/col_std

AttributeError: 'str' object has no attribute 'mean'

How can I fix the code?

Answer

You are iterating over the column names, which are strings, not over the columns themselves. Use the name to select the column from the DataFrame, for example:

df[col].mean()
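
With that fix applied, the whole function might look something like the sketch below. It is only one possible version, not necessarily what you had in mind: the `threshold` parameter, the use of `select_dtypes` to pick the numeric columns (which also matches integer and float types other than `int64`/`float64`), and the boolean `keep` mask are my own choices. It also computes each column's mean and standard deviation on the original data and filters once at the end, rather than re-filtering `df` after every column.

```python
import pandas as pd


def z_score_elimination(df, threshold=3):
    """Return df without the rows whose z-score is >= threshold
    in at least one numeric column (one-sided, as in the question)."""
    # Assumption: "numerical column" means any numeric dtype.
    numeric_cols = df.select_dtypes(include="number").columns
    keep = pd.Series(True, index=df.index)  # rows to keep so far
    for col in numeric_cols:
        col_std = df[col].std()
        if pd.isna(col_std) or col_std == 0:
            continue  # z-score is undefined for constant or empty columns
        z = (df[col] - df[col].mean()) / col_std
        keep &= ~(z >= threshold)  # NaN comparisons are False, so NaN rows stay
    return df[keep]
```

You would call it as `df = z_score_elimination(df)`. If you also want to drop unusually low values, comparing `z.abs()` against the threshold instead of `z` should do it.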

Answer