I am cleaning a dataset using the z-score with a threshold of 3. Below is the code that I am using. As you can see, I first calculate the mean and the standard deviation. The code then loops over every value, computes its z-score, and checks whether it is greater than or equal to 3. If it is, the value is treated as an outlier and added to the list "outlier". Finally, the rows whose values are in the outlier list are removed from the dataset.
    """SD MonthlyIncome"""
    MonthlyIncome_std = df['MonthlyIncome'].std()
    MonthlyIncome_std

    """MEAN MonthlyIncome"""
    MonthlyIncome_mean = df['MonthlyIncome'].mean()
    MonthlyIncome_mean

    threshold = 3
    outlier = []
    for i in df['MonthlyIncome']:
        z = (i-MonthlyIncome_mean)/MonthlyIncome_std
        if z >= threshold:
            outlier.append(i)

    df = df[~df.MonthlyIncome.isin(outlier)]
The above code works fine, but I have to rewrite it for every numerical column. I was trying to create a function that does the same thing and can be reused for every numerical column. Here is the function:
    def z_score_elimination(df):
        for col in df.columns:
            if df[col].dtypes == 'float64' or df[col].dtypes == 'int64':
                threshold = 3
                outlier = []
                col_mean = col.mean()
                col_std = col.std()
                z = (i-col_mean)/col_std
                if z >= threshold:
                    outlier.append(i)
                df = df[~df.col.isin(outlier)]
    AttributeError                            Traceback (most recent call last)
    <ipython-input-62-4f8b1224061e> in <module>
    ----> 1 z_score_elimination(df)

    <ipython-input-61-dc3c84b60dd1> in z_score_elimination(df)
          4         threshold = 3
          5         outlier = []
    ----> 6         col_mean = col.mean()
          7         col_std = col.std()
          8         z = (i-col_mean)/col_std

    AttributeError: 'str' object has no attribute 'mean'
How can I fix the code?
Answer
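You are iterating over the column names, which are strings, not over the columns themselves, so col.mean() is being called on a str. Index the DataFrame with the name to get the actual column. Try: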
df[col].mean()
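The same change is needed for col.std() and for df.col, which looks for a column literally named "col". The function also uses i without looping over the values. Here is a minimal sketch of how the whole function might look once those points are addressed. It is not the only way to write it: the threshold parameter, the use of select_dtypes to pick numeric columns, and the vectorized comparison that replaces the inner loop are my own choices, not something from the question.

```python
import pandas as pd


def z_score_elimination(df, threshold=3):
    """Return a copy of df without rows whose z-score is >= threshold
    in any numeric column (same one-sided test as in the question)."""
    # Pick numeric columns by dtype instead of comparing against
    # 'float64' / 'int64' by hand (assumption: all numeric columns count).
    numeric_cols = df.select_dtypes(include="number").columns

    for col in numeric_cols:
        col_mean = df[col].mean()
        col_std = df[col].std()

        # A constant (or single-row) column has no usable z-score; skip it.
        if pd.isna(col_std) or col_std == 0:
            continue

        # Vectorized z-score for the whole column, replacing the inner loop.
        z = (df[col] - col_mean) / col_std

        # Keep rows that are not outliers. Rows with NaN in this column
        # compare as False and are therefore kept.
        df = df[~(z >= threshold)]

    return df
```

You would call it as df = z_score_elimination(df). Like your original code, this only removes unusually high values and recomputes the mean and standard deviation on the rows that survived the previous columns. If you also want to drop unusually low values, compare z.abs() against the threshold instead.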