I am cleaning a dataset using the z-score with a threshold of 3. Below is the code that I am using. As you can see, I first calculate the mean and the standard deviation. The code then loops over every value, computes its z-score and, if the z-score meets or exceeds the threshold, treats the value as an outlier and adds it to the list "outlier". Finally, the values in the outlier list are removed from the dataset.

    """SD MonthlyIncome"""
    MonthlyIncome_std = df['MonthlyIncome'].std()
    MonthlyIncome_std

    """MEAN MonthlyIncome"""
    MonthlyIncome_mean = df['MonthlyIncome'].mean()
    MonthlyIncome_mean

    threshold = 3
    outlier = []
    for i in df['MonthlyIncome']:
        z = (i - MonthlyIncome_mean) / MonthlyIncome_std
        if z >= threshold:
            outlier.append(i)
    df = df[~df.MonthlyIncome.isin(outlier)]

The above code works fine, but I have to repeat it for every numerical column. I tried to write a function that does the same thing for every numerical column. Here is the function:

    for col in df.columns:
        if df[col].dtypes == 'float64' or df[col].dtypes == 'int64':
            threshold = 3
            outlier = []
            col_mean = col.mean()
            col_std = col.std()
            z = (i-col_mean)/col_std
            if z >= threshold:
                outlier.append(i)
            df = df[~df.col.isin(outlier)]

Calling the function raises this error:

    AttributeError                            Traceback (most recent call last)
    <ipython-input-62-4f8b1224061e> in <module>
    ----> 1 z_score_elimination(df)

    <ipython-input-61-dc3c84b60dd1> in z_score_elimination(df)
          4 threshold = 3
          5 outlier = []
    ----> 6 col_mean = col.mean()
          7 col_std = col.std()
          8 z = (i-col_mean)/col_std

    AttributeError: 'str' object has no attribute 'mean'

How can I fix the code?
Answer
You are iterating over the column names, which are strings, not over the columns themselves. Index the DataFrame with the name to get the actual column. Try

    df[col].mean()

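The same applies to `col.std()` and `df.col` in the function, and `i` is never defined because the function has no loop over the values. As a minimal sketch of how the whole function could look with those points addressed: the name `z_score_elimination` comes from the traceback, while the `threshold` parameter, the use of `select_dtypes`, and the vectorized comparison (in place of the `outlier` list and `isin`) are assumptions made for illustration, not the asker's original code.

```python
import pandas as pd


def z_score_elimination(df, threshold=3):
    """Return a copy of df without rows whose z-score is >= threshold
    in any float64/int64 column (assumed signature, for illustration)."""
    # Pick the same dtypes the question checks for; these are column NAMES.
    numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns
    for col in numeric_cols:
        # df[col] is the actual column (a Series), so .mean()/.std() work.
        col_mean = df[col].mean()
        col_std = df[col].std()
        # Vectorized z-score for every value in the column at once.
        z = (df[col] - col_mean) / col_std
        # One-sided test, as in the question; NaN z-scores compare as
        # False and are therefore kept.
        df = df[~(z >= threshold)]
    return df
```

It would be called as `df = z_score_elimination(df)`. As in the original loop, each column is filtered in turn, so later columns compute their mean and standard deviation on the rows that are still left. Note that `z >= threshold` only flags unusually high values; if unusually low values should be removed as well, comparing `z.abs() >= threshold` is one option.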