I’ve been trying to remove outliers from my dataframe using isolation forest, but I can’t figure out how. I’ve seen the examples for credit card fraud and Salary, but I don’t see how to apply them to each column, as my data consists of 3862900 rows and 19 columns. I’ve uploaded an image of the head of my dataframe. How do I apply isolation forest to each column and then permanently remove the outliers it finds?
Thank you.
Answer
According to the docs, IsolationForest is used for detecting outliers, not removing them:
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})
clf = IsolationForest().fit(df['temp'].values.reshape(-1, 1))
clf.predict([[4], [5], [3636]])
array([ 1, 1, -1])
As you can see from the output, 4 and 5 are not outliers but 3636 is.
If you want to remove outliers from your dataframe, you can filter on the quartiles (the values that bound the IQR):
quant = df['temp'].quantile([0.25, 0.75])
df['temp'][~df['temp'].clip(*quant).isin(quant)]
4     6
5     7
7     8
8     9
9    10
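As you can see, the outliers have been removed. Note that this keeps only the values strictly between the 25th and 75th percentiles, so ordinary values such as 1, 2, 3 and 11 are dropped along with 345 and 5345; widen the bounds if that is too aggressive.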
To apply the same filter to the whole dataframe, based on one column:
def IQR(df, colname, bounds = [.25, .75]):
    s = df[colname]
    q = s.quantile(bounds)
    return df[~s.clip(*q).isin(q)]
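A variation worth considering: the filter above keeps only values strictly between the two quantiles, so with the default bounds it discards roughly half of the column. A common alternative is Tukey's rule, which keeps values within 1.5 × IQR of the quartiles. The sketch below applies that rule to several columns at once. The function name `remove_outliers_iqr` and the `k` parameter are made up for illustration, and it assumes the selected columns are numeric.

```python
import pandas as pd

def remove_outliers_iqr(df, columns=None, k=1.5):
    """Drop rows where any selected column falls outside Tukey's fences.

    Hypothetical helper for illustration; rows with NaN in a selected
    column are also dropped, because NaN comparisons evaluate to False.
    """
    if columns is None:
        # Assumption: every numeric column should be checked.
        columns = df.select_dtypes(include="number").columns
    sub = df[columns]
    q1 = sub.quantile(0.25)
    q3 = sub.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # A row is kept only if every selected column lies inside its fences.
    mask = ((sub >= lower) & (sub <= upper)).all(axis=1)
    return df[mask]

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})
cleaned = remove_outliers_iqr(df)
print(cleaned['temp'].tolist())  # [1, 2, 3, 6, 7, 8, 9, 10, 11]
```

On the sample column the quartiles are 4.5 and 10.5, so the fences are -4.5 and 19.5 and only 345 and 5345 are removed. Because the quantiles are computed for all selected columns in one pass, the same call should scale to a wide dataframe without an explicit loop.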
Note: Isolation forest does not remove outliers from your dataset by itself; it only labels points as inliers (1) or outliers (-1), and is typically used to detect new outliers.
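If you still want to use the isolation forest labels for cleaning, one option is to filter the dataframe on them yourself. This is a minimal sketch, assuming scikit-learn is installed and that dropping every row labelled -1 is acceptable; `random_state=0` is only there to make the run repeatable, and the default `contamination='auto'` threshold may flag more or fewer rows than you expect on real data.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})

# fit_predict labels each training row: 1 = inlier, -1 = outlier.
clf = IsolationForest(random_state=0)
labels = clf.fit_predict(df[['temp']])

# Keep only the rows labelled as inliers.
cleaned = df[labels == 1]
print(cleaned)
```

To score rows on several features jointly, pass a list of numeric columns instead of `['temp']`; to treat each column separately, as in the question, the same three steps can be repeated per column.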