Isolation forest with multiple features detecting everything as an anomaly

I have an Isolation Forest implementation where I take the features (all numerical) and scale them to be between 0 and 1:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# df is the original DataFrame of numerical features
scaler = MinMaxScaler()
data = scaler.fit_transform(df)
x = pd.DataFrame(data)

Then call predict:

import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
clf = IsolationForest(max_samples=100, random_state=42).fit(x)
clf.predict(x)

In this instance, I have 23 numerical features.

When I run the script, it returns 1 for absolutely every result.

When I limit the feature set to 2 columns, it returns a mixture of 1 and -1.

How can I get around this?

Thanks

Answer

To sum up, Isolation Forest counts the number of splits required to isolate each sample. To build each tree, it randomly selects a feature and then randomly selects a split value between the minimum and maximum values of that feature.

The idea is that samples with shorter paths are more likely to be anomalies.
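To make that concrete, here is a minimal sketch on synthetic data (the data, the variable names, and the cluster positions are assumptions for illustration, not your dataset). It uses `score_samples`, which returns lower values for samples that are isolated after fewer splits:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 500 normal points around
# the origin plus 5 obvious outliers far away from them.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
X = np.vstack([normal, outliers])

forest = IsolationForest(max_samples=100, random_state=42).fit(X)

# Higher score = more normal; lower score = isolated in fewer splits.
scores = forest.score_samples(X)
normal_mean = scores[:500].mean()
outlier_mean = scores[500:].mean()
print(f"mean score, normal points:  {normal_mean:.3f}")
print(f"mean score, outlier points: {outlier_mean:.3f}")
```

The outliers should come out with a clearly lower mean score than the normal points, because on average they need fewer random splits to be separated from the rest.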

The problem you probably have is that several of your features are not useful for distinguishing anomalies, so the important features are hidden among the large number of uninformative ones. The two features you selected for the second model are probably quite explanatory.

If you train an IsolationForest model on only the most important features, the difference between the number of splits required to isolate a normal sample and the number required to isolate an anomaly will be larger, so classification will be easier. The best number of features differs from problem to problem.
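A rough way to see this effect is to compare the score gap between normal samples and anomalies with and without extra uninformative columns. The sketch below uses made-up data: 2 informative features plus 21 pure-noise features, chosen only to mirror the 23 features in the question. The helper `score_gap` is a hypothetical name introduced here:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
n_normal, n_anomalies = 500, 10

# Two informative features: anomalies sit far from the normal cluster.
informative = np.vstack([
    rng.normal(0.0, 1.0, size=(n_normal, 2)),
    rng.normal(6.0, 0.5, size=(n_anomalies, 2)),
])
# 21 uninformative features: same distribution for every sample.
noise = rng.uniform(0.0, 1.0, size=(n_normal + n_anomalies, 21))


def score_gap(X):
    """Mean score of normal samples minus mean score of anomalies."""
    forest = IsolationForest(max_samples=100, random_state=42).fit(X)
    scores = forest.score_samples(X)
    return scores[:n_normal].mean() - scores[n_normal:].mean()


gap_2 = score_gap(informative)
gap_23 = score_gap(np.hstack([informative, noise]))
print(f"gap with 2 informative features:        {gap_2:.3f}")
print(f"gap with 2 informative + 21 noise cols: {gap_23:.3f}")
```

With the noise columns added, most random splits land on a feature that says nothing about the anomalies, so the gap should shrink noticeably. Running the same comparison on subsets of your own columns is one way to check which features actually help.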

So, to solve your problem, try to select the best features based on an understanding of your actual problem. Also, try to fit the model only on normal samples, or at least make sure that the large majority of samples (around 90%) are normal. Otherwise, your model will learn that some anomalies are quite common and categorize them as normal. However, if you know which samples in your training data are anomalies, tune the contamination hyperparameter to match their share of the data.
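As a sketch of the contamination part, assuming a made-up training set where about 5% of the samples are anomalies: in scikit-learn, `contamination="auto"` (the default in recent versions) applies a fixed threshold to the scores, while a float sets the threshold so that roughly that fraction of the training data is flagged as -1.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical training set: 950 normal samples and 50 anomalies (5%).
X = np.vstack([
    rng.normal(0.0, 1.0, size=(950, 2)),
    rng.uniform(-8.0, 8.0, size=(50, 2)),
])

# Same forest settings, different thresholds on the anomaly score.
auto_forest = IsolationForest(max_samples=100, random_state=42).fit(X)
tuned_forest = IsolationForest(
    max_samples=100, contamination=0.05, random_state=42
).fit(X)

auto_flagged = int((auto_forest.predict(X) == -1).sum())
tuned_flagged = int((tuned_forest.predict(X) == -1).sum())
print(f"flagged with contamination='auto': {auto_flagged}")
print(f"flagged with contamination=0.05:   {tuned_flagged}")
```

With `contamination=0.05`, about 5% of the training samples are flagged regardless of how many features there are, so `predict` will no longer return 1 for everything. Whether the flagged samples are the right ones still depends on the features you feed in.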
