Scikit-Learn IsolationForest Model

IsolationForest is an unsupervised anomaly detection algorithm that isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. Anomalies tend to be isolated in fewer splits than normal points, so a short average path length across the trees marks an observation as a likely outlier.
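The isolation idea can be illustrated with a small NumPy sketch. This is a simplified illustration and not scikit-learn's implementation: the helper isolation_depth() is hypothetical, it follows a single point instead of building full trees, and it skips the subsampling that IsolationForest performs. It counts how many random splits are needed before a point is alone in its partition.

```python
import numpy as np

def isolation_depth(X, point, rng, max_depth=50):
    """Count the random splits needed before `point` is alone in its partition."""
    data = X
    depth = 0
    while len(data) > 1 and depth < max_depth:
        depth += 1
        feature = rng.integers(data.shape[1])      # pick a random feature
        lo, hi = data[:, feature].min(), data[:, feature].max()
        if lo == hi:
            continue                               # nothing to split on this feature
        split = rng.uniform(lo, hi)                # pick a random split value
        # keep only the side of the split that contains the point
        if point[feature] < split:
            data = data[data[:, feature] < split]
        else:
            data = data[data[:, feature] >= split]
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(size=(200, 2))                # dense cluster around the origin
outlier = np.array([6.0, 6.0])                     # one far-away point
X = np.vstack([cluster, outlier])
inlier = cluster[np.argmin(np.linalg.norm(cluster, axis=1))]  # most central point

trials = 200
outlier_depth = np.mean([isolation_depth(X, outlier, rng) for _ in range(trials)])
inlier_depth = np.mean([isolation_depth(X, inlier, rng) for _ in range(trials)])
print('Mean splits to isolate outlier: %.1f' % outlier_depth)
print('Mean splits to isolate inlier: %.1f' % inlier_depth)
```

The far-away point should need noticeably fewer splits on average than the central one, which is the signal IsolationForest turns into an anomaly score.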

The key hyperparameters of IsolationForest include n_estimators (the number of trees), max_samples (the number of samples drawn from X to train each base estimator), and contamination (the expected proportion of outliers in the dataset, which sets the threshold that predict() uses to label a point as an outlier).
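As a rough sketch of how contamination affects the result, the snippet below fits the same forest with three contamination values and counts how many points are labeled -1. The dataset size, the specific contamination values, and the settings n_estimators=50 and max_samples=64 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# a single cluster with no injected outliers
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=1.0, random_state=1)

flagged = {}
for contamination in (0.01, 0.05, 0.2):
    model = IsolationForest(n_estimators=50, max_samples=64,
                            contamination=contamination, random_state=1)
    labels = model.fit_predict(X)                  # -1 for outliers, 1 for inliers
    flagged[contamination] = int((labels == -1).sum())
    print('contamination=%.2f flagged=%d' % (contamination, flagged[contamination]))
```

Because the threshold is set from the requested proportion, a larger contamination value flags more points even when the data contains no real anomalies, so the value is best chosen from domain knowledge about the expected outlier rate.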

This algorithm is appropriate for anomaly detection problems, where the goal is to identify unusual data points.

from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score
import numpy as np

# generate synthetic dataset with outliers
X, _ = make_blobs(n_samples=100, centers=1, cluster_std=1.0, random_state=1)
X = np.concatenate([X, [[8, 8], [9, 9], [10, 10]]], axis=0)  # add outliers

# create model
model = IsolationForest(n_estimators=100, contamination=0.1, random_state=1)

# fit model
model.fit(X)

# predict anomalies
yhat = model.predict(X)

# evaluate model
# ground truth in the same encoding as predict(): the 100 blob points are
# inliers (1) and the 3 appended points are outliers (-1)
y_true = np.array([1] * 100 + [-1] * 3)
accuracy = accuracy_score(y_true, yhat)
print('Accuracy: %.3f' % accuracy)

# make a prediction
row = [[8, 8]]
yhat = model.predict(row)
print('Predicted: %d' % yhat[0])

Running the example gives an output like the following (the exact accuracy may vary slightly with the scikit-learn version):

Accuracy: 0.922
Predicted: -1

The steps are as follows:

First, a single cluster of 100 normal points is generated using make_blobs(), and three distant points are appended manually so that the dataset contains known anomalies.
Next, an IsolationForest model is instantiated with the hyperparameters n_estimators (number of trees), contamination (expected proportion of outliers), and random_state (for reproducibility). The model is then fit on the dataset using the fit() method.
Anomalies are predicted using the predict() method, which outputs -1 for outliers and 1 for inliers.
The model's performance is evaluated by comparing the predictions against ground-truth labels in the same -1/1 encoding: the 100 generated points are labeled 1 and the 3 appended points are labeled -1. Because contamination=0.1 flags roughly 10% of the 103 points while only 3 are true outliers, some inliers are flagged as well, so the accuracy stays below 1.0.
Finally, a prediction is made for a single data point to demonstrate the model's ability to identify outliers. The point [8, 8] is one of the manually added outliers, and predict() returns -1 for it.
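Accuracy is dominated by the majority inlier class, so it can help to look at the outlier class directly. The following sketch rebuilds the same dataset and, as one possible complement to the steps above, reports recall on the outlier label and inspects the continuous scores from decision_function(), where negative values correspond to predicted outliers. The variable names outlier_recall and lowest are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import recall_score

# same dataset as in the example: 100 inliers plus 3 injected outliers
X, _ = make_blobs(n_samples=100, centers=1, cluster_std=1.0, random_state=1)
X = np.concatenate([X, [[8, 8], [9, 9], [10, 10]]], axis=0)
y_true = np.array([1] * 100 + [-1] * 3)

model = IsolationForest(n_estimators=100, contamination=0.1, random_state=1)
yhat = model.fit_predict(X)

# fraction of the injected outliers that the model flagged as -1
outlier_recall = recall_score(y_true, yhat, pos_label=-1)
print('Outlier recall: %.3f' % outlier_recall)

# continuous scores: lower means more anomalous, negative means predicted outlier
scores = model.decision_function(X)
lowest = np.argsort(scores)[:3]                    # indices of the 3 most anomalous points
print('Most anomalous rows:', X[lowest])
```

Ranking by score instead of relying only on the -1/1 labels makes the result less sensitive to the chosen contamination value.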

See Also