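IsolationForest is an algorithm for anomaly detection that isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Because anomalies are few and lie far from the bulk of the data, they tend to be isolated in fewer random splits than normal points, so a short average path length across the trees signals an outlier.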
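To make the isolation idea concrete, here is a minimal NumPy sketch. It is not scikit-learn's implementation: the helper isolation_depth() is a hypothetical name introduced only for illustration, and the data is an assumed toy set of 100 Gaussian points plus one distant point. It counts how many random splits are needed before a chosen point is alone in its partition.

```python
import numpy as np

def isolation_depth(X, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition.

    Hypothetical helper for illustration; `target` must be a row of X.
    """
    subset = X
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        # pick a random feature, then a random split value within its range
        feature = rng.integers(subset.shape[1])
        lo, hi = subset[:, feature].min(), subset[:, feature].max()
        if lo == hi:
            break  # the chosen feature is constant here, stop splitting
        split = rng.uniform(lo, hi)
        # keep only the side of the split that contains the target
        if target[feature] < split:
            subset = subset[subset[:, feature] < split]
        else:
            subset = subset[subset[:, feature] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
outlier = np.array([[8.0, 8.0]])
X = np.vstack([inliers, outlier])

# average over many random split sequences, as a forest averages over trees
trials = 200
inlier_depth = np.mean([isolation_depth(X, X[0], rng) for _ in range(trials)])
outlier_depth = np.mean([isolation_depth(X, X[-1], rng) for _ in range(trials)])
print('Mean splits to isolate an inlier: %.2f' % inlier_depth)
print('Mean splits to isolate the outlier: %.2f' % outlier_depth)
```

The distant point should need noticeably fewer splits on average than a point inside the cluster, which is the signal the real algorithm turns into an anomaly score.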
The key hyperparameters of IsolationForest include n_estimators (number of trees), max_samples (number of samples to draw from X to train each base estimator), and contamination (proportion of outliers in the data set).
This algorithm is appropriate for anomaly detection problems, where the goal is to identify unusual data points.
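The effect of the contamination hyperparameter can be sketched as follows. This is an illustrative example on an assumed synthetic dataset (one Gaussian blob plus three distant points); the contamination values tried here are arbitrary choices. Because contamination sets the threshold on the anomaly scores, raising it should flag more points as outliers.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# one cluster of 100 points plus three distant points
X, _ = make_blobs(n_samples=100, centers=1, cluster_std=1.0, random_state=1)
X = np.concatenate([X, [[8, 8], [9, 9], [10, 10]]], axis=0)

flagged = {}
for contamination in (0.03, 0.1, 0.2):
    model = IsolationForest(n_estimators=100, contamination=contamination, random_state=1)
    # fit_predict returns -1 for points flagged as outliers and 1 for inliers
    labels = model.fit_predict(X)
    flagged[contamination] = int(np.sum(labels == -1))
    print('contamination=%.2f flags %d of %d points' % (contamination, flagged[contamination], len(X)))
```

If the true share of outliers is unknown, contamination is best treated as an assumption about the data rather than something the model learns.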
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import accuracy_score
import numpy as np
# generate synthetic dataset with outliers
X, _ = make_blobs(n_samples=100, centers=1, cluster_std=1.0, random_state=1)
X = np.concatenate([X, [[8, 8], [9, 9], [10, 10]]], axis=0) # add outliers
# create model
model = IsolationForest(n_estimators=100, contamination=0.1, random_state=1)
# fit model
model.fit(X)
# predict anomalies
yhat = model.predict(X)
# evaluate model
# ground truth: the 100 blob points are inliers (1), the 3 appended points are outliers (-1)
y_true = np.array([1] * 100 + [-1] * 3)
accuracy = accuracy_score(y_true, yhat)
print('Accuracy: %.3f' % accuracy)
# make a prediction
row = [[8, 8]]
yhat = model.predict(row)
print('Predicted: %d' % yhat[0])
Running the example gives an output like the following. The exact accuracy may vary with the scikit-learn version; it stays below 1.0 because contamination=0.1 flags roughly 10% of the 103 points while only 3 of them are true outliers.
Accuracy: 0.922
Predicted: -1
The steps are as follows:
First, a synthetic dataset with a clear separation between normal data points and outliers is generated using make_blobs(). Three outlier points are manually appended to the dataset to ensure the presence of anomalies.
Next, an IsolationForest model is instantiated with hyperparameters such as n_estimators (number of trees), contamination (proportion of outliers), and random_state (for reproducibility). The model is then fit on the dataset using the fit() method.
Anomalies are predicted using the predict() method, which outputs -1 for outliers and 1 for inliers.
The model’s performance is evaluated by comparing the predictions with ground-truth labels built from how the data was constructed: 1 for the 100 generated points and -1 for the 3 appended outliers. These labels use the same encoding as predict(), so accuracy_score() can compare them directly. Deriving the labels from the predictions themselves would not measure anything about the model.
Finally, a prediction is made for a new data point to demonstrate the model’s ability to identify outliers. The predict() method is used to determine whether the new data point is an outlier.
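When a hard -1/1 label is too coarse, the continuous score behind predict() can be inspected as well. The following is a minimal sketch, assuming the same kind of synthetic data as the example; the two query points (a distant point and the centre of the cluster) are chosen here purely for illustration. It uses decision_function(), where negative values correspond to predicted outliers.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# one cluster of 100 points plus three distant points
X, _ = make_blobs(n_samples=100, centers=1, cluster_std=1.0, random_state=1)
X = np.concatenate([X, [[8, 8], [9, 9], [10, 10]]], axis=0)
model = IsolationForest(n_estimators=100, contamination=0.1, random_state=1).fit(X)

# query a distant point and the centre of the generated cluster
new_points = np.array([[8.0, 8.0], X[:100].mean(axis=0)])
scores = model.decision_function(new_points)  # lower means more anomalous
labels = model.predict(new_points)            # -1 where the score is negative
for point, score, label in zip(new_points, scores, labels):
    print('point=%s score=%.3f label=%d' % (np.round(point, 2), score, label))
```

Ranking points by score makes it possible to review the most anomalous cases first instead of relying on a single contamination threshold.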