Anomaly detection is crucial for identifying unusual patterns that do not conform to expected behavior. SGDOneClassSVM is an incremental learning algorithm suitable for such tasks, particularly useful when dealing with large datasets. This example demonstrates how to use SGDOneClassSVM to detect anomalies.
The key hyperparameters of SGDOneClassSVM include:
- nu: An upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.
- max_iter: The maximum number of passes over the training data.
- tol: Tolerance for the stopping criterion.
The algorithm is appropriate for anomaly detection problems.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDOneClassSVM
from sklearn.metrics import accuracy_score
# generate inliers from a single cluster
X_inliers, _ = make_blobs(n_samples=100, centers=1, cluster_std=0.5, random_state=1)
# introduce some anomalies (seeded so the run is reproducible)
rng = np.random.RandomState(1)
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([X_inliers, X_outliers])
# ground truth for evaluation (1 for inliers, -1 for outliers)
y = np.concatenate([np.ones(100, dtype=int), -np.ones(20, dtype=int)])
# split into train and test sets, keeping labels aligned with the shuffled rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# create model
model = SGDOneClassSVM(max_iter=1000, tol=1e-3, random_state=1)
# fit model (unsupervised: the labels are not used for training)
model.fit(X_train)
# evaluate model
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
# make a prediction
row = [[0.5, 0.5]]
yhat = model.predict(row)
print('Predicted: %d' % yhat[0])
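Running the example prints the test accuracy and the predicted label for the new sample. The exact values depend on the random data and your scikit-learn version; the output looks like: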
Accuracy: 0.750
Predicted: -1
The steps are as follows:
- Generate a synthetic dataset using make_blobs(), introducing some anomalies by adding random points.
- Split the dataset into training and testing sets using train_test_split().
- Instantiate an SGDOneClassSVM model with max_iter, tol, and random_state hyperparameters.
- Fit the model on the training data with the fit() method.
- Evaluate the model’s performance by predicting the test set and comparing it to the ground truth, calculating the accuracy.
- Make a prediction on a new sample to check if it is considered an inlier or an outlier.
This example demonstrates how to use SGDOneClassSVM for anomaly detection in scikit-learn. Because the model is trained with stochastic gradient descent, it can also be updated incrementally with partial_fit(), which makes it practical for large datasets.