
Scikit-Learn LocalOutlierFactor Model

Local Outlier Factor (LOF) is an unsupervised algorithm for identifying outliers in a dataset. It compares the local density around each data point with the local density around its nearest neighbors; samples whose density is substantially lower than that of their neighbors are flagged as outliers.
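To make the density comparison concrete, the following is a minimal sketch that recomputes LOF scores by hand with NearestNeighbors and checks them against the model's negative_outlier_factor_ attribute. The variable names (k_dist, reach, lrd, lof) are illustrative, and the small constant added to the denominator is an assumption made here to guard against division by zero for duplicated points.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

k = 20
X, _ = make_blobs(n_samples=100, centers=1, cluster_std=0.5, random_state=42)

# k nearest neighbors of every sample; calling kneighbors() without
# arguments excludes each point from its own neighbor list
nn = NearestNeighbors(n_neighbors=k).fit(X)
dist, idx = nn.kneighbors()

# distance from each sample to its k-th nearest neighbor
k_dist = dist[:, -1]

# reachability distance from a sample to neighbor o: max(d(sample, o), k_dist(o))
reach = np.maximum(dist, k_dist[idx])

# local reachability density: inverse of the mean reachability distance
# (1e-10 is an assumed guard against division by zero)
lrd = 1.0 / (reach.mean(axis=1) + 1e-10)

# LOF: mean ratio of the neighbors' density to the sample's own density;
# values well above 1 indicate a sample in a sparser region than its neighbors
lof = (lrd[idx] / lrd[:, np.newaxis]).mean(axis=1)

# scikit-learn stores the negated score, so flip the sign before comparing
model = LocalOutlierFactor(n_neighbors=k).fit(X)
print(np.allclose(lof, -model.negative_outlier_factor_))
```

A score near 1 means a sample is about as dense as its neighborhood; the larger the score, the more isolated the sample.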

Key hyperparameters include n_neighbors (the number of neighbors used to compute the local density, 20 by default), algorithm (the method used for the nearest neighbors search: 'auto', 'ball_tree', 'kd_tree', or 'brute'), and contamination (the expected proportion of outliers in the dataset, either 'auto' or a float in the range (0, 0.5]).
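As a rough illustration of how contamination affects the result, the sketch below fits the same data twice, once with contamination='auto' and once with contamination=0.1, and compares the decision threshold (offset_) and the number of flagged samples. The dataset and the names auto_model and fixed_model are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# one tight cluster plus 10 uniformly scattered points
X, _ = make_blobs(n_samples=100, centers=1, cluster_std=0.5, random_state=42)
rng = np.random.RandomState(42)
X = np.vstack([X, rng.uniform(low=-4, high=4, size=(10, 2))])

# 'auto' uses a fixed threshold of -1.5 on negative_outlier_factor_
auto_model = LocalOutlierFactor(n_neighbors=20, contamination="auto")
auto_labels = auto_model.fit_predict(X)

# a float sets the threshold so that roughly that fraction of samples is flagged
fixed_model = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
fixed_labels = fixed_model.fit_predict(X)

print("auto:", auto_model.offset_, int((auto_labels == -1).sum()))
print("0.1 :", fixed_model.offset_, int((fixed_labels == -1).sum()))
```

With a float, the number of flagged samples is driven by the chosen fraction rather than by the data, so it is worth setting only when a reasonable estimate of the outlier rate is available.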

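The algorithm is suited to unsupervised anomaly detection, where the goal is to identify outliers within the training data itself. Scoring previously unseen samples (novelty detection) requires constructing the model with novelty=True.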
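For cases where new samples need to be scored against previously seen data, a minimal sketch of novelty detection follows. It assumes a clean training set centered at the origin and two hand-picked query points, none of which come from the main example; with novelty=True the model is fit with fit() and queried with predict() and decision_function().

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# assumed clean training data: a single cluster around the origin
X_train, _ = make_blobs(
    n_samples=200, centers=[[0.0, 0.0]], cluster_std=0.5, random_state=0
)

# hypothetical new samples: one near the cluster center, one far away
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])

# novelty=True enables predict() and decision_function() on unseen data
detector = LocalOutlierFactor(n_neighbors=20, novelty=True)
detector.fit(X_train)

labels = detector.predict(X_new)            # 1 = inlier, -1 = outlier
scores = detector.decision_function(X_new)  # negative values indicate outliers

for point, label, score in zip(X_new, labels, scores):
    print(point, label, round(float(score), 3))
```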

from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor
import numpy as np

# generate 2D synthetic dataset
X, _ = make_blobs(n_samples=100, centers=1, cluster_std=0.5, random_state=42)

# introduce some outliers
np.random.seed(42)
X_outliers = np.random.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([X, X_outliers])

# fit the model
model = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
yhat = model.fit_predict(X)

# identify outliers
mask = yhat == -1
outliers = X[mask]

print(f'Number of outliers detected: {len(outliers)}')
print('Outliers:\n', outliers)

Running the example gives an output like:

Number of outliers detected: 11
Outliers:
 [[-3.81907018  9.42523738]
 [-1.00367905  3.60571445]
 [ 1.85595153  0.78926787]
 [-2.75185088 -2.75204384]
 [-3.5353311   2.92940917]
 [ 0.80892009  1.66458062]
 [-3.83532405  3.75927882]
 [ 2.65954113 -2.30128711]
 [-2.54540026 -2.53276392]
 [-1.56606206  0.19805145]
 [-0.54443985 -1.67016688]]

The steps are as follows:

  1. First, a synthetic 2D dataset is generated using the make_blobs() function. This creates a single cluster (centers=1) of 100 samples (n_samples=100) with a standard deviation of 0.5 (cluster_std=0.5), using a fixed random seed (random_state) for reproducibility.

  2. Next, 10 outliers are introduced by drawing points uniformly from the range [-4, 4) in each dimension. These points are stacked onto the original data, giving a dataset of 110 samples that contains both normal samples and outliers.

  3. A LocalOutlierFactor model is instantiated with n_neighbors set to 20 and contamination set to 0.1. The model is fit on the dataset using the fit_predict() method, which returns a label for each sample: 1 for inliers and -1 for outliers.

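  4. Outliers are identified by selecting the samples whose predicted label is -1. Because contamination is 0.1, roughly 10% of the 110 samples are flagged, which gives 11 points here. The flagged set therefore includes one point from the original cluster (the first row of the output, whose second coordinate of 9.43 lies outside the range used for the injected outliers).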
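Beyond the binary labels, the fitted model also exposes a per-sample score. The sketch below rebuilds the dataset from the steps above and ranks samples by negative_outlier_factor_, where lower values are more anomalous. It uses a RandomState object rather than the global seed, and the names scores and order are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# same construction as in the steps above: one cluster plus 10 uniform points
X, _ = make_blobs(n_samples=100, centers=1, cluster_std=0.5, random_state=42)
rng = np.random.RandomState(42)
X = np.vstack([X, rng.uniform(low=-4, high=4, size=(10, 2))])

model = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
yhat = model.fit_predict(X)

# negated LOF score per sample: the lower the value, the more anomalous
scores = model.negative_outlier_factor_
order = np.argsort(scores)  # most anomalous samples first

# show the five most anomalous samples with their score and label
for i in order[:5]:
    print(i, X[i], round(float(scores[i]), 3), yhat[i])
```

Samples labeled -1 are those whose score falls below model.offset_, so the ranking makes it possible to review borderline cases instead of relying on the threshold alone.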

This example demonstrates how to configure LocalOutlierFactor and apply it to detect outliers in a dataset.


