Scikit-Learn DBSCAN Model

DBSCAN is a density-based clustering algorithm that identifies clusters based on the density of data points. It can find arbitrarily shaped clusters and is robust to outliers.

The key hyperparameters of DBSCAN are eps (the maximum distance between two samples for one to be considered in the neighborhood of the other) and min_samples (the number of samples in a point's neighborhood, including the point itself, required for it to be considered a core point).
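Choosing eps is often the hardest part of using DBSCAN. One common heuristic, sketched below, is to look at the distance from each point to its k-th nearest neighbor and pick eps near the point where those sorted distances start to rise sharply. This is only an illustration: the percentiles printed here are an assumption made for the sketch, not a rule from scikit-learn, and a plot of k_distances is usually more informative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# same synthetic data as in the main example
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)

min_samples = 5  # the value intended for DBSCAN

# kneighbors() with no argument excludes each point from its own neighbors,
# so ask for min_samples - 1 neighbors to match DBSCAN, which counts the point itself
nn = NearestNeighbors(n_neighbors=min_samples - 1).fit(X)
distances, _ = nn.kneighbors()

# distance from every point to its k-th nearest neighbor, sorted ascending
k_distances = np.sort(distances[:, -1])

# inspect a few percentiles as rough candidates for eps (illustrative choice)
for q in (50, 75, 90, 95):
    print('%dth percentile of k-distance: %.3f' % (q, np.percentile(k_distances, q)))
```

If most of the k-distances are well above the chosen eps, few points will qualify as core points and much of the dataset will be labeled as noise.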

The algorithm is appropriate for clustering tasks, especially when dealing with noisy data and identifying outliers.

from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# generate a synthetic 2D dataset for clustering
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)

# create model
model = DBSCAN(eps=0.3, min_samples=5)

# fit model
model.fit(X)

# assign a cluster to each example
yhat = model.labels_

# evaluate model
score = silhouette_score(X, yhat)
print('Silhouette Score: %.3f' % score)

# cluster new samples
# note: DBSCAN has no predict() method; fit_predict() refits the model on these points only
import numpy as np
new_samples = np.array([[0, 0], [3, 3], [-1, -1]])
yhat_new = model.fit_predict(new_samples)
print('Predicted clusters for new samples: %s' % yhat_new)

Running the example gives an output like:

Silhouette Score: 0.015
Predicted clusters for new samples: [-1 -1 -1]
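To help interpret a score like this, it can be useful to count how many clusters and noise points DBSCAN produced. The sketch below refits the same model and summarizes the labels. Since silhouette_score() treats the noise label -1 as just another cluster label, it also computes the score on non-noise points only, where that is possible. The variable names are chosen for this illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# same data and hyperparameters as in the main example
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)
labels = DBSCAN(eps=0.3, min_samples=5).fit(X).labels_

# noise points carry the label -1; everything else is a cluster id
noise_mask = labels == -1
n_noise = int(noise_mask.sum())
cluster_ids = np.unique(labels[~noise_mask])
n_clusters = len(cluster_ids)
print('Clusters: %d, noise points: %d of %d' % (n_clusters, n_noise, len(labels)))

# silhouette_score() needs at least 2 labels and fewer labels than samples
n_clustered = len(labels) - n_noise
if 2 <= n_clusters < n_clustered:
    score = silhouette_score(X[~noise_mask], labels[~noise_mask])
    print('Silhouette Score (noise excluded): %.3f' % score)
else:
    print('Too few clusters to compute a silhouette score without noise')
```

A large share of noise points usually suggests that eps is too small, or min_samples too large, for the scale of the data.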

The steps are as follows:

First, a synthetic 2D dataset is generated using the make_blobs() function. This creates a dataset with a specified number of samples (n_samples), centers (centers), features (n_features), and a fixed random seed (random_state) for reproducibility. Because clustering is unsupervised, the true labels returned by make_blobs() are discarded and the whole dataset is used for fitting.
Next, a DBSCAN model is instantiated with eps and min_samples hyperparameters. The model is then fit on the dataset using the fit() method.
The model assigns a cluster label to each data point, stored in model.labels_; points identified as noise receive the label -1.
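The performance of the clustering is evaluated using the silhouette score, which measures how similar each point is to its own cluster compared to other clusters. Scores range from -1 to 1, and higher is better. Note that silhouette_score() treats the noise label -1 as an ordinary cluster, which can lower the score when many points are labeled as noise.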
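Finally, fit_predict() is called on three new samples. DBSCAN has no predict() method, so this call refits the model on the new samples alone rather than assigning them to the clusters found earlier. With only three points and min_samples=5, no point can be a core point, so all three are labeled as noise (-1).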
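If the goal is to assign new points to the clusters already found, one possible workaround is to give each new point the label of its nearest core sample, provided that core sample lies within eps. The sketch below uses the fitted attributes components_, core_sample_indices_ and labels_. The assign_to_clusters() helper is hypothetical and not part of scikit-learn; it is an approximation for illustration, not the library's method.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# same data and hyperparameters as in the main example
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)
model = DBSCAN(eps=0.3, min_samples=5).fit(X)


def assign_to_clusters(model, new_samples):
    """Hypothetical helper (not part of scikit-learn): label each new sample
    with the cluster of its nearest core sample if within eps, else -1."""
    labels = np.full(len(new_samples), -1, dtype=int)
    if len(model.components_) == 0:
        return labels  # no core samples were found, so everything is noise
    # cluster label of every core sample, aligned with model.components_
    core_labels = model.labels_[model.core_sample_indices_]
    nn = NearestNeighbors(n_neighbors=1).fit(model.components_)
    dist, idx = nn.kneighbors(new_samples)
    within = dist[:, 0] <= model.eps
    labels[within] = core_labels[idx[within, 0]]
    return labels


new_samples = np.array([[0, 0], [3, 3], [-1, -1]])
print('Assigned clusters for new samples: %s' % assign_to_clusters(model, new_samples))
```

Unlike fit_predict(), this leaves the fitted model unchanged, so the labels of the original data stay the same.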

This example demonstrates how to use the DBSCAN algorithm to identify clusters in a dataset. Because scikit-learn's DBSCAN does not provide a predict() method, clustering new data means fitting the model again. DBSCAN handles noise and finds clusters of arbitrary shape, but because a single eps value applies to the whole dataset, it can struggle when clusters differ widely in density.

See Also