DBSCAN is a density-based clustering algorithm that identifies clusters based on the density of data points. It can find arbitrarily shaped clusters and is robust to outliers.
The key hyperparameters of DBSCAN include eps (the maximum distance between two samples for one to be considered as in the neighborhood of the other) and min_samples (the number of samples in a neighborhood for a point to be considered a core point).
The algorithm is appropriate for clustering tasks, especially when dealing with noisy data and identifying outliers.
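To make the roles of eps and min_samples concrete, the following is a minimal sketch that identifies core points by hand and compares the result with scikit-learn. The dataset and the values eps=0.3 and min_samples=5 are illustrative assumptions, and the brute-force distance computation is only meant for small datasets. It relies on the scikit-learn convention that a point counts as a member of its own neighborhood.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative data and settings (assumptions, chosen only for this sketch)
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)
eps, min_samples = 0.3, 5

# Pairwise Euclidean distances between all samples
diffs = X[:, None, :] - X[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=2))

# A point is a core point when at least min_samples points
# (the point itself included) lie within eps of it
neighbor_counts = (dists <= eps).sum(axis=1)
manual_core = np.flatnonzero(neighbor_counts >= min_samples)

# Compare with the core samples reported by scikit-learn
model = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
print('Core points found by hand: %d' % len(manual_core))
print('Core points found by DBSCAN: %d' % len(model.core_sample_indices_))
```

Both counts should agree. Raising eps or lowering min_samples turns more points into core points, which generally produces fewer noise points and larger clusters.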
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
# generate a synthetic 2D dataset with three blobs
X, _ = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)
# create model
model = DBSCAN(eps=0.3, min_samples=5)
# fit model
model.fit(X)
# assign a cluster to each example
yhat = model.labels_
# evaluate model (silhouette_score treats noise points, labeled -1, as a cluster of their own)
score = silhouette_score(X, yhat)
print('Silhouette Score: %.3f' % score)
# cluster new samples: DBSCAN has no predict() method, so fit_predict()
# refits the model on the new samples alone
import numpy as np
new_samples = np.array([[0, 0], [3, 3], [-1, -1]])
yhat_new = model.fit_predict(new_samples)
print('Predicted clusters for new samples: %s' % yhat_new)
Running the example gives an output like:
Silhouette Score: 0.015
Predicted clusters for new samples: [-1 -1 -1]
The steps are as follows:
First, a synthetic 2D dataset is generated using the make_blobs() function. This creates a dataset with a specified number of samples (n_samples), centers (centers), and features (n_features), with a fixed random seed (random_state) for reproducibility.
Next, a DBSCAN model is instantiated with the eps and min_samples hyperparameters. The model is then fit on the dataset using the fit() method.
The model assigns a cluster label to each data point, stored in model.labels_. Points that belong to no cluster are labeled -1 (noise).
The clustering is evaluated using the silhouette score, which measures how similar a point is to its own cluster compared to other clusters.
Finally, the new samples are clustered using the fit_predict() method. DBSCAN has no predict() method, so this call refits the model on the new samples alone rather than assigning them to the clusters found earlier. With only three samples and min_samples=5, no point can be a core point, so all three are labeled -1.
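Because scikit-learn's DBSCAN does not provide a predict() method, one common workaround is to give a new point the label of its nearest core sample when that sample lies within eps, and to treat it as noise otherwise. The sketch below illustrates that idea. The helper assign_to_clusters() is hypothetical (it is not part of scikit-learn), and the cluster_std=0.5 and eps=0.5 settings are assumptions chosen so that the blobs are dense enough to form clusters.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

def assign_to_clusters(model, new_points):
    """Hypothetical helper, not a scikit-learn API: label each new point
    with the cluster of its nearest core sample if that sample is within
    eps, otherwise label it -1 (noise)."""
    labels = np.full(len(new_points), -1, dtype=int)
    core_points = model.components_  # coordinates of the core samples
    if len(core_points) == 0:
        return labels
    core_labels = model.labels_[model.core_sample_indices_]
    # Distance from every new point to every core sample
    dists = np.linalg.norm(new_points[:, None, :] - core_points[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    within_eps = dists[np.arange(len(new_points)), nearest] <= model.eps
    labels[within_eps] = core_labels[nearest[within_eps]]
    return labels

# Illustrative data and settings (assumptions for this sketch)
X, _ = make_blobs(n_samples=100, centers=3, n_features=2,
                  cluster_std=0.5, random_state=1)
model = DBSCAN(eps=0.5, min_samples=5).fit(X)

new_samples = np.array([[0.0, 0.0], [3.0, 3.0], [-1.0, -1.0]])
print('Assigned clusters: %s' % assign_to_clusters(model, new_samples))
```

Unlike fit_predict(), this approach leaves the fitted model untouched, so the labels it returns refer to the clusters found on the original data.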
This example demonstrates how to use the DBSCAN algorithm to identify clusters in a dataset. DBSCAN handles noise and finds clusters of arbitrary shape, making it a versatile tool for clustering tasks in scikit-learn. Because eps is a single global threshold, it works best when the clusters have similar densities.
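As a rough illustration of the noise handling, the sketch below injects three far-away points into a blob dataset and counts the clusters and noise points that DBSCAN reports. The outlier coordinates and the cluster_std=0.5 and eps=0.5 settings are assumptions made for this example only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three tight blobs plus three far-away points injected as outliers
X, _ = make_blobs(n_samples=100, centers=3, n_features=2,
                  cluster_std=0.5, random_state=1)
outliers = np.array([[50.0, 50.0], [-50.0, 50.0], [50.0, -50.0]])
X_all = np.vstack([X, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_all)

# The label -1 marks noise, so it is excluded from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print('Clusters found: %d' % n_clusters)
print('Noise points: %d' % n_noise)
print('Labels of injected outliers: %s' % labels[-3:])
```

The injected points have no neighbors within eps, so they cannot be core points or border points and are labeled -1 instead of being forced into a cluster.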