
Scikit-Learn make_blobs() Dataset

Use make_blobs to generate a synthetic dataset of Gaussian blobs of points for testing clustering algorithms.

The make_blobs function generates isotropic Gaussian blobs suitable for clustering.

It is ideal for creating simple, controlled datasets to test clustering algorithms. Key arguments include n_samples to specify the number of data points, n_features for the number of features per sample, centers for the number of cluster centers or their fixed locations, and cluster_std for the standard deviation of the clusters.
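
As a minimal sketch of the centers and cluster_std arguments, the snippet below passes explicit center coordinates instead of a count, and a separate standard deviation for each cluster. The coordinates and standard deviations are arbitrary values chosen for illustration; when centers is given as an array, the number of features is taken from its shape, so n_features is omitted.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Illustrative (assumed) center locations: three clusters in 2D
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])

# One standard deviation per cluster: tight, medium, and wide blobs
X, y = make_blobs(
    n_samples=300,
    centers=centers,
    cluster_std=[0.5, 1.0, 2.0],
    random_state=0,
)

print(f"Dataset shape: {X.shape}")
print(f"Samples per cluster: {np.bincount(y)}")
```

With an integer n_samples, the points are divided evenly among the centers, so each of the three clusters here receives 100 samples.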

This dataset is for clustering problems, where algorithms like K-Means, DBSCAN, and Gaussian Mixture Models are often appropriate.

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate the dataset
X, y = make_blobs(n_samples=300, n_features=2, centers=4, cluster_std=0.60, random_state=0)

# Display the dataset shape and the first few rows
print(f"Dataset shape: {X.shape}")
print(f"First few rows of the dataset:\n{X[:5]}")

# Plot the dataset
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='viridis')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Generated Blobs Dataset')
plt.show()

# Confirm the shapes of the features (X) and the cluster labels (y)
print(f"Input shape: {X.shape}")
print(f"Output shape: {y.shape}")

Running the example gives an output like:

Dataset shape: (300, 2)
First few rows of the dataset:
[[ 0.83685684  2.13635938]
 [-1.4136581   7.40962324]
 [ 1.15521298  5.09961887]
 [-1.01861632  7.81491465]
 [ 1.27135141  1.89254207]]

Input shape: (300, 2)
Output shape: (300,)


  1. Import the make_blobs function from sklearn.datasets and matplotlib.pyplot for plotting:

    • This function generates synthetic data suitable for clustering.
  2. Generate the dataset using make_blobs():

    • Specify n_samples=300 to create 300 data points.
    • Use n_features=2 for a 2-dimensional dataset.
    • Set centers=4 to generate 4 clusters.
    • Set cluster_std=0.60 as the standard deviation of each cluster.
    • Use random_state=0 for reproducibility.
  3. Print the dataset shape and display the first few rows:

    • Show the shape using X.shape.
    • Print the first five rows of the dataset with X[:5] to inspect the data.
  4. Plot the dataset:

    • Use plt.scatter() to create a scatter plot of the generated data points, colored by their cluster label.
    • Label the axes and give the plot a title.
  5. Confirm the input and output elements:

    • make_blobs already returns the feature set X and the labels y separately, so no further split is needed; print their shapes to confirm.

This example demonstrates how to generate and visualize a synthetic dataset using make_blobs, which can be useful for testing and experimenting with various clustering algorithms.
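
As one possible way to use the generated data for such a test, the sketch below fits K-Means to the same dataset and compares the predicted clusters with the true labels from make_blobs. The choice of KMeans, the n_init=10 setting, and the adjusted Rand index as the score are assumptions made for illustration, not part of the original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Same dataset as in the example above
X, y = make_blobs(n_samples=300, n_features=2, centers=4, cluster_std=0.60, random_state=0)

# Fit K-Means with the known number of clusters
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# The adjusted Rand index ignores how clusters are numbered,
# so it can compare predicted labels against the true ones directly
ari = adjusted_rand_score(y, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

A score close to 1 indicates that the recovered clusters closely match the blobs that were generated; increasing cluster_std makes the blobs overlap more and would typically lower the score.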


