
Scikit-Learn make_circles() Dataset

Synthetic data is useful for testing and validating machine learning algorithms under controlled conditions. The make_circles() function in scikit-learn creates a two-dimensional binary classification dataset made up of a large circle that contains a smaller circle. Because the two classes cannot be separated by a straight line, the dataset is often used to evaluate classifiers on non-linear problems.

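Key function arguments include n_samples to specify the total number of samples, noise to set the standard deviation of the Gaussian noise added to the data, and factor to set the scale factor between the inner and outer circles (a value between 0 and 1, where a smaller value gives a smaller inner circle).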
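
To make the role of factor concrete, here is a minimal sketch that generates the data without noise and measures how far each point lies from the origin. The value factor=0.3 and the variable names are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles

# noise is left at its default (None), so every point lies exactly on a circle.
X, y = make_circles(n_samples=100, factor=0.3, random_state=0)

# Distance of each point from the origin.
radii = np.linalg.norm(X, axis=1)

outer_radius = radii[y == 0].mean()  # class 0 is the outer circle
inner_radius = radii[y == 1].mean()  # class 1 is the inner circle

print(f"Outer radius: {outer_radius:.2f}")  # 1.00
print(f"Inner radius: {inner_radius:.2f}")  # 0.30
```

The outer circle has radius 1 and the inner circle has radius equal to factor, so a factor close to 1 places the two classes close together.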

This is a binary classification problem where algorithms like Support Vector Machines with non-linear kernels, k-Nearest Neighbors, and neural networks are often applied.
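
As a rough illustration of why non-linear models are preferred here, the sketch below compares a linear classifier with an SVM that uses an RBF kernel. The choice of models, the sample size of 500, and 5-fold cross-validation are assumptions made for this example, and exact scores will vary with the settings.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)

# One linear model and one non-linear model, both with default settings.
models = {
    "Logistic regression (linear)": LogisticRegression(),
    "SVC (RBF kernel)": SVC(kernel="rbf"),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

The linear model should score close to chance, since no straight line separates a circle from the ring around it, while the RBF kernel can learn the circular boundary.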

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_circles

# Generate the dataset
X, y = make_circles(n_samples=100, noise=0.1, factor=0.5, random_state=42)

# Display dataset shape and types
print(f"Dataset shape: {X.shape}")
print(f"Feature types: {type(X)}, {type(y)}")

# Show summary statistics
print(f"Summary statistics:\nX mean: {np.mean(X, axis=0)}, X std: {np.std(X, axis=0)}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{X[:5]}\n{y[:5]}")

# Plot the dataset, one scatter call per class so each gets its own color and label
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], label='Class 0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], label='Class 1')
plt.legend()
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('make_circles() Dataset')
plt.show()

# Confirm the shapes of the input (X) and output (y) arrays
print(f"Input shape: {X.shape}")
print(f"Output shape: {y.shape}")

Running the example gives an output like:

Dataset shape: (100, 2)
Feature types: <class 'numpy.ndarray'>, <class 'numpy.ndarray'>
Summary statistics:
X mean: [ 0.00478926 -0.00396879], X std: [0.56309332 0.56890655]
First few rows of the dataset:
(the first five rows of X, followed by the first five labels in y)
Input shape: (100, 2)
Output shape: (100,)
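
The summary statistics above pool both classes together. As an optional extra check, the following sketch looks at each class separately, counting the samples per class and computing the mean distance from the origin. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=100, noise=0.1, factor=0.5, random_state=42)

# Count how many samples fall in each class.
labels, counts = np.unique(y, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))  # {0: 50, 1: 50}

# Mean distance from the origin per class.
radii = np.linalg.norm(X, axis=1)
mean_radius = {int(label): radii[y == label].mean() for label in labels}
for label, radius in mean_radius.items():
    print(f"Class {label} mean radius: {radius:.2f}")
```

The classes are balanced, and the mean radii should sit near 1.0 for class 0 (the outer circle) and near 0.5 for class 1 (the inner circle), shifted slightly by the noise.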


  1. Import the make_circles function from sklearn.datasets, matplotlib.pyplot for plotting, and numpy for summary statistics:

    • make_circles is used to create synthetic circular data.
    • matplotlib.pyplot is used to visualize the data.
    • numpy is used to compute the mean and standard deviation.
  2. Generate the dataset using make_circles():

    • Use n_samples=100 to create 100 samples.
    • Add noise=0.1 to set the standard deviation of the Gaussian noise added to the points.
    • Set factor=0.5 so that the inner circle has half the radius of the outer circle.
    • Use random_state=42 for reproducibility.
  3. Print the dataset shape and feature types:

    • Access the shape using X.shape.
    • Show the data types of the features and target using type(X) and type(y).
  4. Display summary statistics:

    • Calculate and print the mean and standard deviation of X using np.mean(X, axis=0) and np.std(X, axis=0).
  5. Display the first few rows of the dataset:

    • Print the initial rows using X[:5] and y[:5] to get a sense of the dataset structure and content.
  6. Plot the dataset:

    • Use plt.scatter() to plot the data points, distinguishing classes by color and labels.
  7. Check the input and output elements:

    • make_circles() already returns the inputs X and the targets y as separate arrays; print their shapes to confirm.

This example demonstrates how to use scikit-learn’s make_circles() function to generate and explore a synthetic dataset, visualize its structure, and prepare it for machine learning tasks.
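
As one possible next step, the sketch below holds out part of the data and fits a k-Nearest Neighbors classifier. The 70/30 split, stratification, and n_neighbors=5 are assumptions made for this illustration, not recommendations.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_circles(n_samples=100, noise=0.1, factor=0.5, random_state=42)

# Hold out 30% of the samples for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit a k-Nearest Neighbors classifier on the training portion.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Accuracy on the held-out samples.
accuracy = knn.score(X_test, y_test)
print(f"Train shape: {X_train.shape}, test shape: {X_test.shape}")
print(f"Test accuracy: {accuracy:.2f}")
```

With only 30 test samples the accuracy estimate is coarse, so a larger n_samples gives a more reliable picture.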


