Scikit-Learn make_hastie_10_2() Dataset

The make_hastie_10_2() function generates a synthetic dataset for binary classification tasks, commonly used to evaluate and compare classification algorithms.

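By default, make_hastie_10_2() generates 12,000 samples with 10 features each. The features are drawn from a standard Gaussian distribution, and the target is +1 when the sum of the squared features exceeds 9.34 and -1 otherwise. This example generates 11,000 samples, using the first 10,000 for training and the remaining 1,000 for testing. The key arguments are n_samples, which sets the number of samples, and random_state, which makes the output reproducible.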
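As a quick sanity check, the labels can be recomputed from the features. The sketch below assumes the rule given in the scikit-learn documentation (+1 when the sum of squared features exceeds 9.34, otherwise -1); the names squared_norm, y_rule, and agreement are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(n_samples=11000, random_state=42)

# Assumed labeling rule (from the scikit-learn docs):
# +1 when the sum of squared features exceeds 9.34, else -1
squared_norm = (X ** 2).sum(axis=1)
y_rule = np.where(squared_norm > 9.34, 1.0, -1.0)

# Fraction of samples whose label matches the recomputed rule
agreement = np.mean(y == y_rule)
print(f"Agreement with the documented rule: {agreement:.4f}")
```

Because the decision boundary is a sphere in feature space, linear classifiers tend to struggle on this dataset, which is part of why it is used to compare nonlinear methods.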

This example demonstrates generating the dataset, exploring its shape and structure, and splitting it for model training.

import pandas as pd
from sklearn.datasets import make_hastie_10_2

# Generate the dataset
X, y = make_hastie_10_2(n_samples=11000, random_state=42)

# Display dataset shape and types
print(f"Dataset shape: {X.shape}")
print(f"Feature types: {X.dtype}, Target type: {y.dtype}")

# Show summary statistics
df = pd.DataFrame(X)
print(f"Summary statistics:\n{df.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{df.head()}")

# Split the dataset into training and testing sets
X_train, X_test = X[:10000], X[10000:]
y_train, y_test = y[:10000], y[10000:]
print(f"Training set shape: {X_train.shape}, {y_train.shape}")
print(f"Testing set shape: {X_test.shape}, {y_test.shape}")

Running the example gives an output like:

Dataset shape: (11000, 10)
Feature types: float64, Target type: float64
Summary statistics:
                  0             1  ...             8             9
count  11000.000000  11000.000000  ...  11000.000000  11000.000000
mean       0.007163     -0.000326  ...      0.006973     -0.001874
std        1.021626      0.989210  ...      0.997264      0.991970
min       -4.295391     -4.465604  ...     -3.419906     -4.157734
25%       -0.679656     -0.665051  ...     -0.668038     -0.673852
50%        0.014879     -0.018306  ...      0.007108      0.008414
75%        0.691326      0.665909  ...      0.679022      0.678536
max        3.727833      3.942331  ...      3.605591      3.852731

[8 rows x 10 columns]
First few rows of the dataset:
          0         1         2  ...         7         8         9
0  0.496714 -0.138264  0.647689  ...  0.767435 -0.469474  0.542560
1 -0.463418 -0.465730  0.241962  ...  0.314247 -0.908024 -1.412304
2  1.465649 -0.225776  0.067528  ...  0.375698 -0.600639 -0.291694
3 -0.601707  1.852278 -0.013497  ... -1.959670 -1.328186  0.196861
4  0.738467  0.171368 -0.115648  ...  1.057122  0.343618 -1.763040

[5 rows x 10 columns]
Training set shape: (10000, 10), (10000,)
Testing set shape: (1000, 10), (1000,)
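The output above covers only the features. A minimal sketch for inspecting the target as well, assuming the same n_samples and random_state as in the example, counts how many samples fall into each class:

```python
import numpy as np
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(n_samples=11000, random_state=42)

# Count the samples in each class (labels are -1.0 and +1.0)
labels, counts = np.unique(y, return_counts=True)
for label, count in zip(labels, counts):
    print(f"Label {label:+.0f}: {count} samples ({count / len(y):.1%})")
```

The two classes should come out roughly balanced, since the 9.34 threshold sits close to the median of the sum of ten squared standard normal values.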

The steps are as follows:

  1. Import the make_hastie_10_2 function from sklearn.datasets:

    • This function allows us to generate the synthetic Hastie dataset for binary classification.
  2. Generate the dataset using make_hastie_10_2():

    • Use n_samples=11000 to create a dataset with 11,000 samples.
    • Set random_state=42 to ensure reproducibility.
  3. Print the dataset shape and feature types:

    • Access the shape using X.shape.
    • Show the data types of the features and target using X.dtype and y.dtype.
  4. Display summary statistics:

    • Use pd.DataFrame(X).describe() to get a statistical summary of the dataset.
  5. Display the first few rows of the dataset:

    • Print the initial rows using pd.DataFrame(X).head() to get a sense of the dataset structure and content.
  6. Split the dataset into training and testing sets:

    • Separate the first 10,000 samples for training (X_train, y_train) and the remaining 1,000 for testing (X_test, y_test).
    • Print the shapes of X_train, y_train, X_test, and y_test to confirm the split.
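The manual slice in step 6 is fine here because the generated samples are independent and unordered. As an alternative sketch (not what the example above does), scikit-learn's train_test_split can produce a split of the same size while shuffling the data and keeping the class proportions equal in both sets:

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import train_test_split

X, y = make_hastie_10_2(n_samples=11000, random_state=42)

# An integer test_size is read as an absolute number of test samples;
# stratify=y preserves the class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1000, stratify=y, random_state=42
)

print(f"Training set shape: {X_train.shape}, {y_train.shape}")
print(f"Testing set shape: {X_test.shape}, {y_test.shape}")
```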

This example shows how to generate and explore the Hastie synthetic dataset with scikit-learn’s make_hastie_10_2() function: inspecting its shape, data types, and summary statistics, then splitting it into training and testing sets. From here, you can apply further preprocessing and train classification algorithms.
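As one possible next step, the sketch below fits a classifier on the training split and scores it on the test split. The choice of GradientBoostingClassifier with its default settings is an assumption made for illustration, not something the example above prescribes; any scikit-learn classifier could be substituted.

```python
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

X, y = make_hastie_10_2(n_samples=11000, random_state=42)

# Same split as in the example: first 10,000 for training, last 1,000 for testing
X_train, X_test = X[:10000], X[10000:]
y_train, y_test = y[:10000], y[10000:]

# Illustrative model choice (an assumption, not part of the original example)
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```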
