Scikit-Learn make_sparse_uncorrelated() Dataset

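The make_sparse_uncorrelated() function generates a synthetic regression dataset with a specified number of samples and features. The features are drawn independently from a standard normal distribution, so they are uncorrelated, and the target depends on only the first four of them.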

This dataset is useful for testing and benchmarking regression and feature selection methods in scenarios where the features are independent and only a few of them influence the target. Key function arguments are n_samples for the number of samples (default 100), n_features for the number of features (default 10), and random_state for reproducibility.
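
As a small illustration of these arguments, the sketch below calls the function twice with the same random_state and once with a different one. The sizes (50 samples, 6 features) and the variable names are arbitrary choices for this illustration, not values from the example that follows.

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated

# Same seed twice, then a different seed (sizes are illustrative)
X_a, y_a = make_sparse_uncorrelated(n_samples=50, n_features=6, random_state=0)
X_b, y_b = make_sparse_uncorrelated(n_samples=50, n_features=6, random_state=0)
X_c, y_c = make_sparse_uncorrelated(n_samples=50, n_features=6, random_state=1)

# Identical seeds should reproduce the same arrays exactly
same_seed_equal = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)

# A different seed should give different feature values
different_seed_equal = np.array_equal(X_a, X_c)

print(f"X shape: {X_a.shape}, y shape: {y_a.shape}")
print(f"Same seed reproduces data: {same_seed_equal}")
print(f"Different seed reproduces data: {different_seed_equal}")
```

Fixing random_state is what makes results comparable across runs; leaving it as None draws a fresh dataset each time.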

This is a regression problem with a continuous target, so algorithms such as Linear Regression, Lasso, and Ridge Regression can be applied.
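
The scikit-learn documentation describes the target as a linear combination of the first four features, y = X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] - 1.5 * X[:, 3], with Gaussian noise added, while the remaining features carry no signal. A minimal sketch of how one might check this is to fit ordinary least squares and compare the learned coefficients with those values. The larger sample size (n_samples=2000) is an assumption chosen here so the estimates are stable; it is not part of the example below.

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated
from sklearn.linear_model import LinearRegression

# Larger sample than the main example so coefficient estimates are stable
X, y = make_sparse_uncorrelated(n_samples=2000, n_features=10, random_state=42)

# Coefficients stated in the scikit-learn docstring for the first four features
true_coef = np.array([1.0, 2.0, -2.0, -1.5])

# Fit ordinary least squares on all ten features
model = LinearRegression()
model.fit(X, y)

print("Informative features (0-3):", model.coef_[:4].round(2))
print("Remaining features (4-9):  ", model.coef_[4:].round(2))
```

The first four estimates should land close to the documented values and the rest close to zero, which is the sense in which the problem is sparse.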

from sklearn.datasets import make_sparse_uncorrelated
import pandas as pd

# Generate the dataset
X, y = make_sparse_uncorrelated(n_samples=100, n_features=10, random_state=42)

# Convert to DataFrame for easier manipulation
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y_df = pd.DataFrame(y, columns=['target'])

# Display dataset shape and types
print(f"Dataset shape: {X_df.shape}")
print(f"Feature types:\n{X_df.dtypes}")

# Show summary statistics
print(f"Summary statistics:\n{X_df.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{X_df.head()}")

# Confirm the shapes of the input (X) and output (y) arrays
print(f"Input shape: {X.shape}")
print(f"Output shape: {y.shape}")

Running the example gives an output like:

Dataset shape: (100, 10)
Feature types:
feature_0    float64
feature_1    float64
feature_2    float64
feature_3    float64
feature_4    float64
feature_5    float64
feature_6    float64
feature_7    float64
feature_8    float64
feature_9    float64
dtype: object
Summary statistics:
        feature_0   feature_1   feature_2  ...   feature_7   feature_8   feature_9
count  100.000000  100.000000  100.000000  ...  100.000000  100.000000  100.000000
mean     0.003618    0.002578   -0.048511  ...    0.038018    0.031172    0.058295
std      0.894297    1.047359    1.026363  ...    0.904248    0.921830    1.022648
min     -1.918771   -2.301921   -3.241267  ...   -2.423879   -2.650970   -1.987569
25%     -0.574156   -0.710363   -0.700307  ...   -0.572616   -0.532816   -0.628305
50%      0.079078   -0.052911   -0.023125  ...    0.094496    0.064005   -0.014331
75%      0.532233    0.804413    0.636509  ...    0.601449    0.552765    0.697232
max      2.526932    2.075401    2.560085  ...    2.455300    3.078881    3.852731

[8 rows x 10 columns]
First few rows of the dataset:
   feature_0  feature_1  feature_2  ...  feature_7  feature_8  feature_9
0   0.496714  -0.138264   0.647689  ...   0.767435  -0.469474   0.542560
1  -0.463418  -0.465730   0.241962  ...   0.314247  -0.908024  -1.412304
2   1.465649  -0.225776   0.067528  ...   0.375698  -0.600639  -0.291694
3  -0.601707   1.852278  -0.013497  ...  -1.959670  -1.328186   0.196861
4   0.738467   0.171368  -0.115648  ...   1.057122   0.343618  -1.763040

[5 rows x 10 columns]
Input shape: (100, 10)
Output shape: (100,)
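
The summary statistics above show each feature with a mean near 0 and a standard deviation near 1. To check the uncorrelated part of the function name, one option is to inspect the sample correlation matrix of the features. The sketch below uses a larger sample (n_samples=5000, an arbitrary choice for this illustration) because with only 100 rows the sample correlations can drift noticeably from zero by chance.

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated

# Larger sample so that chance correlations are small
X, y = make_sparse_uncorrelated(n_samples=5000, n_features=10, random_state=42)

# Correlation matrix between columns (features), not rows
corr = np.corrcoef(X, rowvar=False)

# Ignore the diagonal, which is always 1, and find the largest off-diagonal value
off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
max_abs_corr = np.abs(corr[off_diagonal]).max()

print(f"Correlation matrix shape: {corr.shape}")
print(f"Largest absolute off-diagonal correlation: {max_abs_corr:.3f}")
```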

The steps are as follows:

Import the make_sparse_uncorrelated function from sklearn.datasets:
- This function generates a synthetic dataset with specified properties for testing machine learning models.
Generate the dataset using make_sparse_uncorrelated():
- Specify n_samples=100 for the number of samples and n_features=10 for the number of features.
- Use random_state=42 for reproducibility.
Convert the dataset to a pandas DataFrame:
- This allows for easier manipulation and analysis of the dataset.
Print the dataset shape and feature types:
- Access the shape using X_df.shape.
- Show the data types of the features using X_df.dtypes.
Display summary statistics:
- Use X_df.describe() to get a statistical summary of the dataset.
Display the first few rows of the dataset:
- Print the initial rows using X_df.head() to get a sense of the dataset structure and content.
Confirm the input and output elements:
- The function already returns the inputs X and the target y as separate arrays; print their shapes to confirm they have the same number of rows.
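
A natural next step after these checks is to hold out part of the data and fit a model. The following is a minimal sketch, not part of the example above: the sample size (1000), the 75/25 split, the choice of Lasso, and alpha=0.1 are all assumptions made for illustration.

```python
from sklearn.datasets import make_sparse_uncorrelated
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Generate a larger dataset than in the walkthrough (illustrative size)
X, y = make_sparse_uncorrelated(n_samples=1000, n_features=10, random_state=42)

# Hold out 25% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Lasso applies an L1 penalty, which tends to shrink uninformative coefficients
model = Lasso(alpha=0.1)
model.fit(X_train, y_train)

# Score on the held-out rows only
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"Train shape: {X_train.shape}, test shape: {X_test.shape}")
print(f"Test R^2: {test_r2:.3f}")
print(f"Coefficients: {model.coef_.round(2)}")
```

An L1-penalized model is one reasonable fit for this data because only a few features matter, but any regressor could be substituted here.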

This example demonstrates how to generate and explore a synthetic dataset using scikit-learn’s make_sparse_uncorrelated function, allowing you to inspect the data’s shape, types, summary statistics, and first few rows. This sets the stage for further preprocessing and application of regression algorithms.
