The make_sparse_uncorrelated function generates a synthetic dataset with a specified number of samples and features, where the features are drawn independently and are therefore uncorrelated.
This dataset is useful for testing and benchmarking machine learning algorithms in scenarios where feature independence is assumed. Key function arguments include n_samples to specify the number of samples, n_features for the number of features, and random_state for reproducibility.
This is a regression problem: the target is a continuous value that depends only on the first four features, so algorithms such as Linear Regression, Lasso, Decision Tree regressors, and Support Vector Regression can be applied.
from sklearn.datasets import make_sparse_uncorrelated
import pandas as pd
# Generate the dataset
X, y = make_sparse_uncorrelated(n_samples=100, n_features=10, random_state=42)
# Convert to DataFrame for easier manipulation
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y_df = pd.DataFrame(y, columns=['target'])
# Display dataset shape and types
print(f"Dataset shape: {X_df.shape}")
print(f"Feature types:\n{X_df.dtypes}")
# Show summary statistics
print(f"Summary statistics:\n{X_df.describe()}")
# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{X_df.head()}")
# Confirm the shapes of the input (X) and output (y) arrays
print(f"Input shape: {X.shape}")
print(f"Output shape: {y.shape}")
Running the example gives an output like:
Dataset shape: (100, 10)
Feature types:
feature_0    float64
feature_1    float64
feature_2    float64
feature_3    float64
feature_4    float64
feature_5    float64
feature_6    float64
feature_7    float64
feature_8    float64
feature_9    float64
dtype: object
Summary statistics:
        feature_0   feature_1   feature_2  ...   feature_7   feature_8   feature_9
count  100.000000  100.000000  100.000000  ...  100.000000  100.000000  100.000000
mean     0.003618    0.002578   -0.048511  ...    0.038018    0.031172    0.058295
std      0.894297    1.047359    1.026363  ...    0.904248    0.921830    1.022648
min     -1.918771   -2.301921   -3.241267  ...   -2.423879   -2.650970   -1.987569
25%     -0.574156   -0.710363   -0.700307  ...   -0.572616   -0.532816   -0.628305
50%      0.079078   -0.052911   -0.023125  ...    0.094496    0.064005   -0.014331
75%      0.532233    0.804413    0.636509  ...    0.601449    0.552765    0.697232
max      2.526932    2.075401    2.560085  ...    2.455300    3.078881    3.852731
[8 rows x 10 columns]
First few rows of the dataset:
   feature_0  feature_1  feature_2  ...  feature_7  feature_8  feature_9
0   0.496714  -0.138264   0.647689  ...   0.767435  -0.469474   0.542560
1  -0.463418  -0.465730   0.241962  ...   0.314247  -0.908024  -1.412304
2   1.465649  -0.225776   0.067528  ...   0.375698  -0.600639  -0.291694
3  -0.601707   1.852278  -0.013497  ...  -1.959670  -1.328186   0.196861
4   0.738467   0.171368  -0.115648  ...   1.057122   0.343618  -1.763040
[5 rows x 10 columns]
Input shape: (100, 10)
Output shape: (100,)
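The summary statistics show each feature with a mean near 0 and a standard deviation near 1, but they do not show whether the features are uncorrelated. A minimal sketch of one way to check this is given here, assuming a larger sample (5000 rows, chosen only for illustration) so that the sample correlations are less noisy:

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated

# A larger sample than in the example above (an illustrative choice),
# so that the estimated correlations are less noisy
X, y = make_sparse_uncorrelated(n_samples=5000, n_features=10, random_state=42)

# Pairwise Pearson correlations between the feature columns
corr = np.corrcoef(X, rowvar=False)

# Keep only the entries that compare two different features
off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
max_abs_corr = np.abs(off_diagonal).max()

print(f"Correlation matrix shape: {corr.shape}")
print(f"Largest absolute off-diagonal correlation: {max_abs_corr:.3f}")
```

With independent features, the off-diagonal correlations should be close to zero and shrink further as n_samples grows.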
The steps are as follows:

Import the make_sparse_uncorrelated function from sklearn.datasets:
- This function generates a synthetic dataset with specified properties for testing machine learning models.

Generate the dataset using make_sparse_uncorrelated():
- Specify n_samples=100 for the number of samples and n_features=10 for the number of features.
- Use random_state=42 for reproducibility.

Convert the dataset to a pandas DataFrame:
- This allows for easier manipulation and analysis of the dataset.

Print the dataset shape and feature types:
- Access the shape using X_df.shape.
- Show the data types of the features using X_df.dtypes.

Display summary statistics:
- Use X_df.describe() to get a statistical summary of the dataset.

Display the first few rows of the dataset:
- Print the initial rows using X_df.head() to get a sense of the dataset structure and content.

Confirm the input and output elements:
- Print the shapes of X and y to confirm that the function returned matching input and output arrays.
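To illustrate what random_state=42 provides in the generation step, here is a small sketch that calls the function twice with the same seed and once with a different one. The second seed (0) is an arbitrary choice for the comparison.

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated

# Two calls with the same seed, one call with a different seed
X_a, y_a = make_sparse_uncorrelated(n_samples=100, n_features=10, random_state=42)
X_b, y_b = make_sparse_uncorrelated(n_samples=100, n_features=10, random_state=42)
X_c, y_c = make_sparse_uncorrelated(n_samples=100, n_features=10, random_state=0)

# Identical seeds should reproduce the same arrays exactly
same_seed_matches = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)

# A different seed should give different data
different_seed_matches = np.array_equal(X_a, X_c)

print(f"Same seed gives identical data: {same_seed_matches}")
print(f"Different seed gives identical data: {different_seed_matches}")
```

Fixing the seed is what makes the printed statistics repeatable from run to run.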
This example demonstrates how to generate and explore a synthetic dataset using scikit-learn’s make_sparse_uncorrelated function, allowing you to inspect the data’s shape, types, and summary statistics. This sets the stage for further preprocessing and the application of regression algorithms.
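As one possible next step, the sketch below fits a model to the generated data. Because the target returned by make_sparse_uncorrelated is continuous, it uses LinearRegression. The scikit-learn documentation describes the target as X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] - 1.5 * X[:, 3] plus noise, so the fitted coefficients should land near 1, 2, -2 and -1.5 for the first four features and near 0 for the rest. The sample size, the split ratio and the choice of model are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_sparse_uncorrelated
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# A larger sample (illustrative choice) gives more stable coefficient estimates
X, y = make_sparse_uncorrelated(n_samples=2000, n_features=10, random_state=42)

# Hold out a quarter of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit an ordinary least squares model on the training rows
model = LinearRegression().fit(X_train, y_train)
coefs = model.coef_

# R^2 on the held-out rows
r2 = model.score(X_test, y_test)

for i, coef in enumerate(coefs):
    print(f"feature_{i}: {coef:+.3f}")
print(f"Test R^2: {r2:.3f}")
```

A sparse linear model such as Lasso would be a natural alternative here, since it can shrink the coefficients of the uninformative features to exactly zero.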