Scikit-Learn make_regression() Dataset

Need a synthetic regression dataset for testing and experimentation? Use make_regression from scikit-learn. The function generates a random linear regression problem with a specified number of samples, features, noise level, and other parameters.

Key function arguments include n_samples, which sets the number of samples (a few hundred is typical for quick testing); n_features, which sets the number of features (often 10 to 50, depending on the desired complexity); and noise, which sets the standard deviation of the Gaussian noise added to the output. The result is a regression problem to which algorithms such as Linear Regression, Ridge Regression, Lasso Regression, and Decision Trees are often applied.
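To make the effect of the noise argument concrete, the following is a minimal sketch rather than part of the original example. It assumes the optional coef=True argument, which makes make_regression also return the true coefficients, and it uses a hypothetical helper named residual_std introduced only for illustration. Subtracting the noise-free linear signal from the target should leave residuals whose spread roughly matches the requested noise value.

```python
import numpy as np
from sklearn.datasets import make_regression


def residual_std(noise):
    """Hypothetical helper: spread of y around the noise-free linear signal."""
    # coef=True additionally returns the true coefficients used to build y
    X, y, coef = make_regression(
        n_samples=200, n_features=10, noise=noise, coef=True, random_state=42
    )
    # With the default bias=0.0, y equals X @ coef plus Gaussian noise
    return float(np.std(y - X @ coef))


# Compare a few noise levels (values chosen only for illustration)
for noise in (0.0, 1.0, 10.0):
    print(f"noise={noise}: residual std = {residual_std(noise):.3f}")
```

The residual spread should track the noise argument closely, and it should be essentially zero when noise=0.0. The full example that follows uses a much smaller noise=0.1, so its target is almost an exact linear function of the features.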

from sklearn.datasets import make_regression
import pandas as pd

# Generate the dataset
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=42)

# Convert to pandas DataFrame for easier manipulation
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y_df = pd.DataFrame(y, columns=['target'])

# Display dataset shape and types
print(f"Dataset shape: {X_df.shape}, {y_df.shape}")
print(f"Feature types:\n{X_df.dtypes}\n{y_df.dtypes}")

# Show summary statistics
print(f"Summary statistics for features:\n{X_df.describe()}")
print(f"Summary statistics for target:\n{y_df.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the features:\n{X_df.head()}")
print(f"First few rows of the target:\n{y_df.head()}")

Running the example gives an output like:

Dataset shape: (200, 10), (200, 1)
Feature types:
feature_0    float64
feature_1    float64
feature_2    float64
feature_3    float64
feature_4    float64
feature_5    float64
feature_6    float64
feature_7    float64
feature_8    float64
feature_9    float64
dtype: object
target    float64
dtype: object
Summary statistics for features:
        feature_0   feature_1   feature_2  ...   feature_7   feature_8   feature_9
count  200.000000  200.000000  200.000000  ...  200.000000  200.000000  200.000000
mean     0.152737    0.006138   -0.048859  ...    0.114879    0.053059    0.038937
std      1.006109    0.989114    1.033414  ...    0.977359    0.906337    1.108375
min     -2.872262   -3.241267   -2.696887  ...   -2.619745   -2.650970   -2.940389
25%     -0.563813   -0.644153   -0.793239  ...   -0.545863   -0.552580   -0.565275
50%      0.092064   -0.000414   -0.022596  ...    0.003375    0.076043    0.004734
75%      0.763970    0.648317    0.613122  ...    0.756557    0.629461    0.812041
max      3.852731    2.560085    2.403416  ...    2.644343    3.078881    2.558199

[8 rows x 10 columns]
Summary statistics for target:
           target
count  200.000000
mean    23.300977
std    174.917242
min   -486.991074
25%    -59.706875
50%     23.130069
75%    130.523370
max    493.064789
First few rows of the features:
   feature_0  feature_1  feature_2  ...  feature_7  feature_8  feature_9
0  -2.872262   0.323168   0.513600  ...  -0.466037  -1.169917  -1.768439
1  -0.276813  -0.364953   2.056207  ...  -1.044809  -0.221254   1.072507
2  -0.692421  -0.622649  -0.611769  ...  -0.742471  -0.429302   1.695051
3   0.424166   2.075261  -0.651418  ...   1.735964  -0.320347   0.197600
4   0.751387  -0.238948   0.500917  ...  -0.576771   0.099332  -0.050238

[5 rows x 10 columns]
First few rows of the target:
       target
0 -379.387646
1 -374.210041
2 -242.702273
3  223.840598
4  -44.717138

The steps are as follows:

1. Import the make_regression function from sklearn.datasets:
   - This function generates a synthetic regression dataset with specified properties.
2. Generate the dataset using make_regression():
   - Specify n_samples, n_features, and noise, and set random_state so the result is reproducible.
3. Convert the dataset to a pandas DataFrame:
   - This makes the data easier to manipulate and analyze.
4. Print the dataset shape and feature types:
   - Access the shape using X_df.shape and y_df.shape.
   - Show the data types of the features and target using X_df.dtypes and y_df.dtypes.
5. Display summary statistics:
   - Use X_df.describe() and y_df.describe() to get a statistical summary of the features and target.
6. Display the first few rows of the dataset:
   - Print the initial rows using X_df.head() and y_df.head() to get a sense of the dataset structure and content.

This example demonstrates how to quickly generate and explore a synthetic regression dataset using scikit-learn’s make_regression() function, inspecting the data’s shape, types, summary statistics, and first few rows. This sets the stage for further preprocessing and application of regression algorithms.
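As one possible next step, the sketch below fits a regression model to the same generated data. The choice of LinearRegression, the 75/25 train/test split, and the split's random_state are illustrative assumptions, not part of the original example; any of the algorithms mentioned earlier could be substituted.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generate the same dataset as in the main example
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=42)

# Hold out 25% of the samples for evaluation (split ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit an ordinary least squares model on the training portion
model = LinearRegression()
model.fit(X_train, y_train)

# score() returns the coefficient of determination (R^2) on the held-out data
r2 = model.score(X_test, y_test)
print(f"Test R^2: {r2:.4f}")
```

Because the target is a linear function of the features plus very little noise, a linear model should fit this data almost perfectly. Raising noise makes the problem harder and gives a more realistic benchmark.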
