Need a synthetic regression dataset for testing and experimentation? The make_regression function from scikit-learn generates one with a specified number of samples, number of features, noise level, and other properties.
Key arguments include n_samples, which sets the number of samples (a few hundred is typical for quick testing); n_features, which sets the number of features (often 10-50, depending on the desired complexity); and noise, which sets the standard deviation of the Gaussian noise added to the output. The result is a regression problem to which algorithms such as Linear Regression, Ridge Regression, Lasso Regression, and Decision Trees are often applied.
from sklearn.datasets import make_regression
import pandas as pd
# Generate the dataset
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=42)
# Convert to pandas DataFrame for easier manipulation
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y_df = pd.DataFrame(y, columns=['target'])
# Display dataset shape and types
print(f"Dataset shape: {X_df.shape}, {y_df.shape}")
print(f"Feature types:\n{X_df.dtypes}\n{y_df.dtypes}")
# Show summary statistics
print(f"Summary statistics for features:\n{X_df.describe()}")
print(f"Summary statistics for target:\n{y_df.describe()}")
# Display first few rows of the dataset
print(f"First few rows of the features:\n{X_df.head()}")
print(f"First few rows of the target:\n{y_df.head()}")
Running the example gives an output like:
Dataset shape: (200, 10), (200, 1)
Feature types:
feature_0 float64
feature_1 float64
feature_2 float64
feature_3 float64
feature_4 float64
feature_5 float64
feature_6 float64
feature_7 float64
feature_8 float64
feature_9 float64
dtype: object
target float64
dtype: object
Summary statistics for features:
feature_0 feature_1 feature_2 ... feature_7 feature_8 feature_9
count 200.000000 200.000000 200.000000 ... 200.000000 200.000000 200.000000
mean 0.152737 0.006138 -0.048859 ... 0.114879 0.053059 0.038937
std 1.006109 0.989114 1.033414 ... 0.977359 0.906337 1.108375
min -2.872262 -3.241267 -2.696887 ... -2.619745 -2.650970 -2.940389
25% -0.563813 -0.644153 -0.793239 ... -0.545863 -0.552580 -0.565275
50% 0.092064 -0.000414 -0.022596 ... 0.003375 0.076043 0.004734
75% 0.763970 0.648317 0.613122 ... 0.756557 0.629461 0.812041
max 3.852731 2.560085 2.403416 ... 2.644343 3.078881 2.558199
[8 rows x 10 columns]
Summary statistics for target:
target
count 200.000000
mean 23.300977
std 174.917242
min -486.991074
25% -59.706875
50% 23.130069
75% 130.523370
max 493.064789
First few rows of the features:
feature_0 feature_1 feature_2 ... feature_7 feature_8 feature_9
0 -2.872262 0.323168 0.513600 ... -0.466037 -1.169917 -1.768439
1 -0.276813 -0.364953 2.056207 ... -1.044809 -0.221254 1.072507
2 -0.692421 -0.622649 -0.611769 ... -0.742471 -0.429302 1.695051
3 0.424166 2.075261 -0.651418 ... 1.735964 -0.320347 0.197600
4 0.751387 -0.238948 0.500917 ... -0.576771 0.099332 -0.050238
[5 rows x 10 columns]
First few rows of the target:
target
0 -379.387646
1 -374.210041
2 -242.702273
3 223.840598
4 -44.717138
The steps are as follows:
Import the make_regression function from sklearn.datasets:
- This function allows us to generate a synthetic regression dataset with specified properties.
Generate the dataset using make_regression():
- Specify parameters like n_samples, n_features, noise, and random_state for reproducibility.
Convert the dataset to a pandas DataFrame:
- This facilitates easier data manipulation and analysis.
Print the dataset shape and feature types:
- Access the shape using X_df.shape and y_df.shape.
- Show the data types of the features and target using X_df.dtypes and y_df.dtypes.
Display summary statistics:
- Use X_df.describe() and y_df.describe() to get a statistical summary of the features and target.
Display the first few rows of the dataset:
- Print the initial rows using X_df.head() and y_df.head() to get a sense of the dataset structure and content.
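The effect of the noise argument from the generation step can be checked directly. The sketch below is not part of the walkthrough above. It assumes the coef=True option of make_regression, which additionally returns the coefficients of the underlying linear model, and the default bias of 0.0. Under those assumptions, the residual y - X @ coef should have a standard deviation close to the noise value. The variable names noise_level and residual are illustrative.

```python
from sklearn.datasets import make_regression

# Illustrative noise setting, matching the value used in the example above
noise_level = 0.1

# Assumption: coef=True makes make_regression also return the true coefficients
X, y, coef = make_regression(
    n_samples=200,
    n_features=10,
    noise=noise_level,
    coef=True,
    random_state=42,
)

# With the default bias of 0.0, the target is X @ coef plus Gaussian noise,
# so subtracting the noiseless signal leaves only the noise term
residual = y - X @ coef

print(f"Coefficient vector shape: {coef.shape}")
print(f"Residual standard deviation: {residual.std():.3f} (noise={noise_level})")
```

The residual spread is only an estimate from 200 samples, so it should land near the noise value rather than match it exactly.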
In short, make_regression() makes it quick to generate a synthetic regression dataset, and pandas makes it easy to inspect the data's shape, types, summary statistics, and first few rows. This sets the stage for further preprocessing and application of regression algorithms.
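As one possible next step, the generated data can be passed to a regression algorithm. The following is a minimal sketch rather than part of the original example. It uses Linear Regression, one of the algorithms named earlier, and the train/test split with its test_size=0.25 setting is an assumption made here for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Generate the same dataset as in the example above
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=42)

# Hold out a quarter of the samples for evaluation (illustrative choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit an ordinary least squares model on the training portion
model = LinearRegression()
model.fit(X_train, y_train)

# score() reports the coefficient of determination (R^2) on the held-out data
r2 = model.score(X_test, y_test)
print(f"R^2 on held-out data: {r2:.4f}")
```

Because the data come from a linear model with very little noise relative to the spread of the target, a linear fit is expected to score very close to 1 here. Raising noise or choosing a less suitable model would lower that score.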