Scikit-Learn make_friedman2() Dataset


The Friedman #2 dataset, generated by make_friedman2(), is a synthetic regression dataset with four input features and a nonlinear target. It is commonly used for testing and benchmarking regression models.
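For reference, the scikit-learn documentation gives the Friedman #2 target as y = sqrt(X0^2 + (X1 * X2 - 1 / (X1 * X3))^2) + noise * N(0, 1), with the four features drawn uniformly from [0, 100], [40π, 560π], [0, 1], and [1, 11] respectively. The short sketch below checks that relationship by setting noise=0.0 and recomputing the target by hand. The variable names x0 to x3 and y_manual are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_friedman2

# With noise=0.0 the target should equal the deterministic formula
X, y = make_friedman2(n_samples=200, noise=0.0, random_state=0)

# Unpack the four feature columns (names are illustrative)
x0, x1, x2, x3 = X.T

# Friedman #2 target, written out from the documented formula
y_manual = np.sqrt(x0 ** 2 + (x1 * x2 - 1.0 / (x1 * x3)) ** 2)

# Compare the library output with the hand-computed target
print(f"Matches documented formula: {np.allclose(y, y_manual)}")
```

If the check passes, the generated target is a fixed function of the features, so any residual error in a model trained on noise-free data reflects the model rather than the data.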

Key arguments include n_samples, which sets the number of samples, noise, which sets the standard deviation of the Gaussian noise added to the target variable, and random_state, which makes the generated data reproducible.
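To see what the noise argument does in practice, the sketch below generates the dataset twice with the same random_state, once without noise and once with noise=10.0, and compares the results. It relies on the behavior that the features are drawn before the noise, so both calls are expected to return the same X. This is an observation about current scikit-learn releases rather than a documented guarantee. The sample size and noise level are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman2

# Same seed, different noise levels (values chosen for illustration)
X_clean, y_clean = make_friedman2(n_samples=5000, noise=0.0, random_state=42)
X_noisy, y_noisy = make_friedman2(n_samples=5000, noise=10.0, random_state=42)

# The features should be unaffected by the noise argument
same_features = np.array_equal(X_clean, X_noisy)

# The spread of the difference should be close to the requested noise level
residual_std = np.std(y_noisy - y_clean)

print(f"Features identical: {same_features}")
print(f"Std of added noise: {residual_std:.2f}")
```

The measured standard deviation should land near 10, which suggests that noise is applied to the target only and is expressed in the same units as y.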

This is a regression problem where algorithms like Linear Regression, Decision Trees, and Gradient Boosting can be applied.

import pandas as pd
from sklearn.datasets import make_friedman2

# Generate the dataset
X, y = make_friedman2(n_samples=100, noise=0.1, random_state=42)

# Display dataset shape and types
print(f"Dataset shape: {X.shape}")
print(f"Feature types: {type(X)}, {type(y)}")

# Show summary statistics of features
X_df = pd.DataFrame(X)
print(f"Summary statistics:\n{X_df.describe()}")

# Display first few rows of the features and target
print(f"First few rows of features:\n{X_df.head()}")
print(f"First few values of target:\n{y[:5]}")

Running the example gives an output like:

Dataset shape: (100, 4)
Feature types: <class 'numpy.ndarray'>, <class 'numpy.ndarray'>
Summary statistics:
                0            1           2           3
count  100.000000   100.000000  100.000000  100.000000
mean    49.799188   930.185910    0.498645    5.876502
std     30.921854   472.647106    0.294562    2.856814
min      0.506158   140.688269    0.020584    1.165878
25%     25.857039   514.488715    0.270935    3.389837
50%     52.076173   971.477746    0.507239    6.056249
75%     79.596344  1285.055203    0.730203    8.353911
max     96.361998  1743.043575    0.990505   10.717821
First few rows of features:
           0            1         2          3
0  37.454012  1678.777388  0.731994   6.986585
1  15.601864   380.500750  0.058084   9.661761
2  60.111501  1282.391023  0.020584  10.699099
3  83.244264   472.546861  0.181825   2.834045
4  30.424224   982.920600  0.431945   3.912291
First few values of target:
[1229.55598442   27.05490178   65.72038418  119.601185    425.68850791]

Import the make_friedman2 function from sklearn.datasets:
- This function generates the Friedman #2 regression problem dataset with four features.
Generate the dataset using make_friedman2():
- Use n_samples=100 to generate 100 samples.
- Use noise=0.1 to add a small amount of noise to the target variable.
- Set random_state=42 for reproducibility.
Print the dataset shape and feature types:
- The dataset is returned as a tuple (X, y), where X is the feature matrix of shape (n_samples, 4) and y is the target array of shape (n_samples,).
Show summary statistics of the features:
- Convert X to a DataFrame for easier manipulation and display.
- Use X_df.describe() to get a statistical summary of the features.
Display the first few rows of the features and target:
- Print the initial rows using X_df.head() to inspect the features.
- Print the first few values of y to inspect the target variable.
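As a small variation on the steps above, the features and target can be kept in a single DataFrame with named columns, which makes the printed output easier to read. The column names x0 to x3 and target are hypothetical labels chosen here, since make_friedman2() returns unnamed NumPy arrays.

```python
import pandas as pd
from sklearn.datasets import make_friedman2

# Generate the same dataset as in the main example
X, y = make_friedman2(n_samples=100, noise=0.1, random_state=42)

# Hypothetical column labels; the function itself provides none
feature_names = ["x0", "x1", "x2", "x3"]
df = pd.DataFrame(X, columns=feature_names)
df["target"] = y

# Inspect the first rows and the linear correlation of each column with the target
print(df.head())
print(df.corr()["target"].round(3))
```

The correlation column gives only a rough, linear view of each feature's relationship with the target. Because the target is nonlinear in the features, a low correlation does not by itself mean a feature is unimportant.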

This example demonstrates how to generate and explore the make_friedman2 dataset using scikit-learn. Inspecting the shape, types, and summary statistics of the data sets the stage for applying regression algorithms.
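As one possible next step, the sketch below fits the three kinds of model mentioned earlier (linear regression, a decision tree, and gradient boosting) and reports R^2 on a held-out split. The sample size, split ratio, and default hyperparameters are assumptions made for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_friedman2
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# A larger sample than the main example, so the test split is not tiny
X, y = make_friedman2(n_samples=1000, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Candidate models with default settings (illustrative, untuned)
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record R^2 on the held-out data
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Exact scores depend on the scikit-learn version and the split. Because the target is nonlinear in the features, comparing a linear model against tree-based models is a reasonable way to use this dataset as a benchmark.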
