Scikit-Learn make_friedman1() Dataset

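The make_friedman1() function generates the synthetic "Friedman #1" regression dataset, which is often used for testing regression algorithms. The input features are sampled independently from a uniform distribution on [0, 1], and the continuous target is a nonlinear function of the first five features. Any features beyond the first five are independent of the target.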
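
To make the target function concrete, the sketch below recomputes the target from the formula given in the scikit-learn documentation, y = 10 * sin(pi * x0 * x1) + 20 * (x2 - 0.5)^2 + 10 * x3 + 5 * x4, and compares it with the generated values. The names y_manual and matches_formula are illustrative only, and noise is set to 0.0 so that the two should agree up to floating-point error.

```python
import numpy as np
from sklearn.datasets import make_friedman1

# Generate a noise-free dataset so the target is purely the Friedman #1 signal
X, y = make_friedman1(n_samples=100, n_features=5, noise=0.0, random_state=42)

# Recompute the target by hand from the first five feature columns
y_manual = (
    10 * np.sin(np.pi * X[:, 0] * X[:, 1])
    + 20 * (X[:, 2] - 0.5) ** 2
    + 10 * X[:, 3]
    + 5 * X[:, 4]
)

# Compare the generated target with the hand-computed one
matches_formula = np.allclose(y, y_manual)
print(f"Matches formula: {matches_formula}")
```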

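Key arguments when generating the dataset include n_samples to specify the number of samples (default 100), n_features to define the number of features (default 10, and at least 5), noise to set the standard deviation of the Gaussian noise added to the output (default 0.0), and random_state to make the generated data reproducible.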
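
The following sketch illustrates how these arguments behave. It assumes that, for a fixed integer random_state, the features are drawn before the noise, so a noisy and a noise-free call produce the same inputs and differ only in the target. The variable names (X_clean, residual_std, rejected) are illustrative.

```python
import numpy as np
from sklearn.datasets import make_friedman1

# Same seed, with and without noise; ten features, of which only five drive y
X_clean, y_clean = make_friedman1(n_samples=1000, n_features=10, noise=0.0, random_state=0)
X_noisy, y_noisy = make_friedman1(n_samples=1000, n_features=10, noise=1.0, random_state=0)
print(f"X shape: {X_noisy.shape}, y shape: {y_noisy.shape}")

# The difference between the two targets is the added Gaussian noise
residual_std = np.std(y_noisy - y_clean)
print(f"Std of added noise: {residual_std:.3f}")

# Requesting fewer than five features is rejected with a ValueError
try:
    make_friedman1(n_features=4)
    rejected = False
except ValueError as exc:
    rejected = True
    print(f"n_features=4 rejected: {exc}")
```

The standard deviation of the difference should land close to the noise value of 1.0, though the exact figure depends on the sample.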

Because the target depends nonlinearly on the inputs, this dataset is a useful benchmark for comparing regression algorithms such as Linear Regression, Decision Trees, and Gradient Boosting Regressors.
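
As a rough illustration of applying these algorithms, the sketch below fits all three on a train/test split and reports the R^2 score on the held-out data. The sample size, noise level, split ratio, and default hyperparameters are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Generate a moderately noisy dataset and hold out 20% for evaluation
X, y = make_friedman1(n_samples=500, n_features=5, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate regressors with default settings
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its R^2 score on the test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```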

from sklearn.datasets import make_friedman1
import pandas as pd

# Generate the dataset
X, y = make_friedman1(n_samples=100, n_features=5, noise=0.0, random_state=42)

# Convert to DataFrame for easier manipulation
X_df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(1, 6)])
y_df = pd.Series(y, name="target")

# Display dataset shape and types
print(f"Dataset shape: {X_df.shape}")
print(f"Feature types:\n{X_df.dtypes}")

# Show summary statistics
print(f"Summary statistics:\n{X_df.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{X_df.head()}")

# Confirm the shapes of the input and output elements
print(f"Input shape: {X_df.shape}")
print(f"Output shape: {y_df.shape}")

Running the example gives an output like:

Dataset shape: (100, 5)
Feature types:
feature_1    float64
feature_2    float64
feature_3    float64
feature_4    float64
feature_5    float64
dtype: object
Summary statistics:
        feature_1   feature_2   feature_3   feature_4   feature_5
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.504711    0.525232    0.487698    0.521254    0.453913
std      0.297463    0.299905    0.294821    0.293220    0.308130
min      0.009197    0.011354    0.005522    0.005062    0.015457
25%      0.279637    0.281719    0.193630    0.297294    0.167322
50%      0.502326    0.556379    0.513164    0.532713    0.419329
75%      0.773470    0.772236    0.746157    0.798420    0.726623
max      0.992965    0.990054    0.966655    0.974395    0.986887
First few rows of the dataset:
   feature_1  feature_2  feature_3  feature_4  feature_5
0   0.374540   0.950714   0.731994   0.598658   0.156019
1   0.155995   0.058084   0.866176   0.601115   0.708073
2   0.020584   0.969910   0.832443   0.212339   0.181825
3   0.183405   0.304242   0.524756   0.431945   0.291229
4   0.611853   0.139494   0.292145   0.366362   0.456070
Input shape: (100, 5)
Output shape: (100,)

The steps are as follows:

Import the make_friedman1 function from sklearn.datasets and pandas for data manipulation:
- The make_friedman1 function generates a synthetic regression dataset, while pandas helps with data manipulation and exploration.
Generate the dataset using make_friedman1():
- Specify n_samples=100 to create 100 samples, n_features=5 to generate five features, noise=0.0 to leave the target noise-free, and random_state=42 for reproducibility.
Convert the generated arrays to a DataFrame and Series:
- Use pandas.DataFrame to create a DataFrame for the features and pandas.Series for the target variable.
Print the dataset shape and feature types:
- Access the shape using X_df.shape.
- Show the data types of the features using X_df.dtypes.
Display summary statistics:
- Use X_df.describe() to get a statistical summary of the dataset.
Display the first few rows of the dataset:
- Print the initial rows using X_df.head() to get a sense of the dataset structure and content.
Confirm the input and output shapes:
- Check the shapes of X_df (features) and y_df (target). make_friedman1() already returns the inputs and the target as separate arrays, so no explicit split is needed.
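
As one possible next exploration step beyond those listed above, the sketch below computes the Pearson correlation of each feature with the target. The target_corr name is illustrative. Pearson correlation only measures linear association, so a feature that enters the target through a nonlinear term may show a weak correlation even though it matters.

```python
import pandas as pd
from sklearn.datasets import make_friedman1

# Rebuild the same DataFrame and Series used in the example
X, y = make_friedman1(n_samples=100, n_features=5, noise=0.0, random_state=42)
X_df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(1, 6)])
y_df = pd.Series(y, name="target")

# Pearson correlation of each feature column with the target, highest first
target_corr = X_df.corrwith(y_df).sort_values(ascending=False)
print(f"Correlation with target:\n{target_corr}")
```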

This example demonstrates how to quickly generate and explore the Friedman #1 synthetic dataset using scikit-learn’s make_friedman1() function, setting the stage for further preprocessing and application of regression algorithms.

See Also