
Scikit-Learn make_friedman1() Dataset

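The make_friedman1 dataset is a synthetic dataset often used for testing regression algorithms. It simulates a nonlinear regression problem: the inputs are drawn uniformly from [0, 1], and the continuous target depends only on the first five features, through y = 10 sin(π x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5, plus optional Gaussian noise. Any features beyond the first five are independent of the target.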
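As a quick check of the Friedman #1 formula documented for make_friedman1(), the following sketch regenerates a noise-free dataset and recomputes the target by hand. The variable name y_manual is only illustrative; with noise=0.0 the two arrays should agree.

```python
import numpy as np
from sklearn.datasets import make_friedman1

# Noise-free data, so the target is a deterministic function of the features
X, y = make_friedman1(n_samples=100, n_features=5, noise=0.0, random_state=42)

# Recompute the target from the first five feature columns
y_manual = (
    10 * np.sin(np.pi * X[:, 0] * X[:, 1])
    + 20 * (X[:, 2] - 0.5) ** 2
    + 10 * X[:, 3]
    + 5 * X[:, 4]
)

print(f"Targets match the formula: {np.allclose(y, y_manual)}")
print(f"Feature range: [{X.min():.3f}, {X.max():.3f}]")
```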

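Key arguments when generating the dataset include n_samples to specify the number of samples, n_features to define the number of features (at least 5; the default is 10), noise to set the standard deviation of the Gaussian noise added to the output, and random_state to make the output reproducible.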
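To see what these arguments do in practice, here is a small sketch that generates the data twice with the same seed, once without noise and once with noise=2.0. The specific values (200 samples, 10 features, seed 0) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1

# Same seed and shape, differing only in the noise level
X_clean, y_clean = make_friedman1(n_samples=200, n_features=10, noise=0.0, random_state=0)
X_noisy, y_noisy = make_friedman1(n_samples=200, n_features=10, noise=2.0, random_state=0)

print(f"Feature matrix shape: {X_clean.shape}")
print(f"Same features for both calls: {np.array_equal(X_clean, X_noisy)}")

# The difference between the targets is the added noise;
# its sample standard deviation should be close to the noise argument
residual = y_noisy - y_clean
print(f"Std of added noise: {residual.std():.3f}")

# Fewer than five features is rejected, since the target needs five inputs
raised = False
try:
    make_friedman1(n_features=4)
except ValueError as exc:
    raised = True
    print(f"n_features=4 rejected: {type(exc).__name__}")
```

The seed fixes both the features and the noise draw, so rerunning the script reproduces the same arrays.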

This is a regression problem with a nonlinear target, which makes it useful for comparing linear models such as Linear Regression against nonlinear ones such as Decision Trees and Gradient Boosting Regressors.

from sklearn.datasets import make_friedman1
import pandas as pd

# Generate the dataset
X, y = make_friedman1(n_samples=100, n_features=5, noise=0.0, random_state=42)

# Convert to DataFrame for easier manipulation
X_df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(1, 6)])
y_df = pd.Series(y, name="target")

# Display dataset shape and types
print(f"Dataset shape: {X_df.shape}")
print(f"Feature types:\n{X_df.dtypes}")

# Show summary statistics
print(f"Summary statistics:\n{X_df.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{X_df.head()}")

# Confirm the shapes of the input and output elements
print(f"Input shape: {X_df.shape}")
print(f"Output shape: {y_df.shape}")

Running the example gives an output like:

Dataset shape: (100, 5)
Feature types:
feature_1    float64
feature_2    float64
feature_3    float64
feature_4    float64
feature_5    float64
dtype: object
Summary statistics:
        feature_1   feature_2   feature_3   feature_4   feature_5
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.504711    0.525232    0.487698    0.521254    0.453913
std      0.297463    0.299905    0.294821    0.293220    0.308130
min      0.009197    0.011354    0.005522    0.005062    0.015457
25%      0.279637    0.281719    0.193630    0.297294    0.167322
50%      0.502326    0.556379    0.513164    0.532713    0.419329
75%      0.773470    0.772236    0.746157    0.798420    0.726623
max      0.992965    0.990054    0.966655    0.974395    0.986887
First few rows of the dataset:
   feature_1  feature_2  feature_3  feature_4  feature_5
0   0.374540   0.950714   0.731994   0.598658   0.156019
1   0.155995   0.058084   0.866176   0.601115   0.708073
2   0.020584   0.969910   0.832443   0.212339   0.181825
3   0.183405   0.304242   0.524756   0.431945   0.291229
4   0.611853   0.139494   0.292145   0.366362   0.456070
Input shape: (100, 5)
Output shape: (100,)

The steps are as follows:

  1. Import the make_friedman1 function from sklearn.datasets and pandas for data manipulation:

    • The make_friedman1 function generates a synthetic regression dataset, while pandas helps with data manipulation and exploration.
  2. Generate the dataset using make_friedman1():

    • Specify n_samples=100 to create 100 samples, n_features=5 to generate five features, noise=0.0 for a noise-free target, and random_state=42 for reproducibility.
  3. Convert the generated arrays to a DataFrame and Series:

    • Use pandas.DataFrame to create a DataFrame for the features and pandas.Series for the target variable.
  4. Print the dataset shape and feature types:

    • Access the shape using X_df.shape.
    • Show the data types of the features using X_df.dtypes.
  5. Display summary statistics:

    • Use X_df.describe() to get a statistical summary of the dataset.
  6. Display the first few rows of the dataset:

    • Print the initial rows using X_df.head() to get a sense of the dataset structure and content.
  7. Confirm the input and output elements:

    • make_friedman1() already returns the features and target separately; print the shapes of X_df (features) and y_df (target) to confirm they have the same number of rows.
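The steps above can also be bundled into one function when the features and target are wanted in a single table. The sketch below assumes a hypothetical helper named friedman1_frame, which is not part of scikit-learn, and reuses the column naming from the example.

```python
from sklearn.datasets import make_friedman1
import pandas as pd


def friedman1_frame(n_samples=100, n_features=5, noise=0.0, random_state=42):
    """Hypothetical helper: features and target together in one DataFrame."""
    X, y = make_friedman1(
        n_samples=n_samples,
        n_features=n_features,
        noise=noise,
        random_state=random_state,
    )
    columns = [f"feature_{i}" for i in range(1, n_features + 1)]
    df = pd.DataFrame(X, columns=columns)
    df["target"] = y  # append the target as the last column
    return df


df = friedman1_frame()
print(f"Combined shape: {df.shape}")
print(f"Missing values: {df.isna().sum().sum()}")
print(f"Target summary:\n{df['target'].describe()}")
```

Keeping the target in the same DataFrame makes it easy to summarize it alongside the features; the separate X_df and y_df from the example remain the more convenient form for fitting a model.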

This example demonstrates how to quickly generate and explore a synthetic regression dataset using scikit-learn’s make_friedman1() function, setting the stage for further preprocessing and application of regression algorithms.
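As one possible next step, the sketch below fits the three regressors mentioned earlier and reports R^2 on a held-out split. The sample size, noise level, split ratio, and default hyperparameters are assumptions chosen for illustration, not tuned settings.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# A larger, noisy sample so the held-out scores are more stable
X, y = make_friedman1(n_samples=500, n_features=5, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its R^2 on the test split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

The exact scores depend on the seed and settings, so treat them as a rough comparison rather than a benchmark.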


