The make_friedman1 dataset is a synthetic dataset often used for testing regression algorithms. It simulates a nonlinear regression problem in which a continuous target variable depends on five input features, each drawn uniformly from the interval [0, 1].
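The scikit-learn documentation gives the target as a fixed function of the first five features: y = 10 * sin(pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2 + 10 * X[:, 3] + 5 * X[:, 4] + noise * N(0, 1). The minimal sketch below checks that formula against the generator output with noise=0.0. The names y_manual and matches are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_friedman1

# Generate a noise-free dataset so the target is a deterministic function of X
X, y = make_friedman1(n_samples=100, n_features=5, noise=0.0, random_state=42)

# Recompute the target from the documented Friedman #1 formula
y_manual = (
    10 * np.sin(np.pi * X[:, 0] * X[:, 1])
    + 20 * (X[:, 2] - 0.5) ** 2
    + 10 * X[:, 3]
    + 5 * X[:, 4]
)

# With noise=0.0 the two should agree up to floating-point error
matches = np.allclose(y, y_manual)
print(f"Formula matches generated target: {matches}")
```

The sine and squared terms are what make the problem nonlinear, which is why purely linear models tend to leave structure unexplained on this dataset.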
Key arguments when generating the dataset include n_samples to specify the number of samples, n_features to define the number of features (at least five; only the first five influence the target), and noise to set the standard deviation of the Gaussian noise added to the output.
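As a rough illustration of these arguments, the sketch below generates the data twice with the same seed but different noise levels. It assumes the generator draws the features before the noise, which matches the scikit-learn source as far as I can tell but is an implementation detail rather than a documented guarantee. The names residual_std and too_few_rejected are illustrative.

```python
import numpy as np
from sklearn.datasets import make_friedman1

# Same seed, different noise levels: the features are expected to be identical
X_clean, y_clean = make_friedman1(n_samples=1000, n_features=10, noise=0.0, random_state=0)
X_noisy, y_noisy = make_friedman1(n_samples=1000, n_features=10, noise=2.0, random_state=0)

print(f"Feature matrix shape: {X_clean.shape}")
print(f"Features identical across noise levels: {np.array_equal(X_clean, X_noisy)}")

# The difference between the two targets is the added Gaussian noise,
# so its standard deviation should be close to the noise argument
residual_std = np.std(y_noisy - y_clean)
print(f"Empirical noise standard deviation: {residual_std:.2f}")

# Requesting fewer than five features is rejected with a ValueError
try:
    make_friedman1(n_samples=10, n_features=4)
    too_few_rejected = False
except ValueError:
    too_few_rejected = True
print(f"n_features=4 rejected: {too_few_rejected}")
```

With n_features=10, the last five columns are independent of the target, which makes the dataset a convenient test bed for feature selection.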
This is a regression problem where common algorithms like Linear Regression, Decision Trees, and Gradient Boosting Regressors are typically applied.
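To make that concrete, here is a minimal sketch that fits the three model families mentioned above and reports R^2 on a held-out split. The sample size, noise level, split ratio, and default hyperparameters are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# A larger, noisy sample so the comparison is less sensitive to the split
X, y = make_friedman1(n_samples=500, n_features=5, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}

# Fit each model and record its R^2 on the held-out split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Scores from a single split are only indicative; cross-validation would give a steadier comparison.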
from sklearn.datasets import make_friedman1
import pandas as pd
# Generate the dataset
X, y = make_friedman1(n_samples=100, n_features=5, noise=0.0, random_state=42)
# Convert to DataFrame for easier manipulation
X_df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(1, 6)])
y_df = pd.Series(y, name="target")
# Display dataset shape and types
print(f"Dataset shape: {X_df.shape}")
print(f"Feature types:\n{X_df.dtypes}")
# Show summary statistics
print(f"Summary statistics:\n{X_df.describe()}")
# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{X_df.head()}")
# Split the dataset into input and output elements
print(f"Input shape: {X_df.shape}")
print(f"Output shape: {y_df.shape}")
Running the example gives an output like:
Dataset shape: (100, 5)
Feature types:
feature_1    float64
feature_2    float64
feature_3    float64
feature_4    float64
feature_5    float64
dtype: object
Summary statistics:
        feature_1   feature_2   feature_3   feature_4   feature_5
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.504711    0.525232    0.487698    0.521254    0.453913
std      0.297463    0.299905    0.294821    0.293220    0.308130
min      0.009197    0.011354    0.005522    0.005062    0.015457
25%      0.279637    0.281719    0.193630    0.297294    0.167322
50%      0.502326    0.556379    0.513164    0.532713    0.419329
75%      0.773470    0.772236    0.746157    0.798420    0.726623
max      0.992965    0.990054    0.966655    0.974395    0.986887
First few rows of the dataset:
   feature_1  feature_2  feature_3  feature_4  feature_5
0   0.374540   0.950714   0.731994   0.598658   0.156019
1   0.155995   0.058084   0.866176   0.601115   0.708073
2   0.020584   0.969910   0.832443   0.212339   0.181825
3   0.183405   0.304242   0.524756   0.431945   0.291229
4   0.611853   0.139494   0.292145   0.366362   0.456070
Input shape: (100, 5)
Output shape: (100,)
The steps are as follows:
1. Import the make_friedman1 function from sklearn.datasets and pandas for data manipulation:
   - The make_friedman1 function generates a synthetic regression dataset, while pandas helps with data manipulation and exploration.
2. Generate the dataset using make_friedman1():
   - Specify n_samples=100 to create 100 samples, n_features=5 to generate five features, and random_state=42 for reproducibility.
3. Convert the generated arrays to a DataFrame and Series:
   - Use pandas.DataFrame to create a DataFrame for the features and pandas.Series for the target variable.
4. Print the dataset shape and feature types:
   - Access the shape using X_df.shape.
   - Show the data types of the features using X_df.dtypes.
5. Display summary statistics:
   - Use X_df.describe() to get a statistical summary of the dataset.
6. Display the first few rows of the dataset:
   - Print the initial rows using X_df.head() to get a sense of the dataset structure and content.
7. Split the dataset into input and output elements:
   - Confirm the shapes of X_df (features) and y_df (target) to ensure the data is correctly separated.
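The steps above inspect only the features. As one possible extension, the sketch below summarizes the target as well and looks at how each feature correlates with it. The names full_df and target_corr are hypothetical, introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import make_friedman1

# Recreate the same dataset and pandas objects used in the example
X, y = make_friedman1(n_samples=100, n_features=5, noise=0.0, random_state=42)
X_df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(1, 6)])
y_df = pd.Series(y, name="target")

# Summarize the target alongside the features
print(f"Target summary:\n{y_df.describe()}")

# Join features and target into one frame (full_df is an illustrative name)
full_df = pd.concat([X_df, y_df], axis=1)

# Pearson correlation of each feature with the target
target_corr = full_df.corr()["target"].drop("target")
print(f"Correlation with target:\n{target_corr}")
```

Pearson correlation captures only linear association, so a feature that enters the target through a nonlinear term can show a weak correlation while still being informative.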
This example demonstrates how to quickly generate and explore the make_friedman1 synthetic dataset with scikit-learn, setting the stage for further preprocessing and the application of regression algorithms.