Need a synthetic dataset to test and validate classification models? Use make_classification
to create a custom dataset tailored to your specific needs.
The make_classification function from scikit-learn generates a dataset with a specified number of samples, features, informative features, and redundant features. This is particularly useful for binary classification problems, where common algorithms such as Logistic Regression, Decision Trees, and Random Forests are applied.
Key function arguments include:
- n_samples to set the number of samples, e.g., 1000.
- n_features to define the total number of features, e.g., 20.
- n_informative to specify the number of informative features, e.g., 2.
- n_redundant to indicate the number of redundant features, e.g., 2.
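As a quick sketch of how these arguments shape the result, here is a minimal call with smaller, arbitrary numbers (not the values used in the walkthrough below):

```python
from sklearn.datasets import make_classification
import numpy as np

# Illustrative configuration; the numbers here are arbitrary
X, y = make_classification(
    n_samples=100,    # rows in X
    n_features=5,     # columns in X
    n_informative=3,  # features that actually drive the class label
    n_redundant=1,    # linear combinations of the informative features
    random_state=0,   # reproducible output
)
print(X.shape)       # (100, 5)
print(np.unique(y))  # [0 1] -- binary by default (n_classes=2)
```

The remaining n_features - n_informative - n_redundant columns are filled with noise features, which is what makes the dataset a useful test of whether a model can ignore irrelevant inputs.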
from sklearn.datasets import make_classification
import pandas as pd
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=2, random_state=42)
# Convert to DataFrame for easier manipulation
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(20)])
y_df = pd.DataFrame(y, columns=['target'])
# Display dataset shape and types
print(f"Dataset shape: {X_df.shape}")
print(f"Feature types:\n{X_df.dtypes}")
# Show summary statistics
print(f"Summary statistics:\n{X_df.describe()}")
# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{X_df.head()}")
# Split the dataset into input and output elements
print(f"Input shape: {X_df.shape}")
print(f"Output shape: {y_df.shape}")
Running the example gives an output like:
Dataset shape: (1000, 20)
Feature types:
feature_0 float64
feature_1 float64
feature_2 float64
feature_3 float64
feature_4 float64
feature_5 float64
feature_6 float64
feature_7 float64
feature_8 float64
feature_9 float64
feature_10 float64
feature_11 float64
feature_12 float64
feature_13 float64
feature_14 float64
feature_15 float64
feature_16 float64
feature_17 float64
feature_18 float64
feature_19 float64
dtype: object
Summary statistics:
feature_0 feature_1 ... feature_18 feature_19
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean -0.008362 0.029704 ... 0.021475 -0.005455
std 1.023509 0.858969 ... 0.818102 1.021241
min -3.688365 -3.281236 ... -2.787107 -3.250333
25% -0.690946 -0.456377 ... -0.517717 -0.708232
50% 0.018186 0.050139 ... 0.128403 0.010684
75% 0.698082 0.544342 ... 0.539337 0.707936
max 3.529055 2.872178 ... 2.817960 3.152057
[8 rows x 20 columns]
First few rows of the dataset:
feature_0 feature_1 feature_2 ... feature_17 feature_18 feature_19
0 -0.669356 -1.495778 -0.870766 ... -1.267337 -1.276334 1.016643
1 0.093372 0.785848 0.105754 ... -0.122709 0.693431 0.911363
2 -0.905797 -0.608341 0.295141 ... 0.830498 -0.737332 -0.578212
3 -0.585793 0.389279 0.698816 ... -0.346772 0.034246 -1.040199
4 1.146441 0.515579 -1.222895 ... 1.259233 0.360015 1.920368
[5 rows x 20 columns]
Input shape: (1000, 20)
Output shape: (1000, 1)
The steps are as follows:
Import make_classification from sklearn.datasets and pandas:
- These imports allow generating a synthetic dataset and converting it to a DataFrame.
Generate the dataset using make_classification():
- Create 1000 samples with 20 features, 2 informative and 2 redundant features, ensuring reproducibility with random_state=42.
Convert to DataFrame for easier manipulation:
- Use pandas.DataFrame to convert X and y to DataFrame format with appropriate column names.
Print the dataset shape and feature types:
- Access the shape using X_df.shape.
- Show the data types of the features using X_df.dtypes.
Display summary statistics:
- Use X_df.describe() to get a statistical summary of the dataset.
Display the first few rows of the dataset:
- Print the initial rows using X_df.head() to get a sense of the dataset structure and content.
Split the dataset into input and output elements:
- Confirm the split by printing the shapes of X_df and y_df.
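One property these steps do not inspect is the class balance of the target. Regenerating the same dataset, a quick check might look like the sketch below; note that make_classification produces roughly balanced classes by default, with the flip_y parameter (default 0.01) flipping about 1% of labels as noise, so the counts are close to but not exactly 500/500.

```python
from sklearn.datasets import make_classification
import pandas as pd

# Regenerate the same dataset as in the walkthrough above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=2, random_state=42)
y_df = pd.DataFrame(y, columns=['target'])

# Count samples per class
print(y_df['target'].value_counts())
```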
This example demonstrates how to quickly generate a synthetic dataset using make_classification, creating a customized dataset for testing classification algorithms. The steps include dataset generation, conversion to DataFrame, and inspection of dataset properties, setting the stage for further preprocessing and model application.
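To close the loop, the generated data can feed directly into one of the classifiers mentioned earlier. Here is a minimal sketch using LogisticRegression; the model choice and the 80/20 train/test split are illustrative, not part of the example above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same dataset as in the example above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=2, random_state=42)

# Illustrative 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a simple baseline model and report held-out accuracy
model = LogisticRegression()
model.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

Because only 2 of the 20 features are informative, this also makes a handy testbed for comparing models or feature-selection methods on the same controlled data.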