Need a synthetic dataset to test and validate classification models? Use make_classification
to create a custom dataset tailored to your specific needs.
The make_classification function from scikit-learn generates a dataset with a specified number of samples, features, informative features, and redundant features. This is particularly useful for binary classification problems, where common algorithms such as Logistic Regression, Decision Trees, and Random Forests are applied.
Key function arguments include:
- n_samples to set the number of samples, e.g., 1000.
- n_features to define the total number of features, e.g., 20.
- n_informative to specify the number of informative features, e.g., 2.
- n_redundant to indicate the number of redundant features, e.g., 2.
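As a quick sketch of how these arguments shape the result, here is a minimal call with smaller, arbitrary numbers (not the values used in the walkthrough below):

```python
from sklearn.datasets import make_classification
import numpy as np

# Illustrative configuration; the numbers here are arbitrary
X, y = make_classification(
    n_samples=100,    # rows in X
    n_features=5,     # columns in X
    n_informative=3,  # features that actually drive the class label
    n_redundant=1,    # linear combinations of the informative features
    random_state=0,   # reproducible output
)
print(X.shape)       # (100, 5)
print(np.unique(y))  # [0 1] -- binary by default (n_classes=2)
```

The remaining n_features - n_informative - n_redundant columns are filled with noise features, which is what makes the dataset a useful test of whether a model can ignore irrelevant inputs.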
from sklearn.datasets import make_classification
import pandas as pd
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2, n_redundant=2, random_state=42)
# Convert to DataFrame for easier manipulation
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(20)])
y_df = pd.DataFrame(y, columns=['target'])
# Display dataset shape and types
print(f"Dataset shape: {X_df.shape}")
print(f"Feature types:\n{X_df.dtypes}")
# Show summary statistics
print(f"Summary statistics:\n{X_df.describe()}")
# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{X_df.head()}")
# Split the dataset into input and output elements
print(f"Input shape: {X_df.shape}")
print(f"Output shape: {y_df.shape}")
Running the example gives an output like:
Dataset shape: (1000, 20)
Feature types:
feature_0 float64
feature_1 float64
feature_2 float64
feature_3 float64
feature_4 float64
feature_5 float64
feature_6 float64
feature_7 float64
feature_8 float64
feature_9 float64
feature_10 float64
feature_11 float64
feature_12 float64
feature_13 float64
feature_14 float64
feature_15 float64
feature_16 float64
feature_17 float64
feature_18 float64
feature_19 float64
dtype: object
Summary statistics:
feature_0 feature_1 ... feature_18 feature_19
count 1000.000000 1000.000000 ... 1000.000000 1000.000000
mean -0.008362 0.029704 ... 0.021475 -0.005455
std 1.023509 0.858969 ... 0.818102 1.021241
min -3.688365 -3.281236 ... -2.787107 -3.250333
25% -0.690946 -0.456377 ... -0.517717 -0.708232
50% 0.018186 0.050139 ... 0.128403 0.010684
75% 0.698082 0.544342 ... 0.539337 0.707936
max 3.529055 2.872178 ... 2.817960 3.152057
[8 rows x 20 columns]
First few rows of the dataset:
feature_0 feature_1 feature_2 ... feature_17 feature_18 feature_19
0 -0.669356 -1.495778 -0.870766 ... -1.267337 -1.276334 1.016643
1 0.093372 0.785848 0.105754 ... -0.122709 0.693431 0.911363
2 -0.905797 -0.608341 0.295141 ... 0.830498 -0.737332 -0.578212
3 -0.585793 0.389279 0.698816 ... -0.346772 0.034246 -1.040199
4 1.146441 0.515579 -1.222895 ... 1.259233 0.360015 1.920368
[5 rows x 20 columns]
Input shape: (1000, 20)
Output shape: (1000, 1)
The steps are as follows:
Import make_classification from sklearn.datasets and pandas:
- These imports allow generating a synthetic dataset and converting it to a DataFrame.
Generate the dataset using make_classification():
- Create 1000 samples with 20 features, 2 informative and 2 redundant features, ensuring reproducibility with random_state=42.
Convert to DataFrame for easier manipulation:
- Use pandas.DataFrame to convert X and y to DataFrame format with appropriate column names.
Print the dataset shape and feature types:
- Access the shape using X_df.shape.
- Show the data types of the features using X_df.dtypes.
Display summary statistics:
- Use X_df.describe() to get a statistical summary of the dataset.
Display the first few rows of the dataset:
- Print the initial rows using X_df.head() to get a sense of the dataset structure and content.
Split the dataset into input and output elements:
- Confirm the split by printing the shapes of X_df and y_df.
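One property these steps do not inspect is the class balance of the target. Regenerating the same dataset, a quick check might look like the sketch below; note that make_classification produces roughly balanced classes by default, with the flip_y parameter (default 0.01) flipping about 1% of labels as noise, so the counts are close to but not exactly 500/500.

```python
from sklearn.datasets import make_classification
import pandas as pd

# Regenerate the same dataset as in the walkthrough above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=2, random_state=42)
y_df = pd.DataFrame(y, columns=['target'])

# Count samples per class
print(y_df['target'].value_counts())
```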
This example demonstrates how to quickly generate a synthetic dataset using make_classification, creating a customized dataset for testing classification algorithms. The steps include dataset generation, conversion to DataFrame, and inspection of dataset properties, setting the stage for further preprocessing and model application.
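To close the loop, the generated data can feed directly into one of the classifiers mentioned earlier. Here is a minimal sketch using LogisticRegression; the model choice and the 80/20 train/test split are illustrative, not part of the example above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same dataset as in the example above
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=2, random_state=42)

# Illustrative 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a simple baseline model and report held-out accuracy
model = LogisticRegression()
model.fit(X_train, y_train)
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

Because only 2 of the 20 features are informative, this also makes a handy testbed for comparing models or feature-selection methods on the same controlled data.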