The make_multilabel_classification() function in scikit-learn generates a random multi-label classification problem. The dataset is synthetic and is typically used to build test cases for multi-label classification algorithms. Key function arguments include n_samples for the number of samples, n_features for the number of features, and n_classes for the number of classes (labels), where each sample may belong to several classes at once. This example demonstrates how to generate a multi-label classification dataset and inspect its properties.
from sklearn.datasets import make_multilabel_classification
import pandas as pd
# Generate the dataset
X, y = make_multilabel_classification(n_samples=100, n_features=20, n_classes=5, random_state=42)
# Convert to DataFrame for easier inspection
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y_df = pd.DataFrame(y, columns=[f'class_{i}' for i in range(y.shape[1])])
# Display dataset shape and types
print(f"Input shape: {X_df.shape}")
print(f"Output shape: {y_df.shape}")
print(f"Feature types:\n{X_df.dtypes}")
# Show summary statistics
print(f"Summary statistics:\n{X_df.describe()}")
# Display first few rows of the input dataset
print(f"First few rows of the input dataset:\n{X_df.head()}")
# Display first few rows of the output dataset
print(f"First few rows of the output dataset:\n{y_df.head()}")
Running the example gives an output like:
Input shape: (100, 20)
Output shape: (100, 5)
Feature types:
feature_0 float64
feature_1 float64
feature_2 float64
feature_3 float64
feature_4 float64
feature_5 float64
feature_6 float64
feature_7 float64
feature_8 float64
feature_9 float64
feature_10 float64
feature_11 float64
feature_12 float64
feature_13 float64
feature_14 float64
feature_15 float64
feature_16 float64
feature_17 float64
feature_18 float64
feature_19 float64
dtype: object
Summary statistics:
feature_0 feature_1 feature_2 ... feature_17 feature_18 feature_19
count 100.000000 100.000000 100.000000 ... 100.000000 100.00000 100.000000
mean 2.280000 3.290000 2.090000 ... 3.430000 2.13000 2.510000
std 1.907349 2.070939 1.303414 ... 2.090165 1.61217 1.690705
min 0.000000 0.000000 0.000000 ... 0.000000 0.00000 0.000000
25% 1.000000 2.000000 1.000000 ... 2.000000 1.00000 1.000000
50% 2.000000 3.000000 2.000000 ... 3.000000 2.00000 2.000000
75% 3.000000 5.000000 3.000000 ... 5.000000 3.00000 4.000000
max 9.000000 9.000000 7.000000 ... 9.000000 6.00000 7.000000
[8 rows x 20 columns]
First few rows of the input dataset:
feature_0 feature_1 feature_2 ... feature_17 feature_18 feature_19
0 3.0 0.0 2.0 ... 5.0 0.0 2.0
1 3.0 5.0 2.0 ... 6.0 2.0 0.0
2 3.0 2.0 3.0 ... 6.0 3.0 3.0
3 1.0 0.0 1.0 ... 0.0 3.0 0.0
4 3.0 6.0 2.0 ... 5.0 4.0 1.0
[5 rows x 20 columns]
First few rows of the output dataset:
class_0 class_1 class_2 class_3 class_4
0 0 0 0 1 0
1 1 1 1 0 0
2 0 0 1 1 0
3 1 0 0 0 0
4 1 0 1 0 0
The steps are as follows:
Import the make_multilabel_classification function from sklearn.datasets and pandas for data manipulation. make_multilabel_classification generates a random multi-label classification problem.
Generate the dataset using make_multilabel_classification():
- Use parameters like n_samples=100, n_features=20, and n_classes=5 to create a synthetic dataset.
Convert the generated arrays to pandas DataFrames for easier inspection and manipulation.
- Create X_df for features and y_df for labels.
Print the dataset shape and feature types:
- Access the shape using X_df.shape and y_df.shape.
- Show the data types of the features using X_df.dtypes.
Display summary statistics:
- Use X_df.describe() to get a statistical summary of the features.
Display the first few rows of the input and output datasets:
- Print the initial rows using X_df.head() and y_df.head() to understand the structure and content.
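The steps above inspect the features in detail but only print the first rows of the labels. As a complement, here is a minimal sketch (not part of the original example) that summarizes the label matrix and then regenerates the data with two further arguments of make_multilabel_classification(), n_labels and allow_unlabeled. The variable names labels_per_sample, label_frequency, X_strict, and y_strict are illustrative choices, and n_labels=3 is an arbitrary value picked for demonstration.

```python
from sklearn.datasets import make_multilabel_classification

# Same call as in the main example
X, y = make_multilabel_classification(
    n_samples=100, n_features=20, n_classes=5, random_state=42
)

# y is a binary indicator matrix: one row per sample, one column per class
labels_per_sample = y.sum(axis=1)   # number of classes assigned to each sample
label_frequency = y.mean(axis=0)    # fraction of samples carrying each class

print(f"Average labels per sample: {labels_per_sample.mean():.2f}")
print(f"Samples with no labels: {(labels_per_sample == 0).sum()}")
print(f"Fraction of samples per class: {label_frequency}")

# Variant (assumed settings for illustration): target about 3 labels per
# sample on average and forbid samples without any label
X_strict, y_strict = make_multilabel_classification(
    n_samples=100,
    n_features=20,
    n_classes=5,
    n_labels=3,
    allow_unlabeled=False,
    random_state=42,
)
print(f"Minimum labels per sample (allow_unlabeled=False): {y_strict.sum(axis=1).min()}")
```

With the default allow_unlabeled=True, some samples may have no labels at all; setting it to False is one way to avoid that when an algorithm under test cannot handle empty label sets.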
This example demonstrates how to use make_multilabel_classification() to create and inspect a synthetic multi-label classification dataset, providing a foundation for developing and testing multi-label classification algorithms.
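To illustrate how such a dataset might serve as a test case, the following sketch fits a classifier on the generated data and scores its predictions. The choice of KNeighborsClassifier, the 80/20 split, and the Hamming loss metric are assumptions made for illustration; any estimator that accepts a two-dimensional label matrix could be substituted.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Generate the same synthetic dataset as in the main example
X, y = make_multilabel_classification(
    n_samples=100, n_features=20, n_classes=5, random_state=42
)

# Hold out 20% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# KNeighborsClassifier accepts a 2D label matrix directly
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predictions have one column per class, like y itself
y_pred = model.predict(X_test)

# Hamming loss: fraction of individual label assignments that are wrong
loss = hamming_loss(y_test, y_pred)
print(f"Predictions shape: {y_pred.shape}")
print(f"Hamming loss: {loss:.3f}")
```

Hamming loss is used here because it scores each label independently, which tends to be more informative on multi-label data than exact-match accuracy.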