Scikit-Learn make_multilabel_classification() Dataset

The make_multilabel_classification() function in scikit-learn generates a random multi-label classification problem, in which each sample can belong to several classes at once. The dataset is synthetic and is typically used to build test cases for multi-label classification algorithms. Key arguments include n_samples (the number of samples), n_features (the number of features), and n_classes (the number of classes, that is, the number of distinct labels). This example generates a multi-label classification dataset and inspects its properties.

from sklearn.datasets import make_multilabel_classification
import pandas as pd

# Generate the dataset
X, y = make_multilabel_classification(n_samples=100, n_features=20, n_classes=5, random_state=42)

# Convert to DataFrame for easier inspection
X_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
y_df = pd.DataFrame(y, columns=[f'class_{i}' for i in range(y.shape[1])])

# Display dataset shape and types
print(f"Input shape: {X_df.shape}")
print(f"Output shape: {y_df.shape}")
print(f"Feature types:\n{X_df.dtypes}")

# Show summary statistics
print(f"Summary statistics:\n{X_df.describe()}")

# Display first few rows of the input dataset
print(f"First few rows of the input dataset:\n{X_df.head()}")

# Display first few rows of the output dataset
print(f"First few rows of the output dataset:\n{y_df.head()}")

Running the example gives an output like:

Input shape: (100, 20)
Output shape: (100, 5)
Feature types:
feature_0     float64
feature_1     float64
feature_2     float64
feature_3     float64
feature_4     float64
feature_5     float64
feature_6     float64
feature_7     float64
feature_8     float64
feature_9     float64
feature_10    float64
feature_11    float64
feature_12    float64
feature_13    float64
feature_14    float64
feature_15    float64
feature_16    float64
feature_17    float64
feature_18    float64
feature_19    float64
dtype: object
Summary statistics:
        feature_0   feature_1   feature_2  ...  feature_17  feature_18  feature_19
count  100.000000  100.000000  100.000000  ...  100.000000   100.00000  100.000000
mean     2.280000    3.290000    2.090000  ...    3.430000     2.13000    2.510000
std      1.907349    2.070939    1.303414  ...    2.090165     1.61217    1.690705
min      0.000000    0.000000    0.000000  ...    0.000000     0.00000    0.000000
25%      1.000000    2.000000    1.000000  ...    2.000000     1.00000    1.000000
50%      2.000000    3.000000    2.000000  ...    3.000000     2.00000    2.000000
75%      3.000000    5.000000    3.000000  ...    5.000000     3.00000    4.000000
max      9.000000    9.000000    7.000000  ...    9.000000     6.00000    7.000000

[8 rows x 20 columns]
First few rows of the input dataset:
   feature_0  feature_1  feature_2  ...  feature_17  feature_18  feature_19
0        3.0        0.0        2.0  ...         5.0         0.0         2.0
1        3.0        5.0        2.0  ...         6.0         2.0         0.0
2        3.0        2.0        3.0  ...         6.0         3.0         3.0
3        1.0        0.0        1.0  ...         0.0         3.0         0.0
4        3.0        6.0        2.0  ...         5.0         4.0         1.0

[5 rows x 20 columns]
First few rows of the output dataset:
   class_0  class_1  class_2  class_3  class_4
0        0        0        0        1        0
1        1        1        1        0        0
2        0        0        1        1        0
3        1        0        0        0        0
4        1        0        1        0        0
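
Because each row of the output is a binary indicator vector, summing along each axis gives a quick picture of how the labels are distributed. The following is a minimal sketch that regenerates the same dataset; the names labels_per_sample and samples_per_class are illustrative and are not part of scikit-learn.

```python
from sklearn.datasets import make_multilabel_classification

# Regenerate the same dataset as in the example above
X, y = make_multilabel_classification(n_samples=100, n_features=20, n_classes=5, random_state=42)

# Number of active labels for each sample (row sums of the indicator matrix)
labels_per_sample = y.sum(axis=1)

# Number of samples that carry each label (column sums)
samples_per_class = y.sum(axis=0)

print(f"Average labels per sample: {labels_per_sample.mean():.2f}")
print(f"Samples per class: {samples_per_class}")
print(f"Samples with no label: {(labels_per_sample == 0).sum()}")
```

With the default allow_unlabeled=True, some samples may carry no label at all, so the last count can be greater than zero.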

The steps are as follows:

1. Import the make_multilabel_classification function from sklearn.datasets and pandas for data manipulation.
   - make_multilabel_classification generates a random multi-label classification problem.
2. Generate the dataset using make_multilabel_classification():
   - Use parameters like n_samples=100, n_features=20, and n_classes=5 to create a synthetic dataset.
3. Convert the generated arrays to pandas DataFrames for easier inspection and manipulation.
   - Create X_df for features and y_df for labels.
4. Print the dataset shape and feature types:
   - Access the shape using X_df.shape and y_df.shape.
   - Show the data types of the features using X_df.dtypes.
5. Display summary statistics:
   - Use X_df.describe() to get a statistical summary of the features.
6. Display the first few rows of the input and output datasets:
   - Print the initial rows using X_df.head() and y_df.head() to understand the structure and content.
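
The generation step above sets only n_samples, n_features, and n_classes. As a hedged sketch of how two further arguments of the same function affect the labels, the code below compares the defaults with a variant that sets n_labels (the average number of labels per sample) and allow_unlabeled=False (every sample receives at least one label). The variable names y_default and y_dense are illustrative.

```python
from sklearn.datasets import make_multilabel_classification

# Baseline: defaults are n_labels=2 and allow_unlabeled=True
_, y_default = make_multilabel_classification(
    n_samples=100, n_features=20, n_classes=5, random_state=42
)

# Variant: target more labels per sample and forbid samples without labels
_, y_dense = make_multilabel_classification(
    n_samples=100,
    n_features=20,
    n_classes=5,
    n_labels=3,
    allow_unlabeled=False,
    random_state=42,
)

print(f"Default: {y_default.sum(axis=1).mean():.2f} labels per sample on average")
print(f"Variant: {y_dense.sum(axis=1).mean():.2f} labels per sample on average")
print(f"Unlabeled samples in variant: {(y_dense.sum(axis=1) == 0).sum()}")
```

The realized average will generally differ from n_labels, since the per-sample label count is drawn at random and capped at n_classes.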

This example demonstrates how to use make_multilabel_classification() to create and inspect a synthetic multi-label classification dataset, providing a foundation for developing and testing multi-label classification algorithms.
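
As one possible next step, the generated data can be fed to a classifier that supports multi-label targets. The sketch below is an illustration rather than a recommended setup: the choice of RandomForestClassifier, the 25% test split, and the two metrics are assumptions made for this example.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, hamming_loss
from sklearn.model_selection import train_test_split

# Generate the same dataset as in the example above
X, y = make_multilabel_classification(n_samples=100, n_features=20, n_classes=5, random_state=42)

# Hold out a quarter of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# RandomForestClassifier accepts a 2D label indicator matrix directly
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Hamming loss: fraction of individual label assignments that are wrong
test_hamming = hamming_loss(y_test, y_pred)

# Subset accuracy: fraction of samples whose full label set is predicted exactly
test_subset_accuracy = accuracy_score(y_test, y_pred)

print(f"Hamming loss: {test_hamming:.3f}")
print(f"Subset accuracy: {test_subset_accuracy:.3f}")
```

Subset accuracy is strict, since a prediction counts only if all five labels match, so it is usually read alongside a per-label measure such as Hamming loss.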

See Also