OutputCodeClassifier
is a multiclass meta-estimator from sklearn.multiclass that implements the error-correcting output codes (ECOC) strategy. It transforms a multi-class problem into multiple binary classification problems using a coding matrix (the code book): each class is assigned a binary codeword, and each column of the matrix defines one binary problem (the learned code book can be inspected after fitting, as shown after the worked example below).
This makes it possible to tackle complex multi-class tasks with a collection of simpler binary classifiers.
The key hyperparameters include code_size (the ratio of the code length to the number of classes; roughly n_classes * code_size binary classifiers are trained), estimator (the base binary estimator), and random_state (controls the random generation of the code book).
OutputCodeClassifier is suitable for multi-class classification problems. The complete example is listed below.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# generate multi-class classification dataset
X, y = make_classification(n_samples=100, n_clusters_per_class=1, n_features=20, n_classes=3, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# create model
base_estimator = LogisticRegression()
model = OutputCodeClassifier(estimator=base_estimator, code_size=2, random_state=1)
# fit model
model.fit(X_train, y_train)
# evaluate model
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
# make a prediction
row = [[-0.5, 0.5, -0.3, -0.1, 0.2, -1.2, 0.4, 1.0, -0.6, 0.7, -0.1, 0.3, -0.4, 0.6, -0.8, 1.1, -1.0, 0.9, -0.7, 0.5]]
yhat = model.predict(row)
print('Predicted: %d' % yhat[0])
Running the example gives an output like:
Accuracy: 0.950
Predicted: 0
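To make the coding matrix concrete, the fitted model can be inspected. The snippet below is a minimal sketch that continues from the fitted model above; it relies on the fitted attributes code_book_ (the coding matrix) and estimators_ (the list of trained binary classifiers) of OutputCodeClassifier.

# inspect the learned coding matrix (continuing from the fitted model above)
# each row is the binary codeword assigned to one class,
# each column corresponds to one trained binary classifier
print(model.code_book_)
print('Code book shape:', model.code_book_.shape)
print('Number of binary classifiers:', len(model.estimators_))

With 3 classes and a code_size of 2, the code book has 3 rows and 6 columns, so six binary classifiers are trained. At prediction time, the class whose codeword is closest to the combined binary outputs is returned.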
The steps are as follows:
First, a synthetic multi-class classification dataset is generated using the make_classification() function. This creates a dataset with 100 samples, 20 features, 3 classes, and a fixed random seed (random_state) for reproducibility. The dataset is split into training and test sets using train_test_split().
Next, a LogisticRegression model is instantiated as the base estimator. An OutputCodeClassifier model is created with the logistic regression estimator, a code_size of 2 (with 3 classes, this yields 2 x 3 = 6 binary classifiers), and a random_state for reproducibility. The model is then fit on the training data using the fit() method.
The performance of the model is evaluated by comparing the predictions (yhat) to the actual values (y_test) using the accuracy score metric.
A single prediction can be made by passing a new data sample to the predict() method.
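Because code_size controls the trade-off between the number of binary classifiers and the error-correcting capacity of the code, it can be useful to compare several values. The snippet below is an illustrative sketch, assuming the X, y, OutputCodeClassifier, and LogisticRegression objects from the example above are in scope; the particular code_size values are chosen only for demonstration.

# compare different code_size values with 5-fold cross-validation (illustrative sketch)
from sklearn.model_selection import cross_val_score
for size in [1.0, 1.5, 2.0]:
    clf = OutputCodeClassifier(estimator=LogisticRegression(), code_size=size, random_state=1)
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print('code_size=%.1f mean accuracy: %.3f' % (size, scores.mean()))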
This example demonstrates how to use OutputCodeClassifier
for multi-class classification tasks, showing how the error-correcting output codes strategy in scikit-learn builds a multi-class model from a simple binary base estimator.