OutputCodeClassifier
is a multiclass meta-estimator from sklearn.multiclass that implements the error-correcting output codes (ECOC) strategy. It transforms a multi-class problem into multiple binary classification problems using a coding matrix (the code book): each class is assigned a binary codeword, and each column of the matrix defines one binary problem (the learned code book can be inspected after fitting, as shown after the worked example below).
This makes it possible to tackle complex multi-class tasks with a collection of simpler binary classifiers.
The key hyperparameters include code_size (the ratio of the code length to the number of classes; roughly n_classes * code_size binary classifiers are trained), estimator (the base binary estimator), and random_state (controls the random generation of the code book).
OutputCodeClassifier is suitable for multi-class classification problems. The complete example is listed below.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OutputCodeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# generate multi-class classification dataset
X, y = make_classification(n_samples=100, n_clusters_per_class=1, n_features=20, n_classes=3, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# create model
base_estimator = LogisticRegression()
model = OutputCodeClassifier(estimator=base_estimator, code_size=2, random_state=1)
# fit model
model.fit(X_train, y_train)
# evaluate model
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
# make a prediction
row = [[-0.5, 0.5, -0.3, -0.1, 0.2, -1.2, 0.4, 1.0, -0.6, 0.7, -0.1, 0.3, -0.4, 0.6, -0.8, 1.1, -1.0, 0.9, -0.7, 0.5]]
yhat = model.predict(row)
print('Predicted: %d' % yhat[0])
Running the example gives an output like:
Accuracy: 0.950
Predicted: 0
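To make the coding matrix concrete, the fitted model can be inspected. The snippet below is a minimal sketch that continues from the fitted model above; it relies on the fitted attributes code_book_ (the coding matrix) and estimators_ (the list of trained binary classifiers) of OutputCodeClassifier.

# inspect the learned coding matrix (continuing from the fitted model above)
# each row is the binary codeword assigned to one class,
# each column corresponds to one trained binary classifier
print(model.code_book_)
print('Code book shape:', model.code_book_.shape)
print('Number of binary classifiers:', len(model.estimators_))

With 3 classes and a code_size of 2, the code book has 3 rows and 6 columns, so six binary classifiers are trained. At prediction time, the class whose codeword is closest to the combined binary outputs is returned.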
The steps are as follows:
First, a synthetic multi-class classification dataset is generated using the make_classification() function. This creates a dataset with 100 samples, 20 features, 3 classes, and a fixed random seed (random_state) for reproducibility. The dataset is split into training and test sets using train_test_split().
Next, a LogisticRegression model is instantiated as the base estimator. An OutputCodeClassifier model is created with the logistic regression estimator, a code_size of 2 (with 3 classes, this yields 2 x 3 = 6 binary classifiers), and a random_state for reproducibility. The model is then fit on the training data using the fit() method.
The performance of the model is evaluated by comparing the predictions (yhat) to the actual values (y_test) using the accuracy score metric.
A single prediction can be made by passing a new data sample to the predict() method.
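Because code_size controls the trade-off between the number of binary classifiers and the error-correcting capacity of the code, it can be useful to compare several values. The snippet below is an illustrative sketch, assuming the X, y, OutputCodeClassifier, and LogisticRegression objects from the example above are in scope; the particular code_size values are chosen only for demonstration.

# compare different code_size values with 5-fold cross-validation (illustrative sketch)
from sklearn.model_selection import cross_val_score
for size in [1.0, 1.5, 2.0]:
    clf = OutputCodeClassifier(estimator=LogisticRegression(), code_size=size, random_state=1)
    scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
    print('code_size=%.1f mean accuracy: %.3f' % (size, scores.mean()))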
This example demonstrates how to use OutputCodeClassifier
for multi-class classification tasks, showing how the error-correcting output codes strategy in scikit-learn builds a multi-class model from a simple binary base estimator.