Scikit-Learn LeavePOut Data Splitting

LeavePOut is an exhaustive cross-validation strategy: each iteration holds out a different set of p samples as the test set and trains on the remaining samples, until every possible combination of p samples has been used. For a dataset of n samples this produces C(n, p) splits, so the method is practical mainly for small datasets, where the number of combinations stays manageable.
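To make the exhaustive behaviour concrete, the short sketch below counts the splits on a toy array and compares the count with the binomial coefficient. The 6-sample array and the variable names are placeholders chosen for illustration only, not part of the main example.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePOut

# toy data: 6 samples, 1 feature (values are arbitrary placeholders)
X_toy = np.arange(6).reshape(-1, 1)

lpo = LeavePOut(p=2)

# get_n_splits needs the data so it can count the samples
n_splits = lpo.get_n_splits(X_toy)
expected = comb(len(X_toy), 2)
print('Splits reported: %d, C(6, 2): %d' % (n_splits, expected))

# materialize the splits to inspect the test indices
splits = list(lpo.split(X_toy))
first_train, first_test = splits[0]
print('First test indices: %s' % first_test)
```

Because the number of splits grows combinatorially with both the dataset size and p, it is worth checking get_n_splits() before starting a long evaluation.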

The only hyperparameter is p, the number of samples held out for the test set in each iteration. LeavePOut can be used for both classification and regression tasks.
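As a quick illustration of how p changes the splits, the sketch below checks that p=1 produces the same test sets as LeaveOneOut, and prints how the split count changes for larger p. The 5-sample array is a placeholder used only for this check.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeavePOut

# toy data: 5 samples, 1 feature (placeholder values)
X_toy = np.arange(5).reshape(-1, 1)

# collect the test indices produced by each splitter
lpo_tests = [test.tolist() for _, test in LeavePOut(p=1).split(X_toy)]
loo_tests = [test.tolist() for _, test in LeaveOneOut().split(X_toy)]
same = lpo_tests == loo_tests
print('p=1 matches LeaveOneOut: %s' % same)

# number of splits for a few values of p on the same data
counts = {p: LeavePOut(p=p).get_n_splits(X_toy) for p in (1, 2, 3)}
print('Split counts by p: %s' % counts)
```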

from sklearn.model_selection import LeavePOut
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# generate a small synthetic dataset
X, y = make_classification(n_samples=10, n_features=5, n_classes=2, random_state=1)

# create LeavePOut cross-validator
lpocv = LeavePOut(p=2)

# create model
model = LogisticRegression()

# track scores for each iteration
scores = []

# perform cross-validation
for train_index, test_index in lpocv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    yhat = model.predict(X_test)
    acc = accuracy_score(y_test, yhat)
    scores.append(acc)

# summarize performance
print('Accuracy scores: %s' % np.array(scores))
print('Mean accuracy: %.3f' % np.mean(scores))

Running the example gives an output like:

Accuracy scores: [1.  1.  1.  1.  1.  1.  1.  0.5 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
 0.  1.  0.5 1.  1.  1.  1.  1.  1. ]
Mean accuracy: 0.956

The steps are as follows:

First, a small synthetic binary classification dataset is generated using the make_classification() function. This creates a dataset with a specified number of samples (n_samples), features (n_features), and classes (n_classes), and a fixed random seed (random_state) for reproducibility.
The LeavePOut cross-validator is created with p=2, meaning that in each iteration, two samples are left out to form the test set. With 10 samples, this yields C(10, 2) = 45 splits, which matches the 45 accuracy scores in the output.
A LogisticRegression model is instantiated with default hyperparameters.
For each train/test split generated by LeavePOut, the model is fit on the training data and evaluated on the test data using the accuracy score metric. The accuracy scores for each iteration are stored.
The performance of the model is summarized by printing the accuracy scores for each iteration and the mean accuracy across all iterations.
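The manual loop described in the steps above can also be written more compactly by passing the LeavePOut object to cross_val_score(). This is a minimal sketch of an alternative, not a replacement for the example; it assumes the same dataset settings and default LogisticRegression model, and the per-split scores should correspond to those of the explicit loop.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut, cross_val_score

# same small synthetic dataset as in the example
X, y = make_classification(n_samples=10, n_features=5, n_classes=2, random_state=1)

# cross_val_score clones and refits the model on every train/test split
scores = cross_val_score(
    LogisticRegression(),
    X,
    y,
    scoring='accuracy',
    cv=LeavePOut(p=2),
)

# summarize performance
print('Number of splits: %d' % len(scores))
print('Mean accuracy: %.3f' % scores.mean())
```

The explicit loop remains useful when per-split predictions or custom bookkeeping are needed, while cross_val_score() keeps the common case short.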

This example demonstrates how to use LeavePOut for cross-validation with small datasets, highlighting its exhaustive approach to splitting the data and evaluating model performance.

See Also