LeavePOut cross-validation creates training/test splits in which p samples are left out to form the test set in each iteration. Because it exhaustively tries every possible combination of p samples, it is useful for evaluating model performance on small datasets. Its key hyperparameter is p (the number of samples to leave out). LeavePOut is appropriate for all types of predictive modeling tasks (classification, regression).
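To make the exhaustive behavior concrete: for n samples, LeavePOut generates every possible combination of p test samples, i.e. "n choose p" splits. A minimal sketch (the array here is a stand-in for real data, not part of the worked example below):

from math import comb
from sklearn.model_selection import LeavePOut
import numpy as np

X = np.arange(10).reshape(10, 1)  # 10 samples, 1 feature
lpo = LeavePOut(p=2)
print(lpo.get_n_splits(X))  # 45
print(comb(10, 2))          # "10 choose 2" = 45, the same count

This is why the 10-sample worked example below produces 45 accuracy scores with p=2.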
from sklearn.model_selection import LeavePOut
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
# generate a small synthetic dataset
X, y = make_classification(n_samples=10, n_features=5, n_classes=2, random_state=1)
# create LeavePOut cross-validator
lpocv = LeavePOut(p=2)
# create model
model = LogisticRegression()
# track scores for each iteration
scores = []
# perform cross-validation
for train_index, test_index in lpocv.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fit on the training split and evaluate on the two held-out samples
    model.fit(X_train, y_train)
    yhat = model.predict(X_test)
    acc = accuracy_score(y_test, yhat)
    scores.append(acc)
# summarize performance
print('Accuracy scores: %s' % np.array(scores))
print('Mean accuracy: %.3f' % np.mean(scores))
Running the example gives an output like:
Accuracy scores: [1. 1. 1. 1. 1. 1. 1. 0.5 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
0. 1. 0.5 1. 1. 1. 1. 1. 1. ]
Mean accuracy: 0.956
The steps are as follows:

1. First, a small synthetic binary classification dataset is generated using the make_classification() function. This creates a dataset with a specified number of samples (n_samples), features (n_features), and classes (n_classes), and a fixed random seed (random_state) for reproducibility.
2. The LeavePOut cross-validator is created with p=2, meaning that in each iteration two samples are left out to form the test set.
3. A LogisticRegression model is instantiated with default hyperparameters.
4. For each train/test split generated by LeavePOut, the model is fit on the training data and evaluated on the test data using the accuracy score metric. The accuracy score for each iteration is stored. (A more compact way to write this loop is shown after this list.)
5. The performance of the model is summarized by printing the accuracy scores for each iteration and the mean accuracy across all iterations.
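As referenced in step 4, the manual loop can be written more compactly: cross_val_score accepts a cross-validator object as its cv argument. A minimal sketch using the same data and model as the worked example:

from sklearn.model_selection import LeavePOut, cross_val_score
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# same setup as the worked example above
X, y = make_classification(n_samples=10, n_features=5, n_classes=2, random_state=1)
model = LogisticRegression()

# cross_val_score runs the fit/predict/score loop internally,
# one fit per LeavePOut split
scores = cross_val_score(model, X, y, scoring='accuracy', cv=LeavePOut(p=2))
print('Mean accuracy: %.3f' % scores.mean())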
This example demonstrates how to use LeavePOut for cross-validation with small datasets, highlighting its exhaustive approach to splitting the data and evaluating model performance.
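One caveat worth keeping in mind: the number of splits, and therefore the number of model fits, grows combinatorially with the number of samples, which is why LeavePOut is generally reserved for small datasets. A quick sketch of the scaling for p=2 (the sample sizes are illustrative):

from math import comb

# number of model fits LeavePOut(p=2) requires as the dataset grows
for n in (10, 50, 100, 500):
    print('n=%4d samples -> %6d splits' % (n, comb(n, 2)))

For p=2, the count is n*(n-1)/2, so even 500 samples would already require 124,750 model fits.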