LeaveOneGroupOut (LOGO) is a cross-validation strategy used in scikit-learn for evaluating models on grouped data.
In each fold, all samples from one group form the test set and the remaining groups form the training set, which prevents data leakage between groups and tests the model’s ability to generalize to unseen groups.
This strategy is particularly useful for datasets with natural groupings, such as repeated measurements from the same subject or multiple samples from the same location.
The LeaveOneGroupOut strategy has no hyperparameters; it relies solely on the group labels to create the splits. It is suitable for both classification and regression tasks.
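To see exactly how LeaveOneGroupOut partitions data, it helps to iterate over the splits directly on a tiny dataset. The sketch below (a toy six-sample dataset, not part of the original example) shows that each fold holds out exactly one group:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# toy dataset: six samples in three groups of two
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

logo = LeaveOneGroupOut()
# one split per distinct group
print('number of splits:', logo.get_n_splits(groups=groups))
for train_idx, test_idx in logo.split(X, y, groups):
    print('train groups:', np.unique(groups[train_idx]),
          'test group:', np.unique(groups[test_idx]))
```

Each iteration yields a test set drawn entirely from a single group, with all other groups used for training.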
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np
# generate a synthetic dataset with group labels
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
groups = np.array([i // 10 for i in range(100)]) # 10 groups, each containing 10 samples
# define the LeaveOneGroupOut cross-validator
logo = LeaveOneGroupOut()
# create a LogisticRegression model
model = LogisticRegression()
# evaluate the model using cross-validation with LOGO
scores = cross_val_score(model, X, y, groups=groups, cv=logo)
print('Cross-validation scores:', scores)
print('Mean cross-validation score:', np.mean(scores))
Running the example gives an output like:
Cross-validation scores: [1. 0.9 1. 1. 0.8 1. 1. 0.9 0.9 1. ]
Mean cross-validation score: 0.95
The steps are as follows:
First, a synthetic dataset is generated using the make_classification() function. This creates a dataset with 100 samples and 5 features. The samples are divided into 10 groups of 10 samples each, using an array of group labels.

Next, the LeaveOneGroupOut cross-validator is defined as logo.

A LogisticRegression model is instantiated with default hyperparameters.

Finally, the model is evaluated using cross_val_score() with the LeaveOneGroupOut cross-validator. The accuracy score for each of the 10 folds (one per held-out group) is printed, followed by the mean cross-validation score.
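The steps above can also be performed manually by looping over the splits, which makes the held-out group in each fold explicit. This is a sketch equivalent to the cross_val_score() call, using the same synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

# same synthetic dataset and grouping as the example above
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
groups = np.array([i // 10 for i in range(100)])

logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    held_out = groups[test_idx][0]  # the single group left out in this fold
    score = model.score(X[test_idx], y[test_idx])  # accuracy on the held-out group
    print(f'group {held_out} held out: accuracy {score:.2f}')
    scores.append(score)
print('Mean accuracy:', np.mean(scores))
```

Because the splits and the default scoring (accuracy) are identical, this loop produces the same per-fold scores as cross_val_score().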
This example demonstrates the use of the LeaveOneGroupOut cross-validator to evaluate a model on grouped data, ensuring the model’s robustness across different groups. The LeaveOneGroupOut strategy is especially useful when data is naturally grouped and leakage between training and testing sets must be prevented.
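Since LeaveOneGroupOut is not limited to classification, the same pattern carries over to regression with no changes to the cross-validation setup. A minimal sketch using make_regression and a Ridge model (both chosen here for illustration, not part of the original example):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# synthetic regression dataset with the same 10-group layout
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=1)
groups = np.array([i // 10 for i in range(100)])  # 10 groups of 10 samples

# for a regressor, cross_val_score reports R^2 by default
scores = cross_val_score(Ridge(), X, y, groups=groups, cv=LeaveOneGroupOut())
print('R^2 per held-out group:', scores)
print('Mean R^2:', np.mean(scores))
```

Only the estimator and the default scoring metric change; the group-based splitting behaves exactly as in the classification case.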