
Scikit-Learn LeaveOneGroupOut Data Splitting

LeaveOneGroupOut (LOGO) is a cross-validation strategy used in scikit-learn for evaluating models on grouped data.

It ensures that the training and testing sets never share samples from the same group: each fold holds out one group for testing and trains on all remaining groups, which helps avoid data leakage and tests the model's ability to generalize to unseen groups.

This strategy is particularly useful for datasets with natural groupings, such as repeated measurements from the same subject or multiple samples from the same location.

The LeaveOneGroupOut strategy does not have hyperparameters but relies on the grouping of data to create splits. It is suitable for both classification and regression tasks.
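To see how the splits are formed, here is a minimal sketch on a toy dataset with three groups (the data and group labels are made up for illustration); each fold uses exactly one group as the test set:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# toy data: 6 samples, 2 features, 3 groups of 2 samples each
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])

logo = LeaveOneGroupOut()
# the number of folds equals the number of unique groups
print(logo.get_n_splits(groups=groups))  # 3

for train_idx, test_idx in logo.split(X, y, groups):
    # each test fold contains the samples of exactly one group
    print('train:', train_idx, 'test:', test_idx)
```

Note that `get_n_splits()` needs the `groups` argument, since the number of folds is determined entirely by how many distinct groups the data contains.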

from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

# generate a synthetic dataset with group labels
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
groups = np.array([i // 10 for i in range(100)])  # 10 groups, each containing 10 samples

# define the LeaveOneGroupOut cross-validator
logo = LeaveOneGroupOut()

# create a LogisticRegression model
model = LogisticRegression()

# evaluate the model using cross-validation with LOGO
scores = cross_val_score(model, X, y, groups=groups, cv=logo)
print('Cross-validation scores:', scores)
print('Mean cross-validation score:', np.mean(scores))

Running the example gives an output like:

Cross-validation scores: [1.  0.9 1.  1.  0.8 1.  1.  0.9 0.9 1. ]
Mean cross-validation score: 0.95

The steps are as follows:

  1. First, a synthetic dataset is generated using the make_classification() function. This creates a dataset with 100 samples and 5 features. The samples are grouped into 10 groups, each containing 10 samples, using an array of group labels.

  2. Next, the LeaveOneGroupOut cross-validator is defined as logo.

  3. A LogisticRegression model is instantiated with default hyperparameters.

  4. The model is evaluated using cross_val_score() with the LeaveOneGroupOut cross-validator. The cross-validation scores for each group are printed, and the mean cross-validation score is calculated and printed.
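The group-wise separation described in the steps above can be checked directly: a short sketch (reusing the same synthetic dataset) that iterates over the splits and asserts that each test fold holds out exactly one group, which never appears in training:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut

# same synthetic dataset and grouping as in the example above
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
groups = np.array([i // 10 for i in range(100)])

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    held_out = set(groups[test_idx])
    assert len(held_out) == 1                      # exactly one group in the test set
    assert held_out.isdisjoint(groups[train_idx])  # it never leaks into training
print('All', logo.get_n_splits(groups=groups), 'folds hold out exactly one group')
```

This is why LOGO produces ten scores here: one per held-out group.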

This example demonstrates the use of the LeaveOneGroupOut cross-validator to evaluate a model on grouped data, ensuring the model’s robustness across different groups. The LeaveOneGroupOut strategy is especially useful in scenarios where data is naturally grouped, and we need to prevent data leakage between training and testing sets.
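Since LeaveOneGroupOut also applies to regression, here is a hedged sketch of the same pattern with a regression task; the dataset, group sizes, and choice of Ridge are illustrative assumptions, not part of the example above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# synthetic regression data with 5 groups of 20 samples (illustrative setup)
X, y = make_regression(n_samples=100, n_features=5, noise=0.5, random_state=1)
groups = np.array([i // 20 for i in range(100)])

# one R^2 score per held-out group
scores = cross_val_score(Ridge(), X, y, groups=groups,
                         cv=LeaveOneGroupOut(), scoring='r2')
print('R^2 per held-out group:', scores)
print('Mean R^2:', np.mean(scores))
```

As with the classification example, the number of scores equals the number of groups, and each score reflects performance on a group the model never saw during training.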



See Also