
Scikit-Learn LeavePGroupsOut Data Splitting

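LeavePGroupsOut is a cross-validation splitter for grouped data. In each split it holds out every sample from P groups as the test set and trains on the remaining groups, so no group contributes to both training and testing. This is particularly useful when samples within a group are correlated (for example, repeated measurements from the same subject), so the data as a whole cannot be treated as independent and identically distributed.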

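The key parameter of LeavePGroupsOut is n_groups, the number of groups (P) held out as the test set in each split. The splitter enumerates every combination of n_groups distinct groups, so the number of splits grows combinatorially with the number of groups.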
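As a quick check of what n_groups controls, the following minimal sketch counts and lists the splits on a toy dataset. The arrays and variable names here are illustrative assumptions, not part of the main example; with G distinct groups and n_groups=p, the count should equal the number of ways to choose p groups out of G.

```python
from math import comb

import numpy as np
from sklearn.model_selection import LeavePGroupsOut

# toy data (illustrative only): 6 samples spread over 3 groups
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2])

lpgo = LeavePGroupsOut(n_groups=2)

# the group labels are needed to count the combinations
n_splits = lpgo.get_n_splits(X, y, groups=groups)
print('Number of splits:', n_splits, '| C(3, 2) =', comb(3, 2))

# list which groups land in the training and test sets of each split
split_groups = []
for train_idx, test_idx in lpgo.split(X, y, groups=groups):
    train_groups = np.unique(groups[train_idx])
    test_groups = np.unique(groups[test_idx])
    split_groups.append((train_groups.tolist(), test_groups.tolist()))
    print('train groups:', train_groups, 'test groups:', test_groups)
```

Because the number of splits is a binomial coefficient, it is worth checking get_n_splits before running an expensive model on data with many groups.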

The splitter is appropriate for any supervised predictive modeling task (classification or regression) in which each sample carries a group label.
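For the regression case, a minimal sketch might look like the following. The choice of make_regression, Ridge, and the neg_mean_absolute_error scorer is an assumption made for illustration; any regressor and scorer supported by cross_val_score could be substituted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeavePGroupsOut, cross_val_score

# synthetic regression data (illustrative): 60 samples in 6 groups of 10
X, y = make_regression(n_samples=60, n_features=4, noise=0.1, random_state=0)
groups = np.arange(60) // 10

# hold out 2 of the 6 groups in each split
lpgo = LeavePGroupsOut(n_groups=2)

# scores are negated errors, so values closer to 0 are better
scores = cross_val_score(Ridge(), X, y, cv=lpgo, groups=groups,
                         scoring='neg_mean_absolute_error')
print('Number of splits:', len(scores))
print('Mean MAE:', -scores.mean())
```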

from sklearn.datasets import make_classification
from sklearn.model_selection import LeavePGroupsOut, cross_val_score
from sklearn.linear_model import LogisticRegression

# generate synthetic dataset
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=1)
groups = [i // 10 for i in range(100)]  # creating 10 groups

# define the LeavePGroupsOut cross-validator
lpgo = LeavePGroupsOut(n_groups=2)

# create model
model = LogisticRegression()

# evaluate model
scores = cross_val_score(model, X, y, cv=lpgo, groups=groups)
print('Cross-Validation Scores:', scores)
print('Mean Score:', scores.mean())

Running the example gives an output like:

Cross-Validation Scores: [0.95 1.   1.   0.9  1.   1.   0.95 0.95 1.   0.95 0.95 0.85 0.95 0.95
 0.9  0.9  0.95 1.   0.9  1.   1.   0.95 0.95 1.   0.9  1.   1.   0.95
 0.95 1.   0.9  0.9  0.85 0.85 0.9  1.   0.95 0.95 1.   0.95 0.95 1.
 0.9  0.95 0.95]
Mean Score: 0.9500000000000002

The steps are as follows:

  1. First, a synthetic binary classification dataset is generated using make_classification() with 100 samples and 5 features. Each sample is then assigned to one of 10 groups, with 10 consecutive samples per group.

  2. Next, the LeavePGroupsOut cross-validator is defined with n_groups set to 2. This specifies that two groups are held out as the test set in each split. With 10 groups there are 45 distinct pairs of groups, so the cross-validator produces 45 splits.

  3. A LogisticRegression model is instantiated with default hyperparameters.

  4. The cross_val_score function is used to evaluate the model, passing the LeavePGroupsOut object as cv and the group labels as groups. The score for each of the 45 splits is printed, along with the mean score.
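The steps above can also be written as an explicit loop over the splits, which makes it easy to confirm that no group appears on both sides of a split. This is a sketch using the same synthetic data as the example; the manual_scores name and the in-loop assertion are illustrative additions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePGroupsOut

# same synthetic data as the main example: 100 samples in 10 groups of 10
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=1)
groups = np.arange(100) // 10

lpgo = LeavePGroupsOut(n_groups=2)

manual_scores = []
for train_idx, test_idx in lpgo.split(X, y, groups=groups):
    # no group may contribute samples to both training and testing
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

    # fit a fresh model on 8 groups, score accuracy on the 2 held-out groups
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

print('Number of splits:', len(manual_scores))
print('Mean Score:', np.mean(manual_scores))
```

In practice cross_val_score is more convenient, but the explicit loop is useful when per-split diagnostics are needed, such as identifying which held-out groups are hardest to predict.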

This example demonstrates how to use LeavePGroupsOut for cross-validation with grouped data. Because entire groups are held out of training in every split, samples from the same group never appear in both the training and test sets. For grouped or clustered data this avoids data leakage and gives a more realistic estimate of how the model will perform on unseen groups.


