LeavePGroupsOut is a cross-validation method used when you have grouped data and want to ensure that entire groups are held out of training together. This method is particularly useful when samples within a group are correlated (for example, repeated measurements from the same subject), so the samples cannot be treated as independent and identically distributed.
The key hyperparameter of LeavePGroupsOut is n_groups, which sets the number of groups held out as the test set in each iteration.
The algorithm is appropriate for any type of predictive modeling (classification, regression) where data is grouped.
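To make the effect of n_groups concrete, here is a minimal sketch on toy data. The arrays and the variable name held_out are illustrative assumptions, not part of the original example. With three groups and n_groups=2, every pair of groups is held out once.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

# Toy data (illustrative only): 6 samples in 3 groups
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])

lpgo = LeavePGroupsOut(n_groups=2)

# Record which groups end up in the test set of each split
held_out = []
for train_idx, test_idx in lpgo.split(X, y, groups):
    test_groups = sorted(set(groups[test_idx].tolist()))
    train_groups = sorted(set(groups[train_idx].tolist()))
    held_out.append(test_groups)
    print("train groups:", train_groups, "test groups:", test_groups)
```

Each split trains on the single remaining group and tests on the two held-out groups.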
from sklearn.datasets import make_classification
from sklearn.model_selection import LeavePGroupsOut, cross_val_score
from sklearn.linear_model import LogisticRegression
# generate synthetic dataset
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=1)
groups = [i // 10 for i in range(100)] # creating 10 groups
# define the LeavePGroupsOut cross-validator
lpgo = LeavePGroupsOut(n_groups=2)
# create model
model = LogisticRegression()
# evaluate model
scores = cross_val_score(model, X, y, cv=lpgo, groups=groups)
print('Cross-Validation Scores:', scores)
print('Mean Score:', scores.mean())
Running the example gives an output like:
Cross-Validation Scores: [0.95 1. 1. 0.9 1. 1. 0.95 0.95 1. 0.95 0.95 0.85 0.95 0.95
0.9 0.9 0.95 1. 0.9 1. 1. 0.95 0.95 1. 0.9 1. 1. 0.95
0.95 1. 0.9 0.9 0.85 0.85 0.9 1. 0.95 0.95 1. 0.95 0.95 1.
0.9 0.95 0.95]
Mean Score: 0.9500000000000002
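The output contains 45 scores because every combination of 2 groups out of 10 is held out once, and C(10, 2) = 45. Each test set holds 20 samples, which is why every score is a multiple of 0.05. The short check below is a sketch that recomputes this count; the variable names n_splits and expected are illustrative.

```python
from math import comb
from sklearn.model_selection import LeavePGroupsOut

# Same grouping as the example: 10 groups of 10 samples each
groups = [i // 10 for i in range(100)]

lpgo = LeavePGroupsOut(n_groups=2)

# get_n_splits needs the group labels to count the unique groups
n_splits = lpgo.get_n_splits(groups=groups)
expected = comb(10, 2)  # ways to choose 2 groups out of 10

print("Number of splits:", n_splits)
print("C(10, 2):", expected)
```

Because the number of splits grows combinatorially with the number of groups and with n_groups, it is worth checking this count before running an expensive model.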
The steps are as follows:
First, a synthetic binary classification dataset is generated using make_classification() with 100 samples and 5 features. The dataset is split into 10 groups, each containing 10 samples.
Next, the LeavePGroupsOut cross-validator is defined with n_groups set to 2. This specifies that two groups will be left out in each iteration of the cross-validation.
A LogisticRegression model is instantiated with default hyperparameters.
The cross_val_score function is used to evaluate the model, passing the LeavePGroupsOut object and the group labels. The scores for each cross-validation fold are printed, along with the mean score.
This example demonstrates how to use LeavePGroupsOut for cross-validation with grouped data. Because entire groups are held out of training, samples from the same group never appear in both the training and test sets. This avoids data leakage and gives a more realistic estimate of model performance on unseen groups.