
Scikit-Learn RepeatedKFold Data Splitting

RepeatedKFold is a cross-validation strategy that runs K-Fold cross-validation multiple times, with a different randomization of the folds on each repetition, to produce a more robust estimate of model performance. This technique is particularly useful for small datasets or models with high variance, where a single K-Fold split can give a noisy estimate.

The key parameters of RepeatedKFold include n_splits (number of splits), n_repeats (number of repetitions), and random_state (seed for reproducibility).
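As a quick sanity check on these parameters, the following sketch (using a hypothetical toy array) shows that RepeatedKFold produces n_splits × n_repeats train/test partitions in total, and that a fixed random_state reproduces the same folds:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# toy data: only the row count (10 samples) matters here
X = np.arange(20).reshape(10, 2)

# 5 folds repeated twice -> 5 * 2 = 10 train/test partitions in total
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)
print(cv.get_n_splits(X))  # 10

for train_idx, test_idx in cv.split(X):
    # each test fold holds 10 / 5 = 2 samples
    assert len(test_idx) == 2

# the same random_state reproduces identical folds
first = [tuple(test) for _, test in RepeatedKFold(n_splits=5, n_repeats=2, random_state=1).split(X)]
second = [tuple(test) for _, test in RepeatedKFold(n_splits=5, n_repeats=2, random_state=1).split(X)]
assert first == second
```

Note that each repetition reshuffles the data before splitting, so the folds differ across repeats even though the fold sizes stay the same.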

The technique is appropriate for both regression and classification problems.
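To illustrate the regression case, the same procedure can be sketched with a regression model and metric; Ridge and make_regression here are stand-ins chosen for illustration, not part of the classification example below:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

# synthetic regression dataset (sizes mirror the classification example)
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=1)

# identical evaluation procedure, paired with a regression metric
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring='neg_mean_absolute_error')

# scikit-learn reports errors as negative scores, so negate for reporting
print('MAE: %.3f (%.3f)' % (-scores.mean(), scores.std()))
```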

from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# generate synthetic dataset
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)

# define model
model = LogisticRegression()

# define evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# evaluate model
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')

# report performance
print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

Running the example gives an output like:

Accuracy: 0.967 (0.047)

The steps are as follows:

  1. Generate a synthetic binary classification dataset using the make_classification() function. Specify parameters for sample size (n_samples), number of features (n_features), number of classes (n_classes), and a fixed random seed (random_state).

  2. Define a LogisticRegression model instance.

  3. Define the cross-validation procedure using RepeatedKFold with parameters for the number of splits (n_splits), number of repeats (n_repeats), and a random seed (random_state).

  4. Evaluate the model’s performance using cross_val_score(), specifying the model, dataset, cross-validation strategy, and scoring metric (accuracy).

  5. Print the mean and standard deviation of the accuracy scores to summarize the model’s performance across multiple folds and repeats.
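To see how much the fold assignment alone moves the estimate, the 30 scores (10 folds × 3 repeats) can also be summarized per repeat. This sketch re-runs the example above and relies on RepeatedKFold yielding all folds of the first repeat, then the second, and so on:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# same setup as the example above
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, random_state=1)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(LogisticRegression(), X, y, cv=cv, scoring='accuracy')

# folds arrive in repeat order, so reshaping gives one row of fold scores per repeat
per_repeat = scores.reshape(3, 10).mean(axis=1)
for i, mean in enumerate(per_repeat, start=1):
    print('repeat %d: %.3f' % (i, mean))
```

If the per-repeat means differ noticeably, averaging over repeats (as the main example does) is doing useful work in stabilizing the estimate.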



See Also