KFold and RepeatedKFold are both cross-validation techniques provided by scikit-learn. Understanding the differences between them can help in choosing the right method for model evaluation.
KFold performs a single k-fold cross-validation: the dataset is split into n_splits folds, and the model is trained on n_splits - 1 folds and tested on the remaining fold, rotating until every fold has served as the test set exactly once. Key hyperparameters for KFold include n_splits (the number of folds) and shuffle (whether to shuffle the data before splitting).
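As a minimal sketch of this splitting behavior (the toy array and variable names here are illustrative, not part of the main example below), KFold with 5 splits on 10 samples yields 5 train/test index pairs, with each sample appearing in exactly one test fold:

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 samples, 1 feature
kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)
for i, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy)):
    # Each fold holds out 2 of the 10 samples for testing
    print(f"Fold {i}: train={train_idx}, test={test_idx}")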
RepeatedKFold extends KFold by repeating the k-fold cross-validation process multiple times with different random splits, providing a more robust estimate of model performance. Key hyperparameters for RepeatedKFold include n_splits (the number of folds), n_repeats (the number of times to repeat the cross-validation), and random_state (for reproducibility).
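A quick sketch on the same kind of toy array (again illustrative only) shows that RepeatedKFold simply produces n_splits * n_repeats train/test pairs, reshuffling the data before each repeat:

import numpy as np
from sklearn.model_selection import RepeatedKFold

X_toy = np.arange(10).reshape(-1, 1)
rkf_toy = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
print(rkf_toy.get_n_splits(X_toy))  # 15 splits: 5 folds x 3 repeats
for train_idx, test_idx in rkf_toy.split(X_toy):
    pass  # 15 iterations in total, 5 per repeat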
The main difference between the two methods is that KFold performs a single cross-validation run, while RepeatedKFold performs multiple runs, reducing the variance in the performance estimate. KFold is simpler and faster, making it suitable for quick model evaluations. In contrast, RepeatedKFold offers a more comprehensive evaluation by averaging results over multiple runs.
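The example below puts both strategies side by side: the same LinearRegression model is evaluated on a synthetic regression dataset, first with a single 5-fold split and then with the 5-fold procedure repeated 10 times.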
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, RepeatedKFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np
# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Define model
model = LinearRegression()
# Evaluate using KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
# Evaluate using RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
rkf_scores = cross_val_score(model, X, y, cv=rkf, scoring='neg_mean_squared_error')
# Output scores
print(f"KFold scores: {kf_scores}")
print(f"RepeatedKFold scores: {rkf_scores}")
print(f"Mean KFold score: {np.mean(kf_scores)}")
print(f"Mean RepeatedKFold score: {np.mean(rkf_scores)}")
Running the example gives an output like:
KFold scores: [-0.01142775 -0.00946247 -0.01054515 -0.01144154 -0.01065718]
RepeatedKFold scores: [-0.01142775 -0.00946247 -0.01054515 -0.01144154 -0.01065718 -0.01013907
-0.01007182 -0.01097885 -0.01133617 -0.01066903 -0.01045655 -0.00970491
-0.01003722 -0.01195605 -0.01087992 -0.01037582 -0.01033117 -0.01022291
-0.01098296 -0.01101455 -0.00986439 -0.01136539 -0.00921436 -0.01090168
-0.0122378 -0.01127811 -0.00984213 -0.00951824 -0.01085106 -0.01145165
-0.01002567 -0.01112099 -0.01195406 -0.0105956 -0.00902033 -0.01030202
-0.0118322 -0.01184327 -0.0089605 -0.01001118 -0.01070117 -0.01125958
-0.01038447 -0.01062367 -0.01083241 -0.00971531 -0.01317321 -0.01010037
-0.01016452 -0.01001596]
Mean KFold score: -0.010706818851526514
Mean RepeatedKFold score: -0.01063704779321073
The steps are as follows:
- Generate a synthetic regression dataset using make_regression.
- Define a LinearRegression model.
- Instantiate KFold with 5 splits, shuffling enabled, and a random seed for reproducibility.
- Evaluate the model using cross_val_score with negative mean squared error as the scoring metric for KFold.
- Instantiate RepeatedKFold with 5 splits, 10 repeats, and a random seed.
- Evaluate the model using cross_val_score with negative mean squared error for RepeatedKFold.
- Compare the scores from KFold and RepeatedKFold by printing individual fold scores and mean scores for both methods, highlighting the differences in performance estimates.
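To make the variance reduction concrete, one option (a sketch continuing from the example above, relying on cross_val_score returning the 50 RepeatedKFold scores in split order, repeat by repeat) is to average the scores within each repeat and look at how much those per-repeat means fluctuate:

# Group the 50 RepeatedKFold scores into 10 repeats of 5 folds each
per_repeat_means = rkf_scores.reshape(10, 5).mean(axis=1)
print(f"Per-repeat mean scores: {per_repeat_means}")
# The spread of these means shows how much a single KFold run can vary
print(f"Std of per-repeat means: {np.std(per_repeat_means):.6f}")

Each per-repeat mean is what a single KFold run would have reported; averaging them, as RepeatedKFold does, smooths out that run-to-run fluctuation.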