KFold and RepeatedKFold are both cross-validation techniques provided by scikit-learn. Understanding the differences between them can help in choosing the right method for model evaluation.
KFold performs a single k-fold cross-validation: the dataset is split into n_splits folds, and the model is trained on n_splits - 1 folds and tested on the remaining fold, rotating until every fold has served as the test set exactly once. Key hyperparameters for KFold include n_splits (the number of folds) and shuffle (whether to shuffle the data before splitting).
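As a minimal sketch of this splitting behavior (the toy array and variable names here are illustrative, not part of the main example below), KFold with 5 splits on 10 samples yields 5 train/test index pairs, with each sample appearing in exactly one test fold:

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 samples, 1 feature
kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)
for i, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy)):
    # Each fold holds out 2 of the 10 samples for testing
    print(f"Fold {i}: train={train_idx}, test={test_idx}")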
RepeatedKFold extends KFold by repeating the k-fold cross-validation process multiple times with different random splits, providing a more robust estimate of model performance. Key hyperparameters for RepeatedKFold include n_splits (the number of folds), n_repeats (the number of times to repeat the cross-validation), and random_state (for reproducibility).
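A quick sketch on the same kind of toy array (again illustrative only) shows that RepeatedKFold simply produces n_splits * n_repeats train/test pairs, reshuffling the data before each repeat:

import numpy as np
from sklearn.model_selection import RepeatedKFold

X_toy = np.arange(10).reshape(-1, 1)
rkf_toy = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
print(rkf_toy.get_n_splits(X_toy))  # 15 splits: 5 folds x 3 repeats
for train_idx, test_idx in rkf_toy.split(X_toy):
    pass  # 15 iterations in total, 5 per repeat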
The main difference between the two methods is that KFold performs a single cross-validation run, while RepeatedKFold performs multiple runs, reducing the variance in the performance estimate. KFold is simpler and faster, making it suitable for quick model evaluations. In contrast, RepeatedKFold offers a more comprehensive evaluation by averaging results over multiple runs.
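The example below puts both strategies side by side: the same LinearRegression model is evaluated on a synthetic regression dataset, first with a single 5-fold split and then with the 5-fold procedure repeated 10 times.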
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, RepeatedKFold, cross_val_score
from sklearn.linear_model import LinearRegression
import numpy as np
# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Define model
model = LinearRegression()
# Evaluate using KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kf_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error')
# Evaluate using RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
rkf_scores = cross_val_score(model, X, y, cv=rkf, scoring='neg_mean_squared_error')
# Output scores
print(f"KFold scores: {kf_scores}")
print(f"RepeatedKFold scores: {rkf_scores}")
print(f"Mean KFold score: {np.mean(kf_scores)}")
print(f"Mean RepeatedKFold score: {np.mean(rkf_scores)}")
Running the example gives an output like:
KFold scores: [-0.01142775 -0.00946247 -0.01054515 -0.01144154 -0.01065718]
RepeatedKFold scores: [-0.01142775 -0.00946247 -0.01054515 -0.01144154 -0.01065718 -0.01013907
-0.01007182 -0.01097885 -0.01133617 -0.01066903 -0.01045655 -0.00970491
-0.01003722 -0.01195605 -0.01087992 -0.01037582 -0.01033117 -0.01022291
-0.01098296 -0.01101455 -0.00986439 -0.01136539 -0.00921436 -0.01090168
-0.0122378 -0.01127811 -0.00984213 -0.00951824 -0.01085106 -0.01145165
-0.01002567 -0.01112099 -0.01195406 -0.0105956 -0.00902033 -0.01030202
-0.0118322 -0.01184327 -0.0089605 -0.01001118 -0.01070117 -0.01125958
-0.01038447 -0.01062367 -0.01083241 -0.00971531 -0.01317321 -0.01010037
-0.01016452 -0.01001596]
Mean KFold score: -0.010706818851526514
Mean RepeatedKFold score: -0.01063704779321073
The steps are as follows:
- Generate a synthetic regression dataset using make_regression.
- Define a LinearRegression model.
- Instantiate KFold with 5 splits, shuffling enabled, and a random seed for reproducibility.
- Evaluate the model using cross_val_score with negative mean squared error as the scoring metric for KFold.
- Instantiate RepeatedKFold with 5 splits, 10 repeats, and a random seed.
- Evaluate the model using cross_val_score with negative mean squared error for RepeatedKFold.
- Compare the scores from KFold and RepeatedKFold by printing individual fold scores and mean scores for both methods, highlighting the differences in performance estimates.
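To make the variance reduction concrete, one option (a sketch continuing from the example above, relying on cross_val_score returning the 50 RepeatedKFold scores in split order, repeat by repeat) is to average the scores within each repeat and look at how much those per-repeat means fluctuate:

# Group the 50 RepeatedKFold scores into 10 repeats of 5 folds each
per_repeat_means = rkf_scores.reshape(10, 5).mean(axis=1)
print(f"Per-repeat mean scores: {per_repeat_means}")
# The spread of these means shows how much a single KFold run can vary
print(f"Std of per-repeat means: {np.std(per_repeat_means):.6f}")

Each per-repeat mean is what a single KFold run would have reported; averaging them, as RepeatedKFold does, smooths out that run-to-run fluctuation.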