The return_train_score parameter in RandomizedSearchCV controls whether training scores are computed and returned during the hyperparameter search.
Random search is a hyperparameter optimization method that evaluates a fixed number of randomly sampled hyperparameter combinations to find the best-performing model.
The return_train_score parameter is a boolean that defaults to False. When set to True, the training scores are computed and included in the cv_results_ attribute of the RandomizedSearchCV object.
Computing training scores can be useful for evaluating the model’s performance on the training set and detecting overfitting. However, it increases the computation time and memory usage, especially for large datasets or complex models.
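For instance, a minimal sketch (using a small synthetic dataset and a search space chosen purely for illustration) shows the extra train-score entries that appear in cv_results_ when return_train_score=True, and uses them to estimate the train/test gap of the best candidate:

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small illustrative dataset and search space (values are arbitrary)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
param_dist = {'max_depth': randint(2, 20)}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_dist,
                            n_iter=5, cv=3, return_train_score=True,
                            random_state=0)
search.fit(X, y)

# Train-score entries exist only because return_train_score=True
print(search.cv_results_['mean_train_score'])

# A large gap between train and test score for the best candidate hints at overfitting
best = search.best_index_
gap = (search.cv_results_['mean_train_score'][best]
       - search.cv_results_['mean_test_score'][best])
print(f"Train-test gap for best candidate: {gap:.3f}")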
As a heuristic, set return_train_score to False (default) unless you specifically need the training scores for analysis or debugging purposes.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import time

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

# Define a parameter distribution for RandomForestClassifier hyperparameters
param_dist = {'n_estimators': randint(10, 100),
              'max_depth': [None, 5, 10],
              'min_samples_split': randint(2, 10)}

# Create a base RandomForestClassifier model
rf = RandomForestClassifier(random_state=42)

# List of return_train_score values to test
return_train_score_values = [False, True]

for return_train_score in return_train_score_values:
    start_time = time.perf_counter()

    # Run RandomizedSearchCV with the current return_train_score value
    search = RandomizedSearchCV(rf, param_dist, n_iter=10, cv=5,
                                return_train_score=return_train_score,
                                random_state=42)
    search.fit(X, y)

    end_time = time.perf_counter()
    execution_time = end_time - start_time

    print(f"Keys in cv_results_ for return_train_score={return_train_score}:")
    print(list(search.cv_results_.keys()))
    print(f"Execution time for return_train_score={return_train_score}: {execution_time:.2f} seconds")
    print()
Running the example gives an output like:
Keys in cv_results_ for return_train_score=False:
['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_depth', 'param_min_samples_split', 'param_n_estimators', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score']
Execution time for return_train_score=False: 6.43 seconds
Keys in cv_results_ for return_train_score=True:
['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_depth', 'param_min_samples_split', 'param_n_estimators', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'mean_train_score', 'std_train_score']
Execution time for return_train_score=True: 6.78 seconds
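As a follow-up (a sketch assuming pandas is installed and that search is left over from the last loop iteration above, i.e. the run with return_train_score=True), the train and test scores can be lined up per candidate for a quick side-by-side comparison:

import pandas as pd

# search is the last RandomizedSearchCV run above (return_train_score=True)
results = pd.DataFrame(search.cv_results_)
print(results[['params', 'mean_train_score', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())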
The steps are as follows:
- Generate a synthetic binary classification dataset using make_classification().
- Define a parameter distribution dictionary param_dist for RandomForestClassifier hyperparameters.
- Create a base RandomForestClassifier model rf.
- Iterate over different return_train_score values (False and True).
- For each return_train_score value:
  - Record the start time.
  - Run RandomizedSearchCV with 10 iterations and 5-fold cross-validation.
  - Record the end time and calculate the execution time.
  - Print the keys in cv_results_ to show the presence or absence of training scores.
  - Print the execution time for the current return_train_score value.
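As a further note, the per-fold training scores are also available individually. A small continuation of the example (still using the fitted search from the last iteration, with cv=5) pulls them out for the best candidate:

# Per-fold training scores for the best candidate (cv=5 in the example above)
best = search.best_index_
fold_train_scores = [search.cv_results_[f'split{i}_train_score'][best]
                     for i in range(5)]
print(fold_train_scores)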