The return_train_score parameter in RandomizedSearchCV controls whether training scores are computed and returned during the hyperparameter search.
Random search is a hyperparameter optimization method that evaluates a fixed number of randomly sampled hyperparameter combinations to find the best-performing model.
The return_train_score parameter is a boolean that defaults to False. When set to True, the training scores are computed and included in the cv_results_ attribute of the RandomizedSearchCV object.
Computing training scores can be useful for evaluating the model’s performance on the training set and detecting overfitting. However, it increases the computation time and memory usage, especially for large datasets or complex models.
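For instance, a minimal sketch (using a small synthetic dataset and a search space chosen purely for illustration) shows the extra train-score entries that appear in cv_results_ when return_train_score=True, and uses them to estimate the train/test gap of the best candidate:

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small illustrative dataset and search space (values are arbitrary)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
param_dist = {'max_depth': randint(2, 20)}

search = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_dist,
                            n_iter=5, cv=3, return_train_score=True,
                            random_state=0)
search.fit(X, y)

# Train-score entries exist only because return_train_score=True
print(search.cv_results_['mean_train_score'])

# A large gap between train and test score for the best candidate hints at overfitting
best = search.best_index_
gap = (search.cv_results_['mean_train_score'][best]
       - search.cv_results_['mean_test_score'][best])
print(f"Train-test gap for best candidate: {gap:.3f}")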
As a heuristic, set return_train_score to False (default) unless you specifically need the training scores for analysis or debugging purposes.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import time

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

# Define a parameter distribution for RandomForestClassifier hyperparameters
param_dist = {'n_estimators': randint(10, 100),
              'max_depth': [None, 5, 10],
              'min_samples_split': randint(2, 10)}

# Create a base RandomForestClassifier model
rf = RandomForestClassifier(random_state=42)

# List of return_train_score values to test
return_train_score_values = [False, True]

for return_train_score in return_train_score_values:
    start_time = time.perf_counter()

    # Run RandomizedSearchCV with the current return_train_score value
    search = RandomizedSearchCV(rf, param_dist, n_iter=10, cv=5,
                                return_train_score=return_train_score,
                                random_state=42)
    search.fit(X, y)

    end_time = time.perf_counter()
    execution_time = end_time - start_time

    print(f"Keys in cv_results_ for return_train_score={return_train_score}:")
    print(list(search.cv_results_.keys()))
    print(f"Execution time for return_train_score={return_train_score}: {execution_time:.2f} seconds")
    print()
Running the example gives an output like:
Keys in cv_results_ for return_train_score=False:
['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_depth', 'param_min_samples_split', 'param_n_estimators', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score']
Execution time for return_train_score=False: 6.43 seconds
Keys in cv_results_ for return_train_score=True:
['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_max_depth', 'param_min_samples_split', 'param_n_estimators', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score', 'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 'mean_train_score', 'std_train_score']
Execution time for return_train_score=True: 6.78 seconds
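As a follow-up (a sketch assuming pandas is installed and that search is left over from the last loop iteration above, i.e. the run with return_train_score=True), the train and test scores can be lined up per candidate for a quick side-by-side comparison:

import pandas as pd

# search is the last RandomizedSearchCV run above (return_train_score=True)
results = pd.DataFrame(search.cv_results_)
print(results[['params', 'mean_train_score', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())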
The steps are as follows:
- Generate a synthetic binary classification dataset using make_classification().
- Define a parameter distribution dictionary param_dist for RandomForestClassifier hyperparameters.
- Create a base RandomForestClassifier model rf.
- Iterate over different return_train_score values (False and True).
- For each return_train_score value:
  - Record the start time.
  - Run RandomizedSearchCV with 10 iterations and 5-fold cross-validation.
  - Record the end time and calculate the execution time.
  - Print the keys in cv_results_ to show the presence or absence of training scores.
  - Print the execution time for the current return_train_score value.
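As a further note, the per-fold training scores are also available individually. A small continuation of the example (still using the fitted search from the last iteration, with cv=5) pulls them out for the best candidate:

# Per-fold training scores for the best candidate (cv=5 in the example above)
best = search.best_index_
fold_train_scores = [search.cv_results_[f'split{i}_train_score'][best]
                     for i in range(5)]
print(fold_train_scores)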