Scikit-Learn Configure RandomizedSearchCV "param_distributions" Parameter

The param_distributions parameter in RandomizedSearchCV is a dictionary that defines the hyperparameter search space for random search. Random search is a hyperparameter optimization method that draws a fixed number of random hyperparameter combinations (set by n_iter) from the provided distributions, scores each one with cross-validation, and keeps the best performing combination.

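param_distributions has no default value: it is a required argument and must always be specified explicitly. The keys of the dictionary are the names of the hyperparameters, and the values are either distributions (any object with an rvs method, such as those in scipy.stats) or lists of values to sample from. Lists are sampled uniformly. A list of such dictionaries is also accepted.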
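As a minimal sketch of this key/value structure, the search space can be previewed with ParameterSampler from sklearn.model_selection, which draws candidate settings from the same kind of dictionary without fitting any model. The hyperparameter names and ranges below are illustrative assumptions, not recommended values.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Keys are hyperparameter names; values are either a list (sampled uniformly)
# or an object with an rvs() method, such as a scipy.stats distribution.
param_distributions = {
    "n_estimators": randint(5, 50),   # integers 5..49 (upper bound is exclusive)
    "max_depth": [None, 5, 10, 20],   # one of the listed values
}

# Draw 5 candidate settings; no estimator is trained here
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=42))
for params in samples:
    print(params)
```

Each printed dictionary is one candidate that a search over this space could evaluate.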

When choosing distributions for hyperparameters, it’s important to consider their type and range. For continuous parameters, such as learning rate or regularization strength, a uniform or log-uniform distribution can be used. For integer parameters, such as the number of estimators or maximum depth, an integer uniform distribution or a list of specific values can be used. For categorical parameters, a list of the possible values should be provided.
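The pairing of parameter type and distribution described above might look like the following sketch. It assumes scipy.stats.uniform, loguniform, and randint as the distribution objects; the parameter names are borrowed from gradient boosting models purely for illustration, and the ranges are arbitrary.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import ParameterSampler

# Illustrative search space (names and ranges are assumptions for this sketch)
space = {
    "learning_rate": loguniform(1e-3, 1e0),    # continuous, spread evenly across orders of magnitude
    "subsample": uniform(loc=0.5, scale=0.5),  # continuous, covers [loc, loc + scale] = [0.5, 1.0]
    "n_estimators": randint(50, 200),          # integer, 50..199 (upper bound is exclusive)
    "max_features": ["sqrt", "log2", None],    # categorical, sampled uniformly from the list
}

# Draw 100 settings to check that every value stays inside its intended range
draws = list(ParameterSampler(space, n_iter=100, random_state=0))
print(draws[0])
```

A log-uniform distribution is usually the more natural choice when plausible values span several orders of magnitude, since a plain uniform distribution would place most draws near the upper end of the range.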

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Generate a synthetic multiclass classification dataset
X, y = make_classification(n_samples=100, n_classes=3, n_informative=5, n_redundant=5, random_state=42)

# Define the parameter distributions for RandomForestClassifier
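param_distributions = {
    'n_estimators': randint(5, 50),       # integers 5..49 (upper bound is exclusive)
    'max_depth': [None, 5, 10, 20],       # list: one value is picked uniformly
    'min_samples_split': randint(2, 20),  # integers 2..19
    'min_samples_leaf': randint(1, 20)    # integers 1..19
}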

# Create a base RandomForestClassifier model
rf = RandomForestClassifier(random_state=42)

# Run RandomizedSearchCV with the defined parameter distributions
search = RandomizedSearchCV(rf, param_distributions, n_iter=20, cv=5, random_state=42)
search.fit(X, y)

print(f"Best score with param_distributions: {search.best_score_:.3f}")
print(f"Best parameters with param_distributions: {search.best_params_}")

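# Baseline: pass an empty search space, so only the default hyperparameters are evaluated
# (the space holds a single candidate, so scikit-learn warns that n_iter is too large and runs it once)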
search_default = RandomizedSearchCV(rf, {}, n_iter=20, cv=5, random_state=42)
search_default.fit(X, y)

print(f"\nBest score without param_distributions: {search_default.best_score_:.3f}")
print(f"Best parameters without param_distributions: {search_default.best_params_}")

Running the example gives an output like:

Best score with param_distributions: 0.710
Best parameters with param_distributions: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 13, 'n_estimators': 34}
Best score without param_distributions: 0.730
Best parameters without param_distributions: {}

The steps are as follows:

Generate a synthetic multiclass classification dataset using make_classification().
Define the param_distributions dictionary with distributions for RandomForestClassifier hyperparameters.
Create a base RandomForestClassifier model.
Run RandomizedSearchCV with the defined param_distributions and report the best score and parameters.
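Run RandomizedSearchCV with an empty param_distributions dictionary, which evaluates only the default hyperparameters, as a baseline.
Compare the best scores and parameters of the two searches. In the run shown above the defaults score slightly higher (0.730 versus 0.710), so 20 random draws on a small dataset are not guaranteed to beat the defaults.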
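To look beyond the single best candidate when comparing results, the fitted search object exposes every sampled setting in its cv_results_ attribute. The following is a small sketch under assumed settings (a reduced search space, n_iter=5, and cv=3 to keep it fast); it is not the configuration used in the example above.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=100, n_classes=3, n_informative=5,
                           n_redundant=5, random_state=42)

# Small search so the sketch runs quickly (values chosen for illustration only)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": randint(5, 30), "max_depth": [None, 5, 10]},
    n_iter=5, cv=3, random_state=42,
)
search.fit(X, y)

# cv_results_ holds one entry per sampled candidate; list them best first
results = search.cv_results_
order = np.argsort(results["rank_test_score"])
for i in order:
    mean = results["mean_test_score"][i]
    std = results["std_test_score"][i]
    print(f"{mean:.3f} +/- {std:.3f}  {results['params'][i]}")
```

Looking at the spread of scores across candidates, rather than only best_score_, helps judge whether a difference such as the one between the search and the baseline is larger than the fold-to-fold variation.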

See Also