The param_distributions parameter in RandomizedSearchCV is a dictionary (or a list of dictionaries) that specifies the hyperparameter search space for the random search process. Random search is a hyperparameter optimization method that samples a fixed number of random hyperparameter combinations, set by n_iter, from the provided distributions and keeps the best-performing one.
param_distributions has no default value: it is a required argument and must be supplied by the user. The keys of the dictionary are the names of the hyperparameters, and the values are either distributions to sample from (objects with an rvs method, such as those in scipy.stats) or lists of values, which are sampled uniformly.
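To make the sampling behaviour concrete, the following minimal sketch uses scikit-learn's ParameterSampler, the helper that draws candidate settings from such a dictionary. The `space` dictionary and the choice of five draws are illustrative assumptions, not part of the original example.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Hypothetical search space, for illustration only
space = {
    "n_estimators": randint(5, 50),   # distribution: drawn via its rvs() method (5..49)
    "max_depth": [None, 5, 10, 20],   # list: each value is equally likely
}

# Draw 5 candidate settings; random_state makes the draws repeatable
candidates = list(ParameterSampler(space, n_iter=5, random_state=42))
for params in candidates:
    print(params)
```

Each printed dictionary is one candidate configuration of the kind RandomizedSearchCV would evaluate with cross-validation.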
When choosing distributions for hyperparameters, it’s important to consider their type and range. For continuous parameters, such as learning rate or regularization strength, a uniform or log-uniform distribution can be used. For integer parameters, such as the number of estimators or maximum depth, an integer uniform distribution or a list of specific values can be used. For categorical parameters, a list of the possible values should be provided.
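As a rough illustration of these choices, the sketch below pairs each parameter type with a scipy.stats distribution or a list. The parameter names and ranges are hypothetical and not tied to any particular estimator; they only show how each kind of value can be expressed.

```python
from scipy.stats import loguniform, randint, uniform

# Hypothetical parameter names and ranges, chosen only to show each type
search_space = {
    "learning_rate": loguniform(1e-4, 1e-1),  # continuous, spans several orders of magnitude
    "subsample": uniform(0.5, 0.5),           # continuous on [0.5, 1.0]: uniform(loc, scale)
    "max_depth": randint(2, 11),              # integers 2..10 (upper bound is exclusive)
    "criterion": ["gini", "entropy"],         # categorical: plain list of options
}

# Draw samples from each distribution to check the ranges before running a search
samples = {
    name: dist.rvs(size=1000, random_state=0)
    for name, dist in search_space.items()
    if hasattr(dist, "rvs")
}
for name, values in samples.items():
    print(f"{name}: min={values.min():.4g}, max={values.max():.4g}")
```

Note that uniform takes a start and a width rather than a start and an end, and that randint excludes its upper bound; checking sampled ranges this way is a cheap guard against off-by-one mistakes in the search space.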
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
# Generate a synthetic multiclass classification dataset
X, y = make_classification(n_samples=100, n_classes=3, n_informative=5, n_redundant=5, random_state=42)
# Define the parameter distributions for RandomForestClassifier
param_distributions = {
    'n_estimators': randint(5, 50),
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 20),
}
# Create a base RandomForestClassifier model
rf = RandomForestClassifier(random_state=42)
# Run RandomizedSearchCV with the defined parameter distributions
search = RandomizedSearchCV(rf, param_distributions, n_iter=20, cv=5, random_state=42)
search.fit(X, y)
print(f"Best score with param_distributions: {search.best_score_:.3f}")
print(f"Best parameters with param_distributions: {search.best_params_}")
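# Run RandomizedSearchCV with an empty search space: only the model's default
# hyperparameters are evaluated (scikit-learn warns that the space is smaller than n_iter)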
search_default = RandomizedSearchCV(rf, {}, n_iter=20, cv=5, random_state=42)
search_default.fit(X, y)
print(f"\nBest score without param_distributions: {search_default.best_score_:.3f}")
print(f"Best parameters without param_distributions: {search_default.best_params_}")
Running the example gives an output like:
Best score with param_distributions: 0.710
Best parameters with param_distributions: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 13, 'n_estimators': 34}
Best score without param_distributions: 0.730
Best parameters without param_distributions: {}
The steps are as follows:
- Generate a synthetic multiclass classification dataset using make_classification().
- Define the param_distributions dictionary with distributions for RandomForestClassifier hyperparameters.
- Create a base RandomForestClassifier model.
- Run RandomizedSearchCV with the defined param_distributions and report the best score and parameters.
- Run RandomizedSearchCV with an empty param_distributions dictionary, so that only the default hyperparameters are evaluated, to contrast the results.
- Compare the best scores and parameters with and without a search space. In the sample output above, the default configuration scored slightly higher (0.730 vs. 0.710), a reminder that 20 random draws on a small dataset will not necessarily beat the defaults.
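Beyond the single best score, it can help to look at every sampled candidate. The sketch below is one possible way to do that through the cv_results_ attribute of a fitted search; the reduced search space, n_iter=5, and cv=3 are assumptions made to keep the example quick, not settings from the example above.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Same synthetic dataset as in the example above
X, y = make_classification(n_samples=100, n_classes=3, n_informative=5,
                           n_redundant=5, random_state=42)

# Smaller search (assumed settings) so the sketch runs quickly
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": randint(5, 50), "max_depth": [None, 5, 10, 20]},
    n_iter=5, cv=3, random_state=42,
)
search.fit(X, y)

# cv_results_ holds one entry per sampled candidate; sort by rank (1 = best)
results = search.cv_results_
order = np.argsort(results["rank_test_score"])
for i in order:
    mean = results["mean_test_score"][i]
    std = results["std_test_score"][i]
    print(f"{mean:.3f} +/- {std:.3f}  {results['params'][i]}")
```

Comparing the spread of scores across candidates with the standard deviation across folds gives a sense of whether differences between configurations are meaningful or within noise.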