RandomizedSearchCV is a versatile tool for hyperparameter tuning that searches a parameter space by random sampling.
One important setting to consider when configuring RandomizedSearchCV is the `cv` parameter, which determines the number of cross-validation splits used during the search; after fitting, that number is exposed as the `n_splits_` attribute.
Adjusting `cv` lets you control the trade-off between computational cost and statistical performance. A higher number of splits will result in more robust performance estimates but will also increase computation time.
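As a minimal sketch of the relationship between the two: `cv` can be an integer or a splitter object, and the fitted search object reports the resulting number of splits via `n_splits_` (the estimator and parameter distribution here are arbitrary, chosen just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from scipy.stats import loguniform

X, y = make_classification(n_samples=100, n_features=5, random_state=42)

# cv may be an int or a splitter object such as StratifiedKFold
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=5,
    cv=StratifiedKFold(n_splits=3),
    random_state=42,
)
search.fit(X, y)

# After fitting, n_splits_ reports the number of CV splits actually used
print(search.n_splits_)  # 3
```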
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import time

# Generate a synthetic classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

# Set up a RandomForestClassifier
rf = RandomForestClassifier(random_state=42)

# Define a parameter distribution to search over
param_dist = {'n_estimators': randint(5, 50),
              'max_depth': [3, 5, 10, None]}

# Compare different numbers of cross-validation splits (the cv parameter)
for n_splits in [2, 5, 10]:
    start_time = time.time()
    random_search = RandomizedSearchCV(rf, param_distributions=param_dist,
                                       n_iter=10, cv=n_splits, random_state=42)
    random_search.fit(X, y)
    end_time = time.time()
    print(f"n_splits_={random_search.n_splits_}, "
          f"Time: {end_time - start_time:.2f}s, "
          f"Best score: {random_search.best_score_:.3f}")
```
Running the example gives an output like:
n_splits_=2, Time: 0.70s, Best score: 0.840
n_splits_=5, Time: 1.62s, Best score: 0.860
n_splits_=10, Time: 3.37s, Best score: 0.860
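To see where the extra computation goes, here is a quick sketch (a fresh search, not part of the timing loop above): `cv_results_` stores one test-score column per split, `split0_test_score` through `split{n_splits_-1}_test_score`, so a higher `cv` means more fits and more score columns per candidate:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": randint(5, 50)},
    n_iter=3, cv=5, random_state=42)
search.fit(X, y)

# One test-score column per split: split0_test_score ... split4_test_score
split_cols = [c for c in search.cv_results_ if c.startswith("split")]
print(len(split_cols))    # 5
print(search.n_splits_)   # 5
```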
The steps in this example are:
- Generate a synthetic classification dataset using `make_classification`.
- Set up a `RandomForestClassifier` and define a parameter distribution to search over.
- Iterate over different numbers of splits (2, 5, and 10).
- For each value:
  - Create a `RandomizedSearchCV` object with `cv` set to that number of splits.
  - Fit the random search object and record the computation time and best score.
- Compare the runtime and best score for each setting.
By comparing the results for different numbers of splits, you can observe the impact on computation time and on the best score achieved. This can help you choose a `cv` value that balances computational cost and statistical performance for your specific problem and resources.
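If you need more robust estimates than a single k-fold pass provides, one option (a sketch under the same synthetic data as above, not part of the original example) is to pass a `RepeatedStratifiedKFold` splitter as `cv`; the fitted `n_splits_` then reports the total number of splits, i.e. folds times repeats:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, RepeatedStratifiedKFold
from scipy.stats import randint

X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           n_redundant=5, random_state=42)

# 5 folds repeated 2 times -> 10 fits per candidate, averaging out fold noise
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": randint(5, 50)},
    n_iter=3, cv=cv, random_state=42)
search.fit(X, y)

print(search.n_splits_)  # 10 total splits (5 folds x 2 repeats)
print(f"Best score: {search.best_score_:.3f}")
```

The extra repeats roughly double the fitting cost here, so this is only worth it when fold-to-fold variance is a concern.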