
Scikit-Learn RandomizedSearchCV StackingRegressor

Hyperparameter tuning is essential for optimizing machine learning models. In this example, we’ll demonstrate how to use scikit-learn’s RandomizedSearchCV for hyperparameter tuning of a StackingRegressor, an ensemble method that combines multiple regression models to improve predictions.

Random search is a method for evaluating different combinations of model hyperparameters. Unlike grid search, it samples a fixed number of hyperparameter combinations from a specified distribution, making it more efficient when searching over a large hyperparameter space.
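To see how this sampling works, consider a minimal sketch (the bounds below are illustrative): scipy.stats distributions expose an rvs method, which RandomizedSearchCV calls to draw candidate values for each iteration.

from scipy.stats import randint

# randint(low, high) draws integers from [low, high), i.e. {2, 3, 4} here
cv_dist = randint(2, 5)
print(cv_dist.rvs(size=10, random_state=42))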

StackingRegressor is an ensemble method that combines predictions of multiple base regressors. It improves model accuracy by leveraging the strengths of different models.
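Concretely, StackingRegressor fits each base regressor and feeds their cross-validated predictions, as new features, to a final estimator. A minimal sketch (with an illustrative two-model ensemble) showing this stacked feature matrix via transform:

from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=42)

stack = StackingRegressor(
    estimators=[('rf', RandomForestRegressor(random_state=42)), ('ridge', Ridge())],
    final_estimator=LinearRegression()
)
stack.fit(X, y)

# transform returns one column of predictions per base regressor;
# these columns are the inputs seen by the final estimator
print(stack.transform(X).shape)  # (100, 2)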

Key hyperparameters for StackingRegressor include the list of base regressors (estimators), the regressor used to combine the base estimators (final_estimator), and the number of cross-validation folds (cv).

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from scipy.stats import randint

# Generate synthetic regression dataset
X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define base regressors and stacking model
base_regressors = [
    ('rf', RandomForestRegressor(random_state=42)),
    ('gb', GradientBoostingRegressor(random_state=42))
]
stacking_model = StackingRegressor(
    estimators=base_regressors,
    final_estimator=LinearRegression()
)

# Define hyperparameter distribution
param_dist = {
    'final_estimator': [LinearRegression(), Ridge()],
    'cv': randint(2, 5)  # internal stacking folds; scikit-learn requires at least 2
}

# Perform random search
random_search = RandomizedSearchCV(
    estimator=stacking_model,
    param_distributions=param_dist,
    n_iter=50,
    cv=5,  # outer 5-fold CV used to score each sampled configuration
    scoring='r2',
    random_state=42
)
random_search.fit(X_train, y_train)

# Report best score and parameters
print(f"Best score: {random_search.best_score_:.3f}")
print(f"Best parameters: {random_search.best_params_}")

# Evaluate on test set
best_model = random_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test set R^2: {test_score:.3f}")

Running the example gives an output like:

Best score: 0.695
Best parameters: {'cv': 2, 'final_estimator': Ridge()}
Test set R^2: 0.565

The steps are as follows:

  1. Generate a synthetic regression dataset using make_regression.
  2. Split the dataset into train and test sets using train_test_split.
  3. Define base regressors (RandomForestRegressor, GradientBoostingRegressor) and create a StackingRegressor with these and a LinearRegression as the final estimator.
  4. Define the hyperparameter distribution with different final estimators and cross-validation folds.
  5. Perform random search using RandomizedSearchCV, specifying the StackingRegressor, hyperparameter distribution, 50 iterations, 5-fold cross-validation, and R² scoring metric.
  6. Report the best cross-validation score and best set of hyperparameters found by random search (the full search history is also available, as shown in the sketch after this list).
  7. Evaluate the best model on the hold-out test set and report the R² score.
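Beyond best_score_ and best_params_, the fitted search object records every sampled configuration in its cv_results_ attribute, which is useful for checking how sensitive the ensemble is to each hyperparameter. A minimal sketch, assuming the random_search object from the example above and that pandas is available:

import pandas as pd

# Each row of cv_results_ is one sampled hyperparameter combination
results = pd.DataFrame(random_search.cv_results_)
print(results[['param_cv', 'param_final_estimator', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())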

By using RandomizedSearchCV, we can efficiently explore different hyperparameter settings for StackingRegressor and find the combination that maximizes cross-validated performance, yielding a well-tuned ensemble configuration for regression tasks.


