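The n_jobs parameter in scikit-learn’s StackingRegressor controls the number of parallel jobs used when fitting the base estimators, including the cross-validated predictions they generate for the final estimator. The final estimator itself is fit once on those predictions, outside the parallel step.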
StackingRegressor is an ensemble method that combines multiple base regressors by training a final regressor on their predictions. The n_jobs parameter allows you to leverage multiple CPU cores to speed up the training process.
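To make the stacking idea concrete, here is a minimal sketch. The small dataset and the Ridge and DecisionTreeRegressor base models are illustrative assumptions chosen only so the snippet runs quickly; they are not the models benchmarked in this article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Small synthetic problem so the sketch runs quickly (illustrative sizes).
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ],
    final_estimator=LinearRegression(),
)
stack.fit(X, y)

# transform() returns one column of predictions per base estimator.
# The final estimator consumes inputs of this form; during fit it is
# trained on cross-validated versions of these predictions.
base_predictions = stack.transform(X)
print(base_predictions.shape)               # (200, 2)
print(stack.final_estimator_.coef_.shape)   # (2,)
```

Each base estimator contributes one input feature to the final regressor, which is why the fitted LinearRegression ends up with one coefficient per base model.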
Setting n_jobs to a value greater than 1 (or to -1) enables parallel processing, which can reduce training time noticeably, especially for large datasets or expensive base estimators. However, the optimal value depends on your hardware and the nature of your data and models.
The default value for n_jobs is None, which means 1 job unless the call runs inside a joblib.parallel_backend context. Common values include -1 (use all available cores), 2, 4, or 8, depending on the number of cores available on your machine.
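To check how a given n_jobs value translates into a worker count on a particular machine, joblib (the library scikit-learn uses for parallelism) provides effective_n_jobs. The following is a minimal sketch, assuming joblib is importable, which is normally the case wherever scikit-learn is installed. The threading backend and the value 2 in the context manager are arbitrary choices for illustration.

```python
from joblib import effective_n_jobs, parallel_backend

# Resolve a few n_jobs settings to the worker count joblib reports here.
resolved = {n_jobs: effective_n_jobs(n_jobs) for n_jobs in (None, 1, 2, -1)}
for n_jobs, workers in resolved.items():
    print(f"n_jobs={n_jobs!r} -> {workers} worker(s)")

# Inside a parallel_backend context, n_jobs=None picks up the context's setting.
with parallel_backend("threading", n_jobs=2):
    in_context = effective_n_jobs(None)
print(f"n_jobs=None inside the context -> {in_context} worker(s)")
```

The number reported for -1 depends on the machine, so it is worth running this once before choosing the values to benchmark.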
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import time
import numpy as np
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define base models and final estimator
base_models = [
    ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingRegressor(n_estimators=100, random_state=42))
]
final_estimator = LinearRegression()
# Train with different n_jobs values
n_jobs_values = [1, 2, 4, -1]
results = []
for n_jobs in n_jobs_values:
    start_time = time.perf_counter()
    stacking_regressor = StackingRegressor(
        estimators=base_models,
        final_estimator=final_estimator,
        n_jobs=n_jobs
    )
    stacking_regressor.fit(X_train, y_train)
    training_time = time.perf_counter() - start_time
    y_pred = stacking_regressor.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    results.append((n_jobs, training_time, mse))
    print(f"n_jobs={n_jobs}, Training Time: {training_time:.2f}s, MSE: {mse:.4f}")
# Find best result
best_result = min(results, key=lambda x: x[1]) # Minimum training time
print(f"\nBest n_jobs: {best_result[0]}, Training Time: {best_result[1]:.2f}s, MSE: {best_result[2]:.4f}")
Running the example gives an output like:
n_jobs=1, Training Time: 7.08s, MSE: 2289.5273
n_jobs=2, Training Time: 4.37s, MSE: 2289.5273
n_jobs=4, Training Time: 4.52s, MSE: 2289.5273
n_jobs=-1, Training Time: 4.27s, MSE: 2289.5273
Best n_jobs: -1, Training Time: 4.27s, MSE: 2289.5273
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Define base models (RandomForestRegressor and GradientBoostingRegressor) and final estimator (LinearRegression)
- Train StackingRegressor models with different n_jobs values
- Measure training time and model performance (MSE) for each n_jobs setting
- Compare results to find the optimal n_jobs value
Some tips and heuristics for setting n_jobs:
- Start with n_jobs=-1 to use all available cores, then experiment with specific values
- For small datasets or simple models, the overhead of parallelization might outweigh the benefits
- Consider your system’s available resources and other running processes when setting n_jobs
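One way to leave headroom for other processes is a negative value below -1: in joblib's convention, n_jobs=-2 requests all cores but one. The sketch below assumes small, illustrative base models (Ridge and a shallow DecisionTreeRegressor) purely to keep it quick; substitute your own estimators.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Illustrative data; sizes are arbitrary.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=42)

# n_jobs=-2 asks joblib for (n_cpus + 1 - 2) workers, i.e. all cores but one
# (never fewer than one worker).
stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("tree", DecisionTreeRegressor(max_depth=3, random_state=42)),
    ],
    final_estimator=LinearRegression(),
    n_jobs=-2,
)
stack.fit(X, y)
predictions = stack.predict(X)
print(predictions.shape)
```

On a shared machine, an explicit positive value may be easier to reason about than a negative one, since it does not change with the core count.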
Issues to consider:
- Using a high n_jobs value can consume significant system resources, potentially slowing down other processes
- The speedup may not be linear with the number of cores due to communication overhead
- Some algorithms or operations may not benefit from parallelization, so always benchmark to confirm improvements
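Single timing runs can be noisy, and the first parallel fit often includes worker start-up cost, so a benchmark is more trustworthy when each setting is timed several times. The sketch below is one possible approach, not part of scikit-learn: median_fit_time is a hypothetical helper, and the small dataset and base models are assumptions chosen to keep the run short.

```python
import statistics
import time

from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor


def median_fit_time(model, X, y, repeats=3):
    """Hypothetical helper: fit a fresh clone `repeats` times, return the median time."""
    timings = []
    for _ in range(repeats):
        candidate = clone(model)  # unfitted copy, so every run starts clean
        start = time.perf_counter()
        candidate.fit(X, y)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Illustrative data and base models; replace with your own.
X, y = make_regression(n_samples=300, n_features=10, noise=0.1, random_state=42)
base_models = [
    ("ridge", Ridge()),
    ("tree", DecisionTreeRegressor(max_depth=4, random_state=42)),
]

timings = {}
for n_jobs in (1, 2):
    model = StackingRegressor(
        estimators=base_models,
        final_estimator=LinearRegression(),
        n_jobs=n_jobs,
    )
    timings[n_jobs] = median_fit_time(model, X, y)
    print(f"n_jobs={n_jobs}, median fit time: {timings[n_jobs]:.3f}s")
```

With models this cheap, the parallel setting may well come out slower than n_jobs=1, which illustrates the overhead point above.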