Configure BaggingRegressor "n_jobs" Parameter

The n_jobs parameter in scikit-learn’s BaggingRegressor controls the number of jobs to run in parallel for both fit and predict.

Bagging is an ensemble method that combines predictions from multiple base estimators trained on random subsets of the original dataset. The n_jobs parameter determines how many of these base estimators are fit or used for prediction concurrently.

Setting n_jobs to a value greater than 1 can speed up computation considerably, especially for large datasets or expensive base estimators. The gain is rarely linear, because starting workers and moving data between them adds overhead, so the best value depends on your hardware, data, and model.

The default value for n_jobs is None, which means 1 unless the call is made inside a joblib.parallel_backend context. A value of -1 uses all available processors; other common choices are 2, 4, or 8, depending on the number of CPU cores.
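To check how a given n_jobs value resolves on your own machine, the following is a minimal sketch using joblib, the library scikit-learn relies on for parallelism. It assumes joblib can be imported directly (it is installed alongside scikit-learn), and the printed worker counts depend on your hardware.

```python
from joblib import effective_n_jobs, parallel_backend

# Number of workers joblib would use for each setting on this machine
for n_jobs in [None, 1, 2, -1]:
    print(f"n_jobs={n_jobs} -> {effective_n_jobs(n_jobs)} worker(s)")

# Inside a parallel_backend context, n_jobs=None picks up the
# context's n_jobs instead of falling back to 1
with parallel_backend("threading", n_jobs=2):
    print(f"n_jobs=None inside context -> {effective_n_jobs(None)} worker(s)")
```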

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
import time
import numpy as np

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values
n_jobs_values = [1, 2, 4, -1]
execution_times = []
mse_scores = []

for n_jobs in n_jobs_values:
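    # perf_counter is a monotonic clock intended for measuring elapsed time
    start_time = time.perf_counter()

    br = BaggingRegressor(n_estimators=100, n_jobs=n_jobs, random_state=42)
    br.fit(X_train, y_train)

    y_pred = br.predict(X_test)

    # The measured time covers both fitting and prediction
    execution_time = time.perf_counter() - start_time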
    mse = mean_squared_error(y_test, y_pred)

    execution_times.append(execution_time)
    mse_scores.append(mse)

    print(f"n_jobs={n_jobs}, Execution Time: {execution_time:.2f}s, MSE: {mse:.4f}")

# Pick the n_jobs value with the shortest execution time
best_n_jobs = n_jobs_values[np.argmin(execution_times)]
print(f"\nBest n_jobs value: {best_n_jobs}")

Running the example gives an output like:

n_jobs=1, Execution Time: 11.93s, MSE: 4793.0439
n_jobs=2, Execution Time: 8.44s, MSE: 4793.0439
n_jobs=4, Execution Time: 6.01s, MSE: 4793.0439
n_jobs=-1, Execution Time: 5.01s, MSE: 4793.0439

Best n_jobs value: -1
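
To put the timings in perspective, the short sketch below computes the speedup of each run relative to n_jobs=1, using the numbers from the sample output above. Parallel efficiency (speedup divided by worker count) is only computed where the worker count is explicit, since the number of cores behind n_jobs=-1 is not stated.

```python
# Timings copied from the sample output above (seconds)
times = {1: 11.93, 2: 8.44, 4: 6.01, -1: 5.01}

baseline = times[1]
speedups = {n_jobs: baseline / t for n_jobs, t in times.items()}

for n_jobs, speedup in speedups.items():
    line = f"n_jobs={n_jobs}: speedup {speedup:.2f}x"
    # Efficiency is only meaningful when the worker count is known
    if n_jobs > 0:
        line += f", efficiency {speedup / n_jobs:.0%}"
    print(line)
```

In this sample run, going from 2 to 4 workers yields well under twice the speedup, which is consistent with parallel overhead eating into the gains. Your own numbers will differ.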

The key steps in this example are:

  1. Generate a synthetic regression dataset
  2. Split the data into train and test sets
  3. Train BaggingRegressor models with different n_jobs values
  4. Measure execution time and calculate mean squared error for each model
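  5. Compare execution times to find the fastest n_jobs value; the MSE is the same in every run because, with a fixed random_state, n_jobs changes how the work is scheduled rather than the fitted model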
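
The identical MSE values in the output suggest that n_jobs affects speed only. A quick way to confirm this on your setup is sketched below; the smaller dataset and estimator count are assumptions chosen so the check runs fast, not values from the example above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Small dataset so the check runs quickly
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

predictions = {}
for n_jobs in [1, 2]:
    model = BaggingRegressor(n_estimators=10, n_jobs=n_jobs, random_state=0)
    model.fit(X, y)
    predictions[n_jobs] = model.predict(X)

# allclose rather than exact equality: parallel prediction may
# sum the estimators' outputs in a different order
same = np.allclose(predictions[1], predictions[2])
print(f"Predictions match across n_jobs values: {same}")
```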

Some tips and heuristics for setting n_jobs:

Issues to consider:
