The n_jobs parameter in scikit-learn's BaggingClassifier controls the number of jobs to run in parallel for both the fit and predict methods.
BaggingClassifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions to form a final prediction. This technique helps reduce overfitting and improves the stability and accuracy of machine learning algorithms.
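To make the aggregation step concrete, here is a toy sketch of hard majority voting over made-up predictions (BaggingClassifier actually averages predicted class probabilities when the base estimator supports them; this is only an illustration of the idea):
import numpy as np
# Hypothetical predictions from three base classifiers for four samples
votes = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 1, 0]])
# Majority vote across the classifiers (axis 0) is the ensemble prediction
majority = (votes.mean(axis=0) > 0.5).astype(int)
print(majority)  # [0 1 1 0]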
The n_jobs parameter determines how many processors are used to fit the ensemble and to make predictions. A value of -1 means using all processors, while a value of 1 means using a single processor.
The default value for n_jobs is None, which means using one processor. Common values include -1 (all processors), 1 (single processor), or specific numbers such as 2, 4, or 8, depending on the available hardware.
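Because scikit-learn delegates parallelism to joblib, you can check what a given n_jobs value resolves to on your machine. A minimal sketch using joblib's cpu_count and effective_n_jobs helpers:
from joblib import cpu_count, effective_n_jobs
print(cpu_count())             # number of CPUs joblib can see
print(effective_n_jobs(-1))    # what n_jobs=-1 resolves to (all CPUs)
print(effective_n_jobs(None))  # what the default None resolves to (1)
The complete example below benchmarks BaggingClassifier training with several n_jobs values: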
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different n_jobs values
n_jobs_values = [1, 2, 4, -1]
accuracies = []
training_times = []
for n_jobs in n_jobs_values:
    start_time = time.time()
    bc = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10,
                           n_jobs=n_jobs, random_state=42)
    bc.fit(X_train, y_train)
    training_time = time.time() - start_time
    y_pred = bc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    training_times.append(training_time)
    print(f"n_jobs={n_jobs}, Accuracy: {accuracy:.3f}, Training time: {training_time:.2f} seconds")
Running the example gives an output like:
n_jobs=1, Accuracy: 0.896, Training time: 1.30 seconds
n_jobs=2, Accuracy: 0.896, Training time: 1.60 seconds
n_jobs=4, Accuracy: 0.896, Training time: 1.67 seconds
n_jobs=-1, Accuracy: 0.896, Training time: 2.12 seconds
Note that in this particular run, parallelization did not speed up training: with only 10 estimators on a modest dataset, the overhead of distributing work across processes outweighs the gains (see the issues below).
The key steps in this example are:
- Generate a synthetic classification dataset
- Split the data into train and test sets
- Train BaggingClassifier models with different n_jobs values
- Measure and compare both accuracy and training time for each model (a sketch after this list also times prediction)
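Because n_jobs also parallelizes predict, you can change it on an already-fitted model without retraining. A small follow-on sketch, reusing the bc, X_test, and time names from the example above:
# Re-time only the prediction step with different n_jobs settings
for n_jobs in [1, -1]:
    bc.set_params(n_jobs=n_jobs)
    start_time = time.time()
    bc.predict(X_test)
    print(f"n_jobs={n_jobs}, Prediction time: {time.time() - start_time:.2f} seconds")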
Tips and heuristics for setting n_jobs:
- Use -1 to utilize all available processors for maximum parallelization (negative values below -1 leave cores free; see the sketch after this list)
- Start with a lower value and increase gradually to find the optimal setting for your hardware
- Consider the trade-off between training speed and resource consumption
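On that joblib convention: for values below -1, (n_cpus + 1 + n_jobs) processors are used, so n_jobs=-2 means all processors but one. A minimal sketch that keeps a core available for other work, reusing the imports and data from the example above:
# n_jobs=-2 means "all CPUs but one" under joblib's convention
bc = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10,
                       n_jobs=-2, random_state=42)
bc.fit(X_train, y_train)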
Issues to consider:
- The optimal n_jobs value depends on your hardware capabilities and workload
- Using too many jobs may lead to slower performance due to overhead in task management
- For small datasets, the overhead of parallelization might outweigh the benefits
- Be mindful of system resources when running multiple parallel jobs, especially in shared environments (see the sketch after this list)
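For the shared-environment case, one option is joblib's parallel_backend context manager, which scikit-learn honors. A sketch assuming the training data from the example above:
from joblib import parallel_backend
# Estimators that leave n_jobs unset (None) pick up the worker count
# from the surrounding joblib context
with parallel_backend('loky', n_jobs=2):
    bc = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10,
                           random_state=42)
    bc.fit(X_train, y_train)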