The n_jobs parameter in scikit-learn's StackingClassifier controls the number of parallel jobs to run for both fitting and predicting.
StackingClassifier is an ensemble method that combines multiple base classifiers by training a meta-classifier on their predictions. It can significantly improve model performance by leveraging the strengths of different algorithms.
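To make the mechanism concrete, here is a simplified sketch of the idea behind stacking, using out-of-fold predictions as meta-features (the real StackingClassifier also refits the base estimators on the full training set and handles details such as probability-column selection internally):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
import numpy as np
X, y = make_classification(n_samples=500, random_state=0)
# Out-of-fold probability predictions from each base model serve as
# meta-features, so the meta-classifier never sees a base model's
# predictions on the data that model was trained on
rf_probs = cross_val_predict(RandomForestClassifier(random_state=0),
                             X, y, cv=5, method='predict_proba')
lr_probs = cross_val_predict(LogisticRegression(max_iter=1000),
                             X, y, cv=5, method='predict_proba')
meta_X = np.hstack([rf_probs, lr_probs])
# The final estimator learns how to weight and combine the base outputs
final_estimator = LogisticRegression().fit(meta_X, y)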
The n_jobs parameter determines how many CPU cores are used for parallel processing. Setting it to a value greater than 1 can speed up training and prediction, especially for large datasets or complex models.
By default, n_jobs is set to None, which means a single core is used (unless a different default is set through a joblib parallel backend context). Common values include -1 (use all available cores), 2, 4, or 8, depending on the system's capabilities and the task at hand.
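To check how many cores -1 would actually use on a given machine, you can ask the standard library or joblib directly:
import os
import joblib
print(os.cpu_count())      # logical cores reported by the OS
print(joblib.cpu_count())  # core count joblib uses to resolve n_jobs=-1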
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10,
                           n_redundant=0, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define base estimators
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svm', SVC(kernel='rbf', probability=True, random_state=42))
]
# Define n_jobs values to test
n_jobs_values = [1, 2, 4, -1]
for n_jobs in n_jobs_values:
    start_time = time.perf_counter()
    # Create and train StackingClassifier
    stacking_clf = StackingClassifier(
        estimators=estimators,
        final_estimator=LogisticRegression(),
        n_jobs=n_jobs)
    stacking_clf.fit(X_train, y_train)
    # Make predictions and calculate accuracy
    y_pred = stacking_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"n_jobs={n_jobs}, Accuracy: {accuracy:.3f}, Execution Time: {execution_time:.2f} seconds")
Running the example gives an output like:
n_jobs=1, Accuracy: 0.875, Execution Time: 1.78 seconds
n_jobs=2, Accuracy: 0.875, Execution Time: 2.03 seconds
n_jobs=4, Accuracy: 0.875, Execution Time: 2.71 seconds
n_jobs=-1, Accuracy: 0.875, Execution Time: 2.79 seconds
Note that on this small dataset the larger n_jobs values are actually slower: the overhead of spawning and coordinating worker processes outweighs any parallel speedup, which is exactly the small-dataset caveat discussed in the tips below.
Key steps in this example:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Define base estimators (RandomForestClassifier and SVC)
- Create StackingClassifier with different n_jobs values
- Train the model and measure execution time
- Evaluate accuracy on the test set
Tips and heuristics for setting n_jobs:
- Start with -1 to use all available cores, then adjust based on system performance
- For small datasets or simple models, using multiple cores may not provide significant speedup
- Consider the trade-off between execution time and system resource usage
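Related to these tips: because n_jobs=None defers to joblib, you can also control parallelism from the outside with a joblib backend context instead of hard-coding it in the model. A minimal sketch, reusing the estimators list and training split from the example above:
from joblib import parallel_backend
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
# Leaving n_jobs=None lets the active joblib backend decide
stacking_clf = StackingClassifier(
    estimators=estimators,  # base estimators defined earlier
    final_estimator=LogisticRegression())
with parallel_backend('loky', n_jobs=2):  # cap parallelism at two workers
    stacking_clf.fit(X_train, y_train)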
Issues to consider:
- Using too many cores can lead to increased memory usage
- The optimal n_jobs value depends on your specific hardware and the complexity of your model
- Some operations may not benefit from parallelization, so always benchmark to ensure performance gains