The n_jobs parameter in scikit-learn's StackingClassifier controls the number of parallel jobs to run for both fitting and predicting.
StackingClassifier is an ensemble method that combines multiple base classifiers by training a meta-classifier on their predictions. It can significantly improve model performance by leveraging the strengths of different algorithms.
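To make the mechanism concrete, here is a simplified sketch of the idea behind stacking, using out-of-fold predictions as meta-features (the real StackingClassifier also refits the base estimators on the full training set and handles details such as probability-column selection internally):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
import numpy as np
X, y = make_classification(n_samples=500, random_state=0)
# Out-of-fold probability predictions from each base model serve as
# meta-features, so the meta-classifier never sees a base model's
# predictions on the data that model was trained on
rf_probs = cross_val_predict(RandomForestClassifier(random_state=0),
                             X, y, cv=5, method='predict_proba')
lr_probs = cross_val_predict(LogisticRegression(max_iter=1000),
                             X, y, cv=5, method='predict_proba')
meta_X = np.hstack([rf_probs, lr_probs])
# The final estimator learns how to weight and combine the base outputs
final_estimator = LogisticRegression().fit(meta_X, y)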
The n_jobs parameter determines how many CPU cores are used for parallel processing. Setting it to a value greater than 1 can speed up training and prediction, especially for large datasets or complex models.
By default, n_jobs is set to None, which means a single core is used (unless a different default is set through a joblib parallel backend context). Common values include -1 (use all available cores), 2, 4, or 8, depending on the system's capabilities and the task at hand.
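To check how many cores -1 would actually use on a given machine, you can ask the standard library or joblib directly:
import os
import joblib
print(os.cpu_count())      # logical cores reported by the OS
print(joblib.cpu_count())  # core count joblib uses to resolve n_jobs=-1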
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10,
                           n_redundant=0, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define base estimators
estimators = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svm', SVC(kernel='rbf', probability=True, random_state=42))
]
# Define n_jobs values to test
n_jobs_values = [1, 2, 4, -1]
for n_jobs in n_jobs_values:
    start_time = time.perf_counter()
    # Create and train StackingClassifier
    stacking_clf = StackingClassifier(
        estimators=estimators,
        final_estimator=LogisticRegression(),
        n_jobs=n_jobs)
    stacking_clf.fit(X_train, y_train)
    # Make predictions and calculate accuracy
    y_pred = stacking_clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    end_time = time.perf_counter()
    execution_time = end_time - start_time
    print(f"n_jobs={n_jobs}, Accuracy: {accuracy:.3f}, Execution Time: {execution_time:.2f} seconds")
Running the example gives an output like:
n_jobs=1, Accuracy: 0.875, Execution Time: 1.78 seconds
n_jobs=2, Accuracy: 0.875, Execution Time: 2.03 seconds
n_jobs=4, Accuracy: 0.875, Execution Time: 2.71 seconds
n_jobs=-1, Accuracy: 0.875, Execution Time: 2.79 seconds
Note that on this small dataset the larger n_jobs values are actually slower: the overhead of spawning and coordinating worker processes outweighs any parallel speedup, which is exactly the small-dataset caveat discussed in the tips below.
Key steps in this example:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Define base estimators (RandomForestClassifier and SVC)
- Create StackingClassifier with different n_jobs values
- Train the model and measure execution time
- Evaluate accuracy on the test set
Tips and heuristics for setting n_jobs:
- Start with -1 to use all available cores, then adjust based on system performance
- For small datasets or simple models, using multiple cores may not provide significant speedup
- Consider the trade-off between execution time and system resource usage
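Related to these tips: because n_jobs=None defers to joblib, you can also control parallelism from the outside with a joblib backend context instead of hard-coding it in the model. A minimal sketch, reusing the estimators list and training split from the example above:
from joblib import parallel_backend
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
# Leaving n_jobs=None lets the active joblib backend decide
stacking_clf = StackingClassifier(
    estimators=estimators,  # base estimators defined earlier
    final_estimator=LogisticRegression())
with parallel_backend('loky', n_jobs=2):  # cap parallelism at two workers
    stacking_clf.fit(X_train, y_train)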
Issues to consider:
- Using too many cores can lead to increased memory usage
- The optimal n_jobs value depends on your specific hardware and the complexity of your model
- Some operations may not benefit from parallelization, so always benchmark to ensure performance gains