The n_jobs parameter in scikit-learn's ExtraTreesClassifier controls the number of parallel jobs to run for both fitting and prediction.
Extra Trees Classifier is an ensemble method that builds multiple decision trees and combines their predictions. The n_jobs parameter determines how many processors are used to build trees in parallel.
Setting n_jobs to a value greater than 1 can significantly speed up training and prediction times, especially for large datasets or when building many trees. However, parallelization adds overhead, so it does not always lead to faster execution.
The default value for n_jobs is None, which means 1 processor unless the estimator is run inside a joblib parallel_backend context. Setting it to -1 uses all available processors.
Common values for n_jobs include 1 (no parallelism), -1 (all processors), or a specific number based on available CPU cores (e.g., 2, 4, or 8).
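As a quick sanity check of what these values resolve to on your machine, joblib (installed alongside scikit-learn) provides an effective_n_jobs helper that maps an n_jobs setting to an actual worker count. A minimal sketch:

```python
from joblib import effective_n_jobs

# Positive values pass through unchanged; -1 resolves to all CPUs.
for n_jobs in [1, 2, -1]:
    print(f"n_jobs={n_jobs} -> {effective_n_jobs(n_jobs)} workers")
```

The exact counts printed depend on how many CPU cores your machine exposes.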
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values
n_jobs_values = [1, 2, 4, -1]
results = []
for n_jobs in n_jobs_values:
    start_time = time.time()
    etc = ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=n_jobs)
    etc.fit(X_train, y_train)
    train_time = time.time() - start_time
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results.append((n_jobs, train_time, accuracy))
    print(f"n_jobs={n_jobs}, Training Time: {train_time:.2f}s, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
n_jobs=1, Training Time: 0.58s, Accuracy: 0.939
n_jobs=2, Training Time: 0.36s, Accuracy: 0.939
n_jobs=4, Training Time: 0.23s, Accuracy: 0.939
n_jobs=-1, Training Time: 0.20s, Accuracy: 0.939
The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Train ExtraTreesClassifier models with different n_jobs values
- Measure training time and accuracy for each model
- Compare the results to see the effect of parallelization
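Because n_jobs also applies at prediction time, the same kind of comparison can be run for predict. The sketch below reuses the synthetic-data setup from the example above and uses set_params to change n_jobs on an already-fitted model (the predictions themselves are unaffected, only the timing changes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
import time

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

# Fit once; n_jobs can be changed afterwards for prediction
etc = ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=1)
etc.fit(X, y)

for n_jobs in [1, -1]:
    etc.set_params(n_jobs=n_jobs)
    start = time.time()
    etc.predict(X)
    print(f"n_jobs={n_jobs}, Prediction Time: {time.time() - start:.2f}s")
```

As with training, the prediction-time gain from parallelism is most visible with many trees and many samples.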
Some tips and heuristics for setting n_jobs:
- Start with the default value (None) and experiment with increasing parallelization
- Use -1 to utilize all available processors, but be aware of system resource constraints
- For smaller datasets or fewer trees, the overhead of parallelization may outweigh the benefits
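One practical middle ground, if you want to leave headroom for other work, is a negative value other than -1. Under joblib's convention, a negative n_jobs resolves to n_cpus + 1 + n_jobs workers, so -2 means all CPUs but one. A small sketch of that rule:

```python
import os

# joblib's rule for negative values: workers = n_cpus + 1 + n_jobs,
# so -1 -> all CPUs and -2 -> all CPUs but one.
total = os.cpu_count()
for n_jobs in [-1, -2]:
    resolved = total + 1 + n_jobs
    print(f"n_jobs={n_jobs} -> {resolved} of {total} CPUs")
```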
Issues to consider:
- The optimal n_jobs value depends on your hardware and the size of your dataset
- Increasing n_jobs may not always lead to faster execution due to communication overhead
- Setting n_jobs too high can slow down other processes on your system
- Memory usage increases with the number of parallel jobs
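If memory use or CPU contention is a concern, one way to cap parallelism without editing each estimator is joblib's parallel_backend context manager: estimators left at the default n_jobs=None inherit the context's worker limit. A minimal sketch, assuming the same synthetic-data setup as above:

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Inside this context, n_jobs=None estimators use at most 2 workers
with parallel_backend('loky', n_jobs=2):
    etc = ExtraTreesClassifier(n_estimators=50, random_state=42)
    etc.fit(X, y)

print(f"Training accuracy: {etc.score(X, y):.3f}")
```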