The n_jobs parameter in scikit-learn's ExtraTreesClassifier controls the number of parallel jobs to run for both fitting and prediction.
Extra Trees Classifier is an ensemble method that builds multiple decision trees and combines their predictions. The n_jobs parameter determines how many processors are used to build trees in parallel.
Setting n_jobs to a value greater than 1 can significantly speed up training and prediction times, especially for large datasets or when building many trees. However, parallelization adds overhead, so it does not always lead to faster execution.
The default value for n_jobs is None, which means 1 processor unless the estimator is run inside a joblib parallel_backend context. Setting it to -1 uses all available processors.
Common values for n_jobs include 1 (no parallelism), -1 (all processors), or a specific number based on available CPU cores (e.g., 2, 4, or 8).
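As a quick sanity check of what these values resolve to on your machine, joblib (installed alongside scikit-learn) provides an effective_n_jobs helper that maps an n_jobs setting to an actual worker count. A minimal sketch:

```python
from joblib import effective_n_jobs

# Positive values pass through unchanged; -1 resolves to all CPUs.
for n_jobs in [1, 2, -1]:
    print(f"n_jobs={n_jobs} -> {effective_n_jobs(n_jobs)} workers")
```

The exact counts printed depend on how many CPU cores your machine exposes.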
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values
n_jobs_values = [1, 2, 4, -1]
results = []
for n_jobs in n_jobs_values:
    start_time = time.time()
    etc = ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=n_jobs)
    etc.fit(X_train, y_train)
    train_time = time.time() - start_time
    y_pred = etc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results.append((n_jobs, train_time, accuracy))
    print(f"n_jobs={n_jobs}, Training Time: {train_time:.2f}s, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
n_jobs=1, Training Time: 0.58s, Accuracy: 0.939
n_jobs=2, Training Time: 0.36s, Accuracy: 0.939
n_jobs=4, Training Time: 0.23s, Accuracy: 0.939
n_jobs=-1, Training Time: 0.20s, Accuracy: 0.939
The key steps in this example are:
- Generate a synthetic classification dataset with informative and redundant features
- Split the data into train and test sets
- Train ExtraTreesClassifier models with different n_jobs values
- Measure training time and accuracy for each model
- Compare the results to see the effect of parallelization
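Because n_jobs also applies at prediction time, the same kind of comparison can be run for predict. The sketch below reuses the synthetic-data setup from the example above and uses set_params to change n_jobs on an already-fitted model (the predictions themselves are unaffected, only the timing changes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
import time

X, y = make_classification(n_samples=10000, n_features=20, random_state=42)

# Fit once; n_jobs can be changed afterwards for prediction
etc = ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=1)
etc.fit(X, y)

for n_jobs in [1, -1]:
    etc.set_params(n_jobs=n_jobs)
    start = time.time()
    etc.predict(X)
    print(f"n_jobs={n_jobs}, Prediction Time: {time.time() - start:.2f}s")
```

As with training, the prediction-time gain from parallelism is most visible with many trees and many samples.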
Some tips and heuristics for setting n_jobs:
- Start with the default value (None) and experiment with increasing parallelization
- Use -1 to utilize all available processors, but be aware of system resource constraints
- For smaller datasets or fewer trees, the overhead of parallelization may outweigh the benefits
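One practical middle ground, if you want to leave headroom for other work, is a negative value other than -1. Under joblib's convention, a negative n_jobs resolves to n_cpus + 1 + n_jobs workers, so -2 means all CPUs but one. A small sketch of that rule:

```python
import os

# joblib's rule for negative values: workers = n_cpus + 1 + n_jobs,
# so -1 -> all CPUs and -2 -> all CPUs but one.
total = os.cpu_count()
for n_jobs in [-1, -2]:
    resolved = total + 1 + n_jobs
    print(f"n_jobs={n_jobs} -> {resolved} of {total} CPUs")
```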
Issues to consider:
- The optimal n_jobs value depends on your hardware and the size of your dataset
- Increasing n_jobs may not always lead to faster execution due to communication overhead
- Setting n_jobs too high can slow down other processes on your system
- Memory usage increases with the number of parallel jobs
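If memory use or CPU contention is a concern, one way to cap parallelism without editing each estimator is joblib's parallel_backend context manager: estimators left at the default n_jobs=None inherit the context's worker limit. A minimal sketch, assuming the same synthetic-data setup as above:

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Inside this context, n_jobs=None estimators use at most 2 workers
with parallel_backend('loky', n_jobs=2):
    etc = ExtraTreesClassifier(n_estimators=50, random_state=42)
    etc.fit(X, y)

print(f"Training accuracy: {etc.score(X, y):.3f}")
```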