Configure ExtraTreesRegressor "n_jobs" Parameter

The n_jobs parameter in scikit-learn’s ExtraTreesRegressor controls the number of parallel jobs to run for both fitting and predicting.

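ExtraTreesRegressor is an ensemble method that fits a number of randomized decision trees and averages their predictions to improve predictive accuracy and control over-fitting. By default (bootstrap=False) each tree is trained on the whole dataset, and the randomization comes from split thresholds being drawn at random for the candidate features.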

The n_jobs parameter determines how many processors are used to fit and predict trees in parallel. A value of -1 uses all available processors, while a positive integer specifies the exact number of processors to use.

By default, n_jobs is set to None, which means a single job unless the call runs inside a joblib.parallel_backend context. Common values include -1 for all processors, or specific numbers like 2, 4, or 8, depending on the system’s capabilities.
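As a rough illustration of how these values resolve to a worker count, the sketch below uses the effective_n_jobs and cpu_count helpers from joblib, the library scikit-learn relies on for parallelism. The printed numbers depend on the machine, so no expected output is shown.

```python
from joblib import cpu_count, effective_n_jobs

# Number of CPUs joblib detects on this machine
detected_cpus = cpu_count()

# None resolves to 1 worker outside of a joblib.parallel_backend context
default_workers = effective_n_jobs(None)

# -1 resolves to every CPU joblib detects
all_workers = effective_n_jobs(-1)

print(f"CPUs detected: {detected_cpus}")
print(f"n_jobs=None -> {default_workers} worker(s)")
print(f"n_jobs=-1   -> {all_workers} worker(s)")
```

Checking the resolved count this way can help when choosing between -1 and an explicit number on shared or containerized machines.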

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values
n_jobs_values = [1, 2, 4, -1]
mse_scores = []
training_times = []

for n_jobs in n_jobs_values:
    # perf_counter is a monotonic clock intended for measuring elapsed time
    start_time = time.perf_counter()
    etr = ExtraTreesRegressor(n_estimators=100, random_state=42, n_jobs=n_jobs)
    etr.fit(X_train, y_train)
    training_time = time.perf_counter() - start_time

    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)

    mse_scores.append(mse)
    training_times.append(training_time)

    print(f"n_jobs={n_jobs}, MSE: {mse:.4f}, Training Time: {training_time:.2f} seconds")

Running the example gives an output like:

n_jobs=1, MSE: 3873.3552, Training Time: 3.24 seconds
n_jobs=2, MSE: 3873.3552, Training Time: 1.76 seconds
n_jobs=4, MSE: 3873.3552, Training Time: 1.17 seconds
n_jobs=-1, MSE: 3873.3552, Training Time: 0.85 seconds
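
The MSE is the same in every row because n_jobs only changes how the work is scheduled, not which trees are built, as long as random_state is fixed. A minimal sketch that checks this on a smaller dataset follows; the dataset and ensemble sizes are arbitrary choices made to keep the run short, and np.allclose is used rather than exact equality to allow for tiny floating-point differences in how the per-tree predictions are summed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# Small illustrative dataset (sizes are assumptions, not from the example above)
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Same seed, different n_jobs
serial = ExtraTreesRegressor(n_estimators=20, random_state=0, n_jobs=1)
parallel = ExtraTreesRegressor(n_estimators=20, random_state=0, n_jobs=2)

pred_serial = serial.fit(X_train, y_train).predict(X_test)
pred_parallel = parallel.fit(X_train, y_train).predict(X_test)

# True when both models produce matching predictions
predictions_match = np.allclose(pred_serial, pred_parallel)
print(f"Predictions match: {predictions_match}")
```

The timings, by contrast, vary from run to run and from machine to machine.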

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Train ExtraTreesRegressor models with different n_jobs values
Measure training time for each model
Evaluate the mean squared error of each model on the test set
Compare performance and training times across different n_jobs settings

Some tips and heuristics for setting n_jobs:

Use -1 to utilize all available processors for maximum parallelization
For large datasets, increasing n_jobs can significantly reduce training time
Consider system resources and other running processes when setting n_jobs
Experiment with different values to find the optimal balance between speed and resource usage
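
One way to leave headroom for other processes is a negative value below -1: joblib interprets n_jobs=-2 as all CPUs but one. The sketch below is a hedged illustration of that convention; the n_estimators and random_state values are simply carried over from the example above.

```python
from joblib import cpu_count, effective_n_jobs
from sklearn.ensemble import ExtraTreesRegressor

# CPUs joblib detects, and the worker count that n_jobs=-2 resolves to
total_cpus = cpu_count()
workers_with_one_free = effective_n_jobs(-2)

# Configure the model to leave one CPU free for other processes
model = ExtraTreesRegressor(n_estimators=100, random_state=42, n_jobs=-2)

print(f"{workers_with_one_free} of {total_cpus} CPU(s) would be used")
```

On a single-CPU machine this still resolves to one worker, so the setting is safe to use as a default.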

Issues to consider:

Using more jobs can increase memory usage, which may be a limitation on memory-constrained systems
The speedup may not be linear due to overhead in parallelization
For small datasets, the overhead of parallelization might outweigh the benefits
The optimal n_jobs value depends on your specific hardware and workload
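
To see whether parallelization overhead matters for a particular workload, a quick timing check on a deliberately small dataset can help. This is a minimal sketch under assumed sizes (200 samples, 50 trees); the result depends on the hardware, so no expected output is shown and the parallel run may or may not be slower.

```python
import time
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Deliberately small dataset (sizes are illustrative assumptions)
X_small, y_small = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=42)

# Record the fit time for a serial and a two-job run
fit_times = {}
for n_jobs in (1, 2):
    model = ExtraTreesRegressor(n_estimators=50, random_state=42, n_jobs=n_jobs)
    start = time.perf_counter()
    model.fit(X_small, y_small)
    fit_times[n_jobs] = time.perf_counter() - start

for n_jobs, seconds in fit_times.items():
    print(f"n_jobs={n_jobs}: {seconds:.4f} seconds")
```

Repeating the measurement a few times and comparing typical values gives a more reliable picture than a single run.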
