The n_jobs parameter in scikit-learn’s ExtraTreesRegressor controls the number of parallel jobs to run for both fitting and predicting.
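ExtraTreesRegressor is an ensemble method that fits a number of randomized decision trees and averages their predictions to improve predictive accuracy and control over-fitting. With the default bootstrap=False, each tree is trained on the whole training set, and the randomness comes from the randomly drawn split thresholds.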
The n_jobs parameter determines how many processors are used to fit and predict trees in parallel. A value of -1 uses all available processors, while a positive integer specifies the exact number of processors to use.
By default, n_jobs is set to None, which means a single processor is used unless the code runs inside a joblib.parallel_backend context. Common values include -1 for all processors, or specific numbers like 2, 4, or 8, depending on the system’s capabilities.
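To see how a given n_jobs value maps to an actual worker count on your machine, one option is to ask joblib, the library scikit-learn uses for parallelism. The sketch below is only an illustration: the list of candidate values is arbitrary, and the numbers printed depend on the hardware it runs on.

```python
from joblib import cpu_count, effective_n_jobs

# Resolve a few n_jobs settings to the worker count joblib would use.
# The candidate values below are illustrative, not recommendations.
candidate_values = [None, 1, 2, -1, -2]

for n_jobs in candidate_values:
    workers = effective_n_jobs(n_jobs)
    print(f"n_jobs={n_jobs!r} -> {workers} worker(s)")

print(f"CPUs detected by joblib: {cpu_count()}")
```

Negative values count back from the number of CPUs, so -2 typically leaves one processor free for other work.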
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values
n_jobs_values = [1, 2, 4, -1]
mse_scores = []
training_times = []

for n_jobs in n_jobs_values:
    # Time only the fit call; prediction is done after the timer stops
    start_time = time.time()
    etr = ExtraTreesRegressor(n_estimators=100, random_state=42, n_jobs=n_jobs)
    etr.fit(X_train, y_train)
    training_time = time.time() - start_time

    y_pred = etr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)

    mse_scores.append(mse)
    training_times.append(training_time)
    print(f"n_jobs={n_jobs}, MSE: {mse:.4f}, Training Time: {training_time:.2f} seconds")
Running the example gives an output like:
n_jobs=1, MSE: 3873.3552, Training Time: 3.24 seconds
n_jobs=2, MSE: 3873.3552, Training Time: 1.76 seconds
n_jobs=4, MSE: 3873.3552, Training Time: 1.17 seconds
n_jobs=-1, MSE: 3873.3552, Training Time: 0.85 seconds
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train ExtraTreesRegressor models with different n_jobs values
- Measure training time for each model
- Evaluate the mean squared error of each model on the test set
- Compare performance and training times across different n_jobs settings
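The identical MSE values in the output above are expected: with a fixed random_state, n_jobs changes how the work is scheduled, not which trees are built. A quick way to check this is to compare predictions from two models that differ only in n_jobs. This is a minimal sketch on a smaller, made-up dataset; the names X_demo and y_demo and the sizes are assumptions chosen to keep it fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Small synthetic dataset (sizes chosen only to keep the check quick)
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)
X_fit, y_fit = X_demo[:400], y_demo[:400]
X_holdout = X_demo[400:]

# Fit two models that differ only in n_jobs
preds = {}
for n_jobs in (1, 2):
    model = ExtraTreesRegressor(n_estimators=20, random_state=0, n_jobs=n_jobs)
    model.fit(X_fit, y_fit)
    preds[n_jobs] = model.predict(X_holdout)

# allclose rather than exact equality, to tolerate floating-point summation order
same = np.allclose(preds[1], preds[2])
print(f"Predictions match across n_jobs settings: {same}")
```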
Some tips and heuristics for setting n_jobs:
- Use -1 to utilize all available processors for maximum parallelization
- For large datasets, increasing n_jobs can significantly reduce training time
- Consider system resources and other running processes when setting n_jobs
- Experiment with different values to find the optimal balance between speed and resource usage
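Because n_jobs applies to prediction as well as fitting, it does not have to stay the same for the lifetime of a model. One possible pattern, sketched below under the assumption that training is the expensive step and later predictions are small, is to fit with all processors and then lower n_jobs with set_params. The dataset and variable names here are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Illustrative dataset; sizes are arbitrary
X_small, y_small = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Fit using all available processors
model = ExtraTreesRegressor(n_estimators=50, random_state=42, n_jobs=-1)
model.fit(X_small, y_small)

# Switch to a single job for later predictions; the fitted trees are unchanged
model.set_params(n_jobs=1)
single_row_pred = model.predict(X_small[:1])

print(f"n_jobs is now {model.n_jobs}, prediction shape: {single_row_pred.shape}")
```

Whether this helps depends on the batch size and hardware, so it is worth timing both settings before adopting it.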
Issues to consider:
- Using more jobs increases memory usage, which can be a limitation on some systems
- The speedup may not be linear due to overhead in parallelization
- For small datasets, the overhead of parallelization might outweigh the benefits
- The optimal n_jobs value depends on your specific hardware and workload
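The point about small datasets can be checked directly by timing a deliberately tiny fit with one job and with all processors. The sketch below makes no claim about which setting wins; the result depends on the machine, and the dataset size and tree count are assumptions picked to make overhead visible.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Deliberately tiny problem, where parallel overhead may dominate
X_tiny, y_tiny = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=42)

fit_times = {}
for n_jobs in (1, -1):
    model = ExtraTreesRegressor(n_estimators=10, random_state=42, n_jobs=n_jobs)
    start = time.perf_counter()
    model.fit(X_tiny, y_tiny)
    fit_times[n_jobs] = time.perf_counter() - start
    print(f"n_jobs={n_jobs}: fit took {fit_times[n_jobs]:.4f} seconds")
```

Repeating the measurement a few times gives a steadier picture, since a single timing of such a short fit is noisy.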