
Configure RandomForestRegressor "n_jobs" Parameter

The n_jobs parameter in scikit-learn’s RandomForestRegressor sets the number of jobs to run in parallel. Both fit and predict are parallelized over the trees in the forest.

Random Forest is an ensemble learning method that trains multiple decision trees independently. The n_jobs parameter determines how many of these trees can be trained in parallel, which can significantly speed up the training process on multi-core machines.
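Because each tree depends only on its own resampled data, the trees can be fitted concurrently. The sketch below illustrates the idea with joblib and plain decision trees. It is a simplified illustration, not scikit-learn's internal implementation, and the helper name fit_one_tree is hypothetical.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor


def fit_one_tree(X, y, seed):
    """Fit one decision tree on a bootstrap sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Draw row indices with replacement to form the bootstrap sample
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(random_state=seed)
    return tree.fit(X[idx], y[idx])


X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=0)

# The trees do not depend on each other, so they can be fitted by 2 workers at once
trees = Parallel(n_jobs=2, prefer="threads")(
    delayed(fit_one_tree)(X, y, seed) for seed in range(8)
)

# The ensemble prediction is the average of the individual tree predictions
y_hat = np.mean([tree.predict(X) for tree in trees], axis=0)
print(len(trees), y_hat.shape)
```

In practice RandomForestRegressor handles all of this itself; n_jobs is the only setting needed to control the number of workers.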

By default, n_jobs is set to None, which means 1 job (i.e., no parallelism) unless the code runs inside a joblib parallel_backend context. Setting n_jobs to -1 uses all available processors.
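A quick way to check these settings is to inspect the estimator and ask joblib how many CPUs it sees. This is a small sketch; the variable names are illustrative, and the reported CPU count depends on the machine.

```python
import joblib
from sklearn.ensemble import RandomForestRegressor

# The default estimator leaves n_jobs unset
default_rf = RandomForestRegressor()
print(default_rf.n_jobs)  # None

# n_jobs=-1 requests all processors that joblib can see
all_cores_rf = RandomForestRegressor(n_jobs=-1)
n_cpus = joblib.cpu_count()
print(f"n_jobs={all_cores_rf.n_jobs} will use up to {n_cpus} workers")
```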

In practice, setting n_jobs to -1 is a common choice to fully utilize the machine’s resources. However, the optimal value depends on how many cores are available, what else is running on the machine, and the size of the dataset. On small datasets the overhead of starting parallel workers can outweigh the gain.
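When the machine is shared, one option is to leave a core free. joblib interprets negative values below -1 as n_cpus + 1 + n_jobs, so n_jobs=-2 means all CPUs but one. The sketch below uses joblib.effective_n_jobs to show how a requested value resolves on the current machine; the list of values is only an example.

```python
import joblib
from sklearn.ensemble import RandomForestRegressor

n_cpus = joblib.cpu_count()

# Show how each requested n_jobs value resolves to an actual worker count
resolved = {}
for requested in [1, 2, -1, -2]:
    resolved[requested] = joblib.effective_n_jobs(requested)
    print(f"n_jobs={requested} -> {resolved[requested]} worker(s) on {n_cpus} CPU(s)")

# Leave one core free for other work on a shared machine
rf = RandomForestRegressor(n_estimators=100, n_jobs=-2, random_state=42)
```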

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values
n_jobs_values = [-1, 1, 2, 4]
train_times = []
mse_scores = []

for n in n_jobs_values:
    # perf_counter is a monotonic, high-resolution clock suited to timing code
    start = time.perf_counter()
    rf = RandomForestRegressor(n_estimators=100, n_jobs=n, random_state=42)
    rf.fit(X_train, y_train)
    end = time.perf_counter()
    train_times.append(end - start)

    y_pred = rf.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)

    print(f"n_jobs={n}, Train Time: {end - start:.2f}s, MSE: {mse:.3f}")

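Running the example gives output like the following. Training times depend on the machine; the MSE is the same in every row because random_state is fixed and n_jobs affects only speed, not the fitted model.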

n_jobs=-1, Train Time: 2.70s, MSE: 4810.929
n_jobs=1, Train Time: 11.83s, MSE: 4810.929
n_jobs=2, Train Time: 6.24s, MSE: 4810.929
n_jobs=4, Train Time: 3.88s, MSE: 4810.929

The key steps in this example are:

  1. Generate a synthetic regression dataset
  2. Split the data into train and test sets
  3. Train RandomForestRegressor models with different n_jobs values
  4. Measure the training time for each model
  5. Evaluate the mean squared error of each model on the test set
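The identical MSE values in the output can be checked directly: with a fixed random_state, models trained with different n_jobs values should make the same predictions. The sketch below uses a smaller dataset than the main example to keep it fast, and compares with np.allclose to allow for floating-point rounding. It also shows that n_jobs can be changed on an already fitted model with set_params, for instance to predict single-threaded.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Same seed, different degree of parallelism
serial = RandomForestRegressor(n_estimators=50, n_jobs=1, random_state=42).fit(X, y)
parallel = RandomForestRegressor(n_estimators=50, n_jobs=2, random_state=42).fit(X, y)

# Predictions should agree up to floating-point rounding
same_predictions = np.allclose(serial.predict(X), parallel.predict(X))
print(f"Predictions match: {same_predictions}")

# n_jobs is not tied to the fitted trees, so it can be changed before predicting
parallel.set_params(n_jobs=1)
single_threaded_pred = parallel.predict(X)
```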

Some tips and heuristics for setting n_jobs:

Issues to consider:


