The n_jobs parameter in scikit-learn's RandomForestClassifier controls the number of jobs to run in parallel for both fit and predict.
Random Forest is an ensemble learning method that trains multiple decision trees and combines their predictions to improve generalization performance.
The n_jobs parameter determines the number of CPU cores used for parallel processing. Setting it to -1 uses all available cores, while a positive integer specifies the exact number of jobs to run in parallel.
The default value for n_jobs is None, which means a single job (no parallelism); to use all cores, you must set it to -1 explicitly.
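You can confirm the default by inspecting a freshly constructed estimator. This minimal sketch just reads the parameter back with get_params:

```python
from sklearn.ensemble import RandomForestClassifier

# The default is n_jobs=None, which scikit-learn treats as a single job
rf_default = RandomForestClassifier()
print(rf_default.get_params()["n_jobs"])  # None

# Explicitly request all available cores
rf_all_cores = RandomForestClassifier(n_jobs=-1)
print(rf_all_cores.get_params()["n_jobs"])  # -1
```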
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different n_jobs values
n_jobs_values = [-1, 1, 2, 4]
accuracies = []
times = []
for n in n_jobs_values:
    start = time.time()
    rf = RandomForestClassifier(n_estimators=100, n_jobs=n, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    end = time.time()
    runtime = end - start
    accuracies.append(accuracy)
    times.append(runtime)
    print(f"n_jobs={n}, Accuracy: {accuracy:.3f}, Runtime: {runtime:.2f} seconds")
Running this example gives an output like:
n_jobs=-1, Accuracy: 0.931, Runtime: 0.64 seconds
n_jobs=1, Accuracy: 0.931, Runtime: 2.51 seconds
n_jobs=2, Accuracy: 0.931, Runtime: 1.34 seconds
n_jobs=4, Accuracy: 0.931, Runtime: 0.78 seconds
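Because n_jobs also applies to prediction, you can change it after fitting via set_params; subsequent predict calls will use the new value. A brief sketch (dataset sizes here are arbitrary):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Fit serially, then switch to parallel prediction without refitting
rf = RandomForestClassifier(n_estimators=100, n_jobs=1, random_state=42).fit(X, y)
rf.set_params(n_jobs=-1)  # only affects subsequent predict/predict_proba calls
start = time.time()
y_pred = rf.predict(X)
print(f"Parallel predict took {time.time() - start:.2f} seconds")
```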
The key steps in this example are:
- Generate a large synthetic binary classification dataset
- Split the data into train and test sets
- Train RandomForestClassifier models with different n_jobs values
- Evaluate the accuracy and runtime of each model
Some tips and heuristics for setting n_jobs:
- Use -1 to utilize all available CPU cores for maximum parallelization
- Positive integers specify the exact number of jobs to run in parallel
- Parallelization can significantly speed up training on multi-core machines
Issues to consider:
- The optimal setting depends on the number of cores and memory available
- Using all cores can slow down other processes running on the machine
- With many jobs and large datasets, memory limits may be hit, causing slowdowns
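One way to avoid starving other processes is to leave a core free. Joblib, which scikit-learn uses for parallelism, interprets negative values as n_cpus + 1 + n_jobs, so n_jobs=-2 means "all cores but one". A short sketch:

```python
import os
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# n_jobs=-2 resolves to n_cpus - 1, leaving one core for other processes
rf = RandomForestClassifier(n_estimators=100, n_jobs=-2, random_state=42)
rf.fit(X, y)
print(f"Trained with n_jobs=-2 on a machine with {os.cpu_count()} cores")
```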