The n_jobs parameter in scikit-learn's RandomForestClassifier controls the number of jobs to run in parallel for both fit and predict.
Random Forest is an ensemble learning method that trains multiple decision trees and combines their predictions to improve generalization performance.
The n_jobs parameter determines the number of CPU cores used for parallel processing. Setting it to -1 uses all available cores, while a positive integer specifies the exact number of jobs to run in parallel.
The default value for n_jobs is None, which means a single job (no parallelism); to use all cores, you must set it to -1 explicitly.
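You can confirm the default by inspecting a freshly constructed estimator. This minimal sketch just reads the parameter back with get_params:

```python
from sklearn.ensemble import RandomForestClassifier

# The default is n_jobs=None, which scikit-learn treats as a single job
rf_default = RandomForestClassifier()
print(rf_default.get_params()["n_jobs"])  # None

# Explicitly request all available cores
rf_all_cores = RandomForestClassifier(n_jobs=-1)
print(rf_all_cores.get_params()["n_jobs"])  # -1
```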
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different n_jobs values
n_jobs_values = [-1, 1, 2, 4]
accuracies = []
times = []
for n in n_jobs_values:
    start = time.time()
    rf = RandomForestClassifier(n_estimators=100, n_jobs=n, random_state=42)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    end = time.time()
    runtime = end - start
    accuracies.append(accuracy)
    times.append(runtime)
    print(f"n_jobs={n}, Accuracy: {accuracy:.3f}, Runtime: {runtime:.2f} seconds")
Running this example gives an output like:
n_jobs=-1, Accuracy: 0.931, Runtime: 0.64 seconds
n_jobs=1, Accuracy: 0.931, Runtime: 2.51 seconds
n_jobs=2, Accuracy: 0.931, Runtime: 1.34 seconds
n_jobs=4, Accuracy: 0.931, Runtime: 0.78 seconds
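Because n_jobs also applies to prediction, you can change it after fitting via set_params; subsequent predict calls will use the new value. A brief sketch (dataset sizes here are arbitrary):

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# Fit serially, then switch to parallel prediction without refitting
rf = RandomForestClassifier(n_estimators=100, n_jobs=1, random_state=42).fit(X, y)
rf.set_params(n_jobs=-1)  # only affects subsequent predict/predict_proba calls
start = time.time()
y_pred = rf.predict(X)
print(f"Parallel predict took {time.time() - start:.2f} seconds")
```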
The key steps in this example are:
- Generate a large synthetic binary classification dataset
- Split the data into train and test sets
- Train RandomForestClassifier models with different n_jobs values
- Evaluate the accuracy and runtime of each model
Some tips and heuristics for setting n_jobs:
- Use -1 to utilize all available CPU cores for maximum parallelization
- Positive integers specify the exact number of jobs to run in parallel
- Parallelization can significantly speed up training on multi-core machines
Issues to consider:
- The optimal setting depends on the number of cores and memory available
- Using all cores can slow down other processes running on the machine
- With many jobs and large datasets, memory limits may be hit, causing slowdowns
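One way to avoid starving other processes is to leave a core free. Joblib, which scikit-learn uses for parallelism, interprets negative values as n_cpus + 1 + n_jobs, so n_jobs=-2 means "all cores but one". A short sketch:

```python
import os
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# n_jobs=-2 resolves to n_cpus - 1, leaving one core for other processes
rf = RandomForestClassifier(n_estimators=100, n_jobs=-2, random_state=42)
rf.fit(X, y)
print(f"Trained with n_jobs=-2 on a machine with {os.cpu_count()} cores")
```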