The n_jobs parameter in scikit-learn’s LogisticRegression specifies the number of CPU cores to use during computation.
Logistic Regression is a linear model used for classification tasks, most commonly binary ones. It estimates the probability that a given input point belongs to a certain class.
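The n_jobs parameter can speed up training by using multiple CPU cores, but only when there is work to split up. scikit-learn’s documentation describes it as the number of cores used when parallelizing over classes in the one-vs-rest multiclass scheme, and notes that it is ignored by the liblinear solver, so a binary problem gains little from it. Setting n_jobs to -1 uses all available cores.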
The default value for n_jobs is None, which scikit-learn treats as 1 (no parallelism) unless the code runs inside a joblib.parallel_backend context. Common values include -1 (for all cores) and specific numbers chosen to match the system’s capabilities.
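To make these values concrete, the sketch below asks joblib (the library scikit-learn uses for parallelism) how many workers each n_jobs setting would resolve to on the current machine. The helper name describe_n_jobs is hypothetical and used only for illustration; cpu_count and effective_n_jobs are joblib functions, and the numbers printed will differ from system to system.

```python
from joblib import cpu_count, effective_n_jobs


def describe_n_jobs(settings=(None, 1, 2, -1)):
    """Map each n_jobs setting to the worker count joblib would use for it."""
    # describe_n_jobs is a hypothetical helper, not part of scikit-learn or joblib.
    return {setting: effective_n_jobs(setting) for setting in settings}


print(f"CPUs visible to joblib: {cpu_count()}")
for setting, workers in describe_n_jobs().items():
    print(f"n_jobs={setting!r} -> {workers} worker(s)")
```

Running a check like this before picking a value shows what -1 actually means on a given machine, which is useful on shared servers or inside containers where fewer cores may be available than expected.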
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different n_jobs values
n_jobs_values = [1, 2, -1]
accuracies = []
training_times = []
for n in n_jobs_values:
    lr = LogisticRegression(n_jobs=n, random_state=42, max_iter=200)
    start_time = time.time()
    lr.fit(X_train, y_train)
    training_time = time.time() - start_time
    y_pred = lr.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    training_times.append(training_time)
    print(f"n_jobs={n}, Accuracy: {accuracy:.3f}, Training Time: {training_time:.3f} seconds")
Running the example gives an output like:
n_jobs=1, Accuracy: 0.810, Training Time: 0.005 seconds
n_jobs=2, Accuracy: 0.810, Training Time: 0.720 seconds
n_jobs=-1, Accuracy: 0.810, Training Time: 0.917 seconds
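Accuracy is identical across settings, and training is actually slower with n_jobs=2 and n_jobs=-1. This is a small binary problem, so there is no per-class work to spread across cores, and the extra time is most likely the overhead of starting parallel workers.
The key steps in this example are: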
- Generate a synthetic binary classification dataset with informative features.
- Split the data into train and test sets.
- Train LogisticRegression models with different n_jobs values.
- Measure and compare the training time and accuracy of each model.
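A single timing of a fit that takes a few milliseconds is noisy, so the measurement step can be made more reliable by repeating the fit and taking the median. The following is a minimal sketch under that assumption; median_fit_time is a hypothetical helper name, and the model is built with default settings here, so substitute whichever n_jobs value is being compared.

```python
import time
from statistics import median

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def median_fit_time(model, X, y, repeats=5):
    """Fit `model` several times and return the median wall-clock time in seconds."""
    # median_fit_time is a hypothetical helper written for this illustration.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        model.fit(X, y)
        timings.append(time.perf_counter() - start)
    return median(timings)


X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=42)
elapsed = median_fit_time(LogisticRegression(max_iter=200), X, y)
print(f"Median fit time: {elapsed:.4f} seconds")
```

The median is used rather than the mean so that one slow run, for example the first fit that pays for worker startup, does not dominate the result.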
Some tips and heuristics for setting n_jobs:
- Using n_jobs=-1 maximizes parallelism by utilizing all available CPU cores.
- For systems with fewer cores, set n_jobs to a specific number to avoid overloading.
- Parallelism benefits larger datasets more due to higher computation requirements.
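Parallelism pays off when there are several independent fits to run. scikit-learn’s documentation has described LogisticRegression’s n_jobs as applying when parallelizing over classes in the one-vs-rest scheme, so the sketch below swaps in a different, explicit route to the same idea: wrapping the model in OneVsRestClassifier, which has its own n_jobs parameter. The four-class dataset, its size, and n_jobs=2 are assumptions chosen for illustration, not settings from the example above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Four classes, so one-vs-rest trains four independent binary models.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=15, n_classes=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_jobs here belongs to OneVsRestClassifier: the per-class fits are
# spread across up to two workers.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=500), n_jobs=2)
ovr.fit(X_train, y_train)

accuracy = accuracy_score(y_test, ovr.predict(X_test))
print(f"Per-class models: {len(ovr.estimators_)}, Accuracy: {accuracy:.3f}")
```

On a dataset this small the parallel version may still be slower than a serial one; the benefit is expected to appear as the number of classes and the cost of each fit grow.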
Issues to consider:
- The optimal n_jobs setting depends on the system’s CPU configuration and other running processes.
- Too many parallel jobs can lead to resource contention, affecting performance.
- Evaluate the trade-off between training speed and system resource usage.
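One simple way to limit contention is to leave some cores free for other processes. The sketch below shows that heuristic using only the standard library; pick_n_jobs and its reserve argument are hypothetical names introduced here, and reserving one core is an assumption rather than a rule.

```python
import os


def pick_n_jobs(reserve=1):
    """Return a worker count that leaves `reserve` cores free, but never less than 1."""
    # pick_n_jobs is a hypothetical helper; os.cpu_count() can return None,
    # in which case a single core is assumed.
    total = os.cpu_count() or 1
    return max(1, total - reserve)


print(f"Suggested n_jobs: {pick_n_jobs()}")
```

The returned value can be passed wherever an n_jobs argument is accepted, and reserve can be raised on machines that run other heavy workloads.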