Configure LogisticRegression "n_jobs" Parameter

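The n_jobs parameter in scikit-learn’s LogisticRegression sets the number of CPU cores used during fitting. According to the scikit-learn documentation, the work is parallelized over classes in the one-vs-rest scheme, and the parameter is ignored by the "liblinear" solver.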

Logistic Regression is a linear model used for classification, most commonly binary classification. It estimates the probability that a given input point belongs to a certain class.

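The n_jobs parameter can speed up training by spreading work across multiple CPU cores, but the gain is not guaranteed: it depends on there being enough independent work (for example, several per-class fits) to outweigh the cost of starting parallel workers. Setting n_jobs to -1 uses all available cores.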

The default value for n_jobs is None, which is treated as 1 (no parallelism) unless the code runs inside a joblib.parallel_backend context. Common values include -1 (for all cores) and specific numbers depending on the system’s capabilities.
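
As a rough illustration of how these values are interpreted, the sketch below resolves a few n_jobs settings to a worker count using joblib's effective_n_jobs. It assumes joblib (installed alongside scikit-learn) is the parallel backend and that no parallel_backend context is active; the variable names are illustrative, and the resolved counts depend on the machine.

```python
from joblib import effective_n_jobs
from sklearn.linear_model import LogisticRegression

# Values discussed above: None (the default), explicit counts, and -1.
requested_values = [None, 1, 2, -1]

# Map each requested setting to the worker count joblib would use
# outside of any parallel_backend context.
resolved = {requested: effective_n_jobs(requested) for requested in requested_values}

for requested, workers in resolved.items():
    # The constructor only stores the value; it can be read back via get_params().
    model = LogisticRegression(n_jobs=requested)
    stored = model.get_params()["n_jobs"]
    print(f"n_jobs={stored!r} resolves to {workers} worker(s)")
```

On a typical machine, -1 should resolve to the number of cores joblib detects, while None and 1 both resolve to a single worker.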

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values and record accuracy and fit time
n_jobs_values = [1, 2, -1]
accuracies = []
training_times = []

for n in n_jobs_values:
    lr = LogisticRegression(n_jobs=n, random_state=42, max_iter=200)
    # perf_counter is a monotonic, high-resolution clock suited to short intervals
    start_time = time.perf_counter()
    lr.fit(X_train, y_train)
    training_time = time.perf_counter() - start_time
    # Evaluate on the held-out test set
    y_pred = lr.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    training_times.append(training_time)
    print(f"n_jobs={n}, Accuracy: {accuracy:.3f}, Training Time: {training_time:.3f} seconds")

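Running the example gives output like the following. Exact timings vary by machine; in this run the parallel settings were slower than n_jobs=1, most likely because starting workers costs more than fitting such a small dataset: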

n_jobs=1, Accuracy: 0.810, Training Time: 0.005 seconds
n_jobs=2, Accuracy: 0.810, Training Time: 0.720 seconds
n_jobs=-1, Accuracy: 0.810, Training Time: 0.917 seconds

The key steps in this example are:

Generate a synthetic binary classification dataset with informative features.
Split the data into train and test sets.
Train LogisticRegression models with different n_jobs values.
Measure and compare the training time and accuracy of each model.

Some tips and heuristics for setting n_jobs:

Using n_jobs=-1 maximizes parallelism by utilizing all available CPU cores.
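On shared machines or systems with few cores, set n_jobs to a specific number below the core count to avoid overloading.
Parallelism pays off more on larger workloads, where computation outweighs the overhead of starting workers; on small datasets like the one above it can be slower.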
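
One way to apply the second tip is to derive n_jobs from the detected core count while leaving some cores free. The helper below is a hypothetical sketch, not part of scikit-learn: pick_n_jobs, reserve, and available are illustrative names, and only cpu_count comes from joblib.

```python
from joblib import cpu_count
from sklearn.linear_model import LogisticRegression


def pick_n_jobs(reserve=1, available=None):
    """Return a worker count that leaves `reserve` cores free (hypothetical helper).

    `available` overrides the detected core count, which is handy for testing.
    """
    total = cpu_count() if available is None else available
    # Never go below one worker, even on a single-core machine.
    return max(1, total - reserve)


# Leave one core free for the rest of the system.
n_jobs = pick_n_jobs(reserve=1)
model = LogisticRegression(n_jobs=n_jobs, max_iter=200)
print(f"Detected {cpu_count()} core(s); using n_jobs={n_jobs}")
```

Reserving a core is only a heuristic; the right amount of headroom depends on what else is running on the machine.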

Issues to consider:

The optimal n_jobs setting depends on the system’s CPU configuration and other running processes.
Too many parallel jobs can lead to resource contention, affecting performance.
Evaluate the trade-off between training speed and system resource usage.
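
Because a single timing can be skewed by other running processes, it may help to repeat each fit and compare medians. The following is a minimal sketch under that assumption; median_fit_time and repeats are illustrative names, and the dataset mirrors the synthetic one used in the example above.

```python
import time
from statistics import median

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def median_fit_time(n_jobs, X, y, repeats=3):
    """Return the median wall-clock time of `repeats` fits (hypothetical helper)."""
    timings = []
    for _ in range(repeats):
        model = LogisticRegression(n_jobs=n_jobs, max_iter=200)
        start = time.perf_counter()
        model.fit(X, y)
        timings.append(time.perf_counter() - start)
    return median(timings)


X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=15, random_state=42
)

# Compare a serial fit against a two-worker fit on the same data.
results = {n: median_fit_time(n, X, y) for n in (1, 2)}
for n, seconds in results.items():
    print(f"n_jobs={n}: median fit time {seconds:.3f} s")
```

If the medians are close, or the parallel setting is slower, the extra cores are probably better left to other work.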
