Configure SGDClassifier "n_jobs" Parameter

The n_jobs parameter in scikit-learn’s SGDClassifier sets the number of CPU cores used for the one-versus-all (OVA) computation when training on multi-class problems.

Stochastic Gradient Descent (SGD) is an optimization algorithm used for training various linear models. SGDClassifier implements a plain stochastic gradient descent learning routine that supports different loss functions and penalties for classification.

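Each SGD update depends on the weights produced by the previous one, so the updates themselves run sequentially. For multi-class problems, SGDClassifier fits one binary classifier per class in a one-versus-all (OVA) scheme, and n_jobs determines how many of these per-class fits run in parallel. A value of -1 uses all available cores, while a positive integer specifies the exact number of cores to use. For binary problems there is only one classifier to fit, so n_jobs has no effect.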
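As a minimal sketch of what n_jobs applies to, the snippet below fits SGDClassifier on a three-class and a two-class synthetic dataset and inspects coef_. The scikit-learn documentation describes n_jobs as the number of CPUs used for the one-versus-all (OVA) computation in multi-class problems, so the multi-class model holds one row of coefficients per class, while the binary model has a single row and nothing to distribute. The dataset sizes and variable names are illustrative assumptions, and the final comparison is printed rather than assumed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Illustrative three-class and two-class datasets (sizes are arbitrary)
X_multi, y_multi = make_classification(n_samples=2000, n_features=20, n_informative=10,
                                       n_classes=3, random_state=0)
X_bin, y_bin = make_classification(n_samples=2000, n_features=20, n_informative=10,
                                   n_classes=2, random_state=0)

# Multi-class: one binary (one-versus-all) classifier per class
serial = SGDClassifier(n_jobs=1, random_state=0).fit(X_multi, y_multi)
parallel = SGDClassifier(n_jobs=2, random_state=0).fit(X_multi, y_multi)
print("multi-class coef_ shape:", serial.coef_.shape)  # (3, 20)

# Binary: a single classifier, so there is nothing for n_jobs to split up
binary = SGDClassifier(n_jobs=2, random_state=0).fit(X_bin, y_bin)
print("binary coef_ shape:", binary.coef_.shape)  # (1, 20)

# Check whether the learned weights differ between n_jobs=1 and n_jobs=2
same_weights = np.allclose(serial.coef_, parallel.coef_)
print("same coefficients with n_jobs=1 and n_jobs=2:", same_weights)
```

Because the parallel work is split by class, a problem with three classes can keep at most three workers busy at once.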

The default value for n_jobs is None, which means 1 (no parallelization) unless the call runs inside a joblib.parallel_backend context. Common values include -1 (use all cores) or positive integers like 2, 4, or 8, depending on the available hardware.
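The scikit-learn documentation notes that n_jobs=None means 1 unless the code runs in a joblib.parallel_backend context. The sketch below shows one way to set the worker count from outside the estimator, assuming joblib's "threading" backend and an illustrative dataset; the variable names are not from the original example.

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Illustrative three-class dataset (sizes are arbitrary)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# n_jobs is left at its default (None); outside any context this means 1 job
clf_default = SGDClassifier(random_state=0).fit(X, y)

# Inside the context, the joblib calls made during fit use the context's n_jobs
with parallel_backend("threading", n_jobs=2):
    clf_context = SGDClassifier(random_state=0).fit(X, y)

# The estimator's own parameter stays None in both cases
print(clf_default.n_jobs, clf_context.n_jobs)
```

This keeps the parallelism setting in one place when several estimators are trained in the same script.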

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values
n_jobs_values = [None, 1, 2, 4, -1]
results = []

for n_jobs in n_jobs_values:
    # perf_counter() is a high-resolution clock suited to timing short runs
    start_time = time.perf_counter()
    sgd = SGDClassifier(n_jobs=n_jobs, random_state=42)
    sgd.fit(X_train, y_train)
    train_time = time.perf_counter() - start_time

    y_pred = sgd.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    results.append((n_jobs, train_time, accuracy))
    print(f"n_jobs={n_jobs}, Training Time: {train_time:.3f}s, Accuracy: {accuracy:.3f}")

Running the example gives output like the following; exact timings vary with hardware and from run to run:

n_jobs=None, Training Time: 0.149s, Accuracy: 0.677
n_jobs=1, Training Time: 0.151s, Accuracy: 0.677
n_jobs=2, Training Time: 0.137s, Accuracy: 0.677
n_jobs=4, Training Time: 0.092s, Accuracy: 0.677
n_jobs=-1, Training Time: 0.084s, Accuracy: 0.677
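Single timings of around a tenth of a second are easily distorted by background load, so differences this small are hard to interpret from one run. A hedged sketch of a steadier measurement is to repeat each fit several times and report the median; median_fit_time is a hypothetical helper introduced here, and the repeat count of 5 is an arbitrary assumption.

```python
import statistics
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Same synthetic dataset settings as the example above
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)


def median_fit_time(n_jobs, repeats=5):
    """Hypothetical helper: median wall-clock time of `repeats` fits."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        SGDClassifier(n_jobs=n_jobs, random_state=42).fit(X, y)
        times.append(time.perf_counter() - start)
    return statistics.median(times)


for n_jobs in [None, 1, 2, 4, -1]:
    print(f"n_jobs={n_jobs}, median fit time: {median_fit_time(n_jobs):.3f}s")
```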

The key steps in this example are:

Generate a synthetic multi-class classification dataset
Split the data into train and test sets
Train SGDClassifier models with different n_jobs values
Measure training time and accuracy for each model
Compare performance across different n_jobs configurations

Some tips and heuristics for setting n_jobs:

Use -1 to make all CPU cores available; because the work is split by class, at most one core per class does useful work
For large multi-class datasets, increasing n_jobs can noticeably reduce training time
On smaller datasets, the overhead of parallelization might outweigh the benefits
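One way to turn these heuristics into code is to cap n_jobs at both the number of classes and the number of available cores. The sketch below is only an illustration: suggest_n_jobs and its max_cores argument are hypothetical names, not part of scikit-learn, and it assumes that the parallel work is one one-versus-all fit per class.

```python
import os

import numpy as np


def suggest_n_jobs(y, max_cores=None):
    """Hypothetical helper: cap n_jobs at the number of classes and cores."""
    n_classes = len(np.unique(y))
    if n_classes <= 2:
        return 1  # binary problems fit a single classifier
    cores = max_cores or os.cpu_count() or 1
    return max(1, min(n_classes, cores))


print(suggest_n_jobs(np.array([0, 1, 2, 2, 1, 0]), max_cores=8))  # 3
print(suggest_n_jobs(np.array([0, 1, 1, 0]), max_cores=8))        # 1
```

The returned value can be passed directly as SGDClassifier(n_jobs=...); whether it actually helps still depends on dataset size, so it is worth timing both settings.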

Issues to consider:

The optimal n_jobs value depends on your hardware, the number of classes, and the size of your dataset
Running more workers at once can increase memory usage
The speedup may not be linear due to scheduling and synchronization overhead between workers
For very small datasets, parallelization might actually slow down the training process
