The n_jobs parameter in scikit-learn’s SGDClassifier controls the number of CPU cores used for the one-versus-all (OvA) computation when training on multi-class problems.
Stochastic Gradient Descent (SGD) is an optimization algorithm used for training various linear models. SGDClassifier implements a plain stochastic gradient descent learning routine that supports different loss functions and penalties for classification.
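The n_jobs parameter determines how many CPU cores are used to fit the per-class binary classifiers of the one-versus-all scheme in parallel. The gradient updates inside each binary classifier remain sequential, so a binary problem gains nothing from it. A value of -1 uses all available cores, while a positive integer specifies the exact number of cores to use.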
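The scikit-learn documentation describes n_jobs as the number of CPUs used for the OVA (one-versus-all) computation in multi-class problems. The following minimal sketch makes the unit of parallel work visible by comparing the shape of coef_ for a binary and a three-class problem. The dataset sizes and variable names are illustrative assumptions, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Binary problem: a single linear model is fitted, so there is nothing to distribute
X_bin, y_bin = make_classification(n_samples=500, n_features=20, n_informative=10,
                                   n_classes=2, random_state=42)
binary_clf = SGDClassifier(random_state=42).fit(X_bin, y_bin)
print("binary coef_ shape:", binary_clf.coef_.shape)

# Three-class problem: one binary (class vs. rest) model per class,
# and these per-class fits are the tasks that n_jobs can spread over cores
X_multi, y_multi = make_classification(n_samples=500, n_features=20, n_informative=10,
                                       n_classes=3, random_state=42)
multi_clf = SGDClassifier(n_jobs=2, random_state=42).fit(X_multi, y_multi)
print("multi-class coef_ shape:", multi_clf.coef_.shape)
```

Each row of coef_ corresponds to one binary sub-problem, so the number of rows is also the number of tasks available to run in parallel.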
The default value for n_jobs is None, which means 1 (no parallelization) unless the code runs inside a joblib.parallel_backend context. Common values include -1 (use all cores) or positive integers like 2, 4, or 8, depending on the available hardware.
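To see how these values resolve on a given machine, joblib's effective_n_jobs helper can be queried directly. This is a sketch that assumes joblib is importable (it is a scikit-learn dependency). The threading backend and the small dataset are illustrative choices.

```python
from joblib import effective_n_jobs, parallel_backend
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Outside any joblib context, None resolves to a single worker
default_jobs = effective_n_jobs(None)
# -1 resolves to the number of CPUs joblib detects on this machine
all_cores = effective_n_jobs(-1)
print(f"n_jobs=None -> {default_jobs} worker(s), n_jobs=-1 -> {all_cores} worker(s)")

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

# Inside a parallel_backend context, estimators left at n_jobs=None
# pick up the worker count requested by the context instead
with parallel_backend("threading", n_jobs=2):
    jobs_in_context = effective_n_jobs(None)
    clf = SGDClassifier(random_state=42).fit(X, y)
print(f"n_jobs=None inside the context -> {jobs_in_context} worker(s)")
```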
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=3, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different n_jobs values
n_jobs_values = [None, 1, 2, 4, -1]
results = []
for n_jobs in n_jobs_values:
    start_time = time.time()
    sgd = SGDClassifier(n_jobs=n_jobs, random_state=42)
    sgd.fit(X_train, y_train)
    train_time = time.time() - start_time
    y_pred = sgd.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    results.append((n_jobs, train_time, accuracy))
    print(f"n_jobs={n_jobs}, Training Time: {train_time:.3f}s, Accuracy: {accuracy:.3f}")
Running the example gives an output like the following (exact timings vary by machine and from run to run):
n_jobs=None, Training Time: 0.149s, Accuracy: 0.677
n_jobs=1, Training Time: 0.151s, Accuracy: 0.677
n_jobs=2, Training Time: 0.137s, Accuracy: 0.677
n_jobs=4, Training Time: 0.092s, Accuracy: 0.677
n_jobs=-1, Training Time: 0.084s, Accuracy: 0.677
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train SGDClassifier models with different n_jobs values
- Measure training time and accuracy for each model
- Compare performance across different n_jobs configurations
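Single timings of fits that take a fraction of a second are noisy, so a comparison is easier to trust when each configuration is fitted several times. The sketch below uses a hypothetical helper, median_fit_time, which is not part of scikit-learn, and an illustrative dataset size. It reports the median wall-clock time over a few repeats.

```python
import statistics
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier


def median_fit_time(n_jobs, X, y, repeats=5):
    """Hypothetical helper: median wall-clock time of `repeats` fits."""
    durations = []
    for _ in range(repeats):
        clf = SGDClassifier(n_jobs=n_jobs, random_state=42)
        start = time.perf_counter()  # monotonic clock suited to short intervals
        clf.fit(X, y)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)

# Time a few n_jobs settings on the same data
timings = {n_jobs: median_fit_time(n_jobs, X, y) for n_jobs in (1, 2, -1)}
for n_jobs, seconds in timings.items():
    print(f"n_jobs={n_jobs}: median fit time {seconds:.4f}s")
```

Differences smaller than the spread between repeats should not be read as a real speedup.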
Some tips and heuristics for setting n_jobs:
- Use -1 to utilize all available CPU cores for maximum parallelization
- For large multi-class datasets, increasing n_jobs can reduce training time, since each class's binary classifier can be fitted on its own core
- On smaller datasets, the overhead of parallelization might outweigh the benefits
Issues to consider:
- The optimal n_jobs value depends on your hardware, the number of classes, and the size of your dataset
- Using too many cores can lead to increased memory usage
- The speedup may not be linear due to coordination overhead between workers, and it is capped by the number of classes
- For very small datasets, parallelization might actually slow down the training process
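Because the parallel work consists of one binary fit per class, requesting more workers than there are classes cannot help. One way to encode that reasoning is sketched below. suggest_n_jobs is a hypothetical helper and its rule (the smaller of the class count and the core count) is an assumption for illustration, not a scikit-learn recommendation.

```python
import os

import numpy as np


def suggest_n_jobs(y, max_workers=None):
    """Hypothetical heuristic: at most one worker per class, capped by the core count."""
    n_classes = len(np.unique(y))
    if n_classes <= 2:
        # Binary problems fit a single model, so extra workers stay idle
        return 1
    available = max_workers or os.cpu_count() or 1
    return min(n_classes, available)


print(suggest_n_jobs([0, 1, 0, 1]))                        # binary labels -> 1
print(suggest_n_jobs([0, 1, 2, 1, 0, 2], max_workers=8))   # 3 classes, 8 cores -> 3
print(suggest_n_jobs(list(range(10)), max_workers=4))      # 10 classes, 4 cores -> 4
```

The returned value can be passed as n_jobs to SGDClassifier. Whether it actually beats n_jobs=1 on a given dataset still needs to be measured.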