The class_weight parameter in scikit-learn's SGDClassifier adjusts the importance of classes during training, which is particularly useful for imbalanced datasets.
SGDClassifier (Stochastic Gradient Descent Classifier) is a linear classifier that uses stochastic gradient descent for optimization. It's efficient for large-scale learning and supports different loss functions.
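For instance, the loss parameter selects the objective being optimized. A brief illustration (note that the logistic loss is spelled 'log_loss' in recent scikit-learn releases; older versions used 'log'):

from sklearn.linear_model import SGDClassifier
# 'hinge' trains a linear SVM-style objective; 'log_loss' trains logistic regression
svm_like = SGDClassifier(loss='hinge')
logreg_like = SGDClassifier(loss='log_loss')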
The class_weight parameter scales each sample's contribution to the loss, and hence to the gradient update, according to its class, effectively giving more importance to samples from the minority class. This helps prevent the classifier from being biased towards the majority class.
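As a rough sketch of the idea (illustrative only, not scikit-learn's actual implementation), a class weight multiplies the gradient contribution of each sample. For the hinge loss:

import numpy as np

def weighted_sgd_step(w, x, y, class_weight, lr=0.01):
    # Illustrative sketch: one SGD update for hinge loss, with the sample's
    # gradient scaled by its class weight (y is -1 or +1 here).
    cw = class_weight[y]
    if y * np.dot(w, x) < 1:  # hinge loss is active for this sample
        w = w + lr * cw * y * np.asarray(x)  # minority samples take bigger steps
    return w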
By default, class_weight is set to None, treating all classes equally. Common options include 'balanced' (automatically adjusts weights inversely proportional to class frequencies) and a custom dictionary specifying a weight for each class.
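The 'balanced' weights can be reproduced with scikit-learn's compute_class_weight helper, which computes n_samples / (n_classes * class counts). A small sketch on a hypothetical 90/10 label vector:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 90 + [1] * 10)  # 90/10 imbalance
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], weights)))  # {0: 0.556, 1: 5.0}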
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
n_features=20, n_informative=3, n_redundant=0,
random_state=42)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different class_weight settings
class_weights = [None, 'balanced', {0:1, 1:9}]
for weight in class_weights:
    clf = SGDClassifier(class_weight=weight, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    print(f"class_weight={weight}, F1-score: {f1:.3f}")
Running the example gives an output like:
class_weight=None, F1-score: 0.703
class_weight=balanced, F1-score: 0.516
class_weight={0: 1, 1: 9}, F1-score: 0.510
Note that weighting does not automatically improve F1: in this run the default class_weight=None actually scores highest, which is why each setting should be evaluated rather than assuming 'balanced' will help. The key steps in this example are:
- Generate a synthetic imbalanced binary classification dataset
- Split the data into train and test sets
- Train SGDClassifier models with different class_weight settings
- Evaluate each model's performance using F1-score
Tips and heuristics for setting class_weight:
- Use ‘balanced’ as a starting point for imbalanced datasets
- For severe imbalances, consider a custom dictionary with higher weight for the minority class
- Experiment with different weights and evaluate performance on a validation set, as sketched below
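One convenient way to run that experiment is a cross-validated grid search over class_weight. A minimal sketch reusing X_train and y_train from the example above (the candidate weights here are illustrative, not recommendations):

from sklearn.model_selection import GridSearchCV

# Try several weightings and pick the one with the best cross-validated F1
param_grid = {'class_weight': [None, 'balanced', {0: 1, 1: 5}, {0: 1, 1: 9}]}
search = GridSearchCV(SGDClassifier(random_state=42), param_grid,
                      scoring='f1', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)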
Issues to consider:
- Adjusting class weights may lead to longer training times
- Extreme weight adjustments can result in overfitting to the minority class
- The optimal class weights depend on the specific dataset and problem