The class_weight parameter in scikit-learn's SGDClassifier adjusts the importance of classes during training, which is particularly useful for imbalanced datasets.
SGDClassifier (Stochastic Gradient Descent Classifier) is a linear classifier that uses stochastic gradient descent for optimization. It's efficient for large-scale learning and supports different loss functions.
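For instance, the loss parameter selects the objective being optimized. A brief illustration (note that the logistic loss is spelled 'log_loss' in recent scikit-learn releases; older versions used 'log'):

from sklearn.linear_model import SGDClassifier
# 'hinge' trains a linear SVM-style objective; 'log_loss' trains logistic regression
svm_like = SGDClassifier(loss='hinge')
logreg_like = SGDClassifier(loss='log_loss')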
The class_weight parameter scales each sample's contribution to the loss, and hence to the gradient update, according to its class, effectively giving more importance to samples from the minority class. This helps prevent the classifier from being biased towards the majority class.
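As a rough sketch of the idea (illustrative only, not scikit-learn's actual implementation), a class weight multiplies the gradient contribution of each sample. For the hinge loss:

import numpy as np

def weighted_sgd_step(w, x, y, class_weight, lr=0.01):
    # Illustrative sketch: one SGD update for hinge loss, with the sample's
    # gradient scaled by its class weight (y is -1 or +1 here).
    cw = class_weight[y]
    if y * np.dot(w, x) < 1:  # hinge loss is active for this sample
        w = w + lr * cw * y * np.asarray(x)  # minority samples take bigger steps
    return w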
By default, class_weight is set to None, treating all classes equally. Common options include 'balanced' (automatically adjusts weights inversely proportional to class frequencies) and a custom dictionary specifying a weight for each class.
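The 'balanced' weights can be reproduced with scikit-learn's compute_class_weight helper, which computes n_samples / (n_classes * class counts). A small sketch on a hypothetical 90/10 label vector:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_toy = np.array([0] * 90 + [1] * 10)  # 90/10 imbalance
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], weights)))  # {0: 0.556, 1: 5.0}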
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
n_features=20, n_informative=3, n_redundant=0,
random_state=42)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different class_weight settings
class_weights = [None, 'balanced', {0:1, 1:9}]
for weight in class_weights:
    clf = SGDClassifier(class_weight=weight, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    print(f"class_weight={weight}, F1-score: {f1:.3f}")
Running the example gives an output like:
class_weight=None, F1-score: 0.703
class_weight=balanced, F1-score: 0.516
class_weight={0: 1, 1: 9}, F1-score: 0.510
Note that weighting does not automatically improve F1: in this run the default class_weight=None actually scores highest, which is why each setting should be evaluated rather than assuming 'balanced' will help. The key steps in this example are:
- Generate a synthetic imbalanced binary classification dataset
- Split the data into train and test sets
- Train SGDClassifier models with different class_weight settings
- Evaluate each model's performance using F1-score
Tips and heuristics for setting class_weight:
- Use ‘balanced’ as a starting point for imbalanced datasets
- For severe imbalances, consider a custom dictionary with higher weight for the minority class
- Experiment with different weights and evaluate performance on a validation set, as sketched below
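One convenient way to run that experiment is a cross-validated grid search over class_weight. A minimal sketch reusing X_train and y_train from the example above (the candidate weights here are illustrative, not recommendations):

from sklearn.model_selection import GridSearchCV

# Try several weightings and pick the one with the best cross-validated F1
param_grid = {'class_weight': [None, 'balanced', {0: 1, 1: 5}, {0: 1, 1: 9}]}
search = GridSearchCV(SGDClassifier(random_state=42), param_grid,
                      scoring='f1', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)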
Issues to consider:
- Adjusting class weights may lead to longer training times
- Extreme weight adjustments can result in overfitting to the minority class
- The optimal class weights depend on the specific dataset and problem