The `class_weight` parameter in scikit-learn's `SVC` addresses class imbalance by assigning higher weights to the minority class.
Support Vector Machines (SVMs) are powerful classifiers, but they can be sensitive to class imbalance. When one class has significantly fewer instances than the other, the model may be biased towards the majority class.
The `class_weight` parameter helps mitigate this issue by giving more importance to the minority class during training. It accepts a dictionary that maps class labels to their corresponding weights.
The default value for `class_weight` is `None`, which assigns equal weight to all classes. Another common value is `'balanced'`, which automatically sets class weights inversely proportional to their frequencies.
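To see exactly what `'balanced'` computes, scikit-learn provides `compute_class_weight` in `sklearn.utils.class_weight`; here is a minimal sketch on a toy label vector with the same 90/10 split used in the example below:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy label vector: 90 majority-class samples, 10 minority-class samples
y = np.array([0] * 90 + [1] * 10)

# 'balanced' computes n_samples / (n_classes * np.bincount(y)),
# giving roughly 0.556 for class 0 and 5.0 for class 1 here
weights = compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
for cls, w in zip(np.unique(y), weights):
    print(f"class {cls}: weight {w:.3f}")
```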
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score
# Generate imbalanced synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different class_weight values
class_weight_values = [None, 'balanced', {0: 1, 1: 10}]
f1_scores = []
for cw in class_weight_values:
    svc = SVC(kernel='linear', class_weight=cw, random_state=42)
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    f1_scores.append(f1)
    print(f"class_weight={cw}, F1-score: {f1:.3f}")
```
Running the example gives an output like:
```
class_weight=None, F1-score: 0.353
class_weight=balanced, F1-score: 0.536
class_weight={0: 1, 1: 10}, F1-score: 0.500
```
The key steps in this example are:
- Generate an imbalanced synthetic binary classification dataset
- Split the data into train and test sets
- Train `SVC` models with different `class_weight` values
- Evaluate the models using the F1-score metric
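Rather than hand-coding the loop, candidate `class_weight` settings can also be tuned with cross-validation. Here is a minimal sketch using `GridSearchCV`, reusing `X_train` and `y_train` from the example; the candidate grid is an assumption for illustration, not a recommendation:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate class_weight settings to cross-validate (an assumed grid)
param_grid = {'class_weight': [None, 'balanced', {0: 1, 1: 5}, {0: 1, 1: 10}]}

# Score each setting by F1 so the minority class drives model selection
grid = GridSearchCV(SVC(kernel='linear', random_state=42), param_grid,
                    scoring='f1', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```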
Some tips and heuristics for setting `class_weight`:
- Use `'balanced'` as a quick way to assign class weights inversely proportional to their frequencies
- For more control, provide a dictionary with class labels as keys and corresponding weights as values
- Higher weights for the minority class can improve its recall, but may decrease precision, as the sketch below illustrates
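A hedged sketch of that trade-off, sweeping increasingly aggressive minority-class weights on the train/test split from the example above (the weight values are arbitrary picks for illustration):

```python
from sklearn.metrics import precision_score, recall_score
from sklearn.svm import SVC

# Higher minority-class weights push the decision boundary toward the
# majority class, typically raising recall at the cost of precision
for w in [1, 5, 10, 50]:
    svc = SVC(kernel='linear', class_weight={0: 1, 1: w}, random_state=42)
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    print(f"weight={w}: precision={precision_score(y_test, y_pred):.3f}, "
          f"recall={recall_score(y_test, y_pred):.3f}")
```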
Issues to consider:
- The optimal `class_weight` depends on the specific dataset and the desired trade-off between precision and recall
- Extreme class imbalance may require additional techniques like oversampling or undersampling
- Evaluate the model using metrics that account for class imbalance, such as F1-score, precision-recall curves, or ROC AUC (see the sketch below)
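On the last point, `SVC` exposes `decision_function` scores that plug directly into threshold-free metrics; a short sketch, again reusing the train/test split from the example above:

```python
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.svm import SVC

# Continuous scores from the SVM margin, no probability calibration needed
svc = SVC(kernel='linear', class_weight='balanced', random_state=42)
svc.fit(X_train, y_train)
scores = svc.decision_function(X_test)

# ROC AUC and average precision summarize ranking quality across all thresholds
print(f"ROC AUC: {roc_auc_score(y_test, scores):.3f}")
print(f"Average precision: {average_precision_score(y_test, scores):.3f}")
```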