The class_weight parameter in scikit-learn’s HistGradientBoostingClassifier helps address class imbalance in classification tasks.
HistGradientBoostingClassifier is a fast, histogram-based implementation of gradient boosting. It builds an ensemble of decision trees sequentially, with each tree correcting the errors made by the previous ones.
The class_weight parameter adjusts the importance of each class during training. It can help the model pay more attention to minority classes, improving performance on imbalanced datasets.
By default, class_weight is set to None, treating all classes equally. Common options include ‘balanced’ (automatically adjusts weights inversely proportional to class frequencies) or a dictionary specifying a custom weight for each class.
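To make the ‘balanced’ option concrete, scikit-learn exposes the same weighting rule through compute_class_weight: each class gets n_samples / (n_classes * class_count), so rarer classes receive larger weights. A minimal sketch with a hypothetical 90/10 label vector:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90% class 0, 10% class 1
y = np.array([0] * 900 + [1] * 100)

# 'balanced' assigns each class the weight n_samples / (n_classes * class_count)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print(weights)  # class 0 -> 1000 / (2 * 900) ≈ 0.556, class 1 -> 1000 / (2 * 100) = 5.0
```

Passing a dictionary such as {0: 1, 1: 9} applies the same idea with weights you choose by hand.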
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import f1_score
# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
n_informative=3, n_redundant=1, flip_y=0, random_state=42)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train models with different class_weight settings
class_weights = [None, 'balanced', {0: 1, 1: 9}]
for weights in class_weights:
    clf = HistGradientBoostingClassifier(class_weight=weights, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    print(f"class_weight={weights}, F1 Score: {f1:.3f}")
Running this example produces output similar to:
class_weight=None, F1 Score: 0.852
class_weight=balanced, F1 Score: 0.836
class_weight={0: 1, 1: 9}, F1 Score: 0.836
Key steps in this example:
- Generate an imbalanced binary classification dataset
- Split the data into train and test sets
- Train HistGradientBoostingClassifier models with different class_weight settings
- Evaluate each model’s performance using F1 score
Tips for setting class_weight:
- Use ‘balanced’ when you want automatic weight calculation
- Calculate custom weights as (1 - fraction_of_samples) for each class
- Always use cross-validation when tuning class_weight
Issues to consider:
- Adjusting class weights may increase training time
- Extreme weight adjustments can lead to overfitting
- There’s often a trade-off between precision and recall when modifying class weights