The `class_weight` parameter in scikit-learn's `LogisticRegression` controls the weighting of classes in the classification algorithm.
Logistic Regression is a linear model for binary classification that predicts the probability of a given input belonging to a certain class. The `class_weight` parameter helps handle imbalanced datasets by adjusting the weight assigned to each class in the loss function, so that the model is not biased towards the majority class.
The default value for `class_weight` is `None`, meaning all classes are weighted equally. Common alternatives include `'balanced'`, which adjusts weights inversely proportional to class frequencies, or a dict mapping each class label to a manually specified weight.
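To see exactly what `'balanced'` resolves to, scikit-learn exposes the same computation via `sklearn.utils.class_weight.compute_class_weight`, which uses the formula `n_samples / (n_classes * np.bincount(y))`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 90/10 imbalanced labels
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weight for each class: n_samples / (n_classes * count)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # the minority class gets the larger weight
```

Here class 0 gets weight 100 / (2 × 90) ≈ 0.556 and class 1 gets 100 / (2 × 10) = 5.0, so each minority sample counts roughly nine times as much as a majority sample.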
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Generate synthetic imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, n_clusters_per_class=1,
                           weights=[0.9, 0.1], flip_y=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different class_weight values
class_weight_values = [None, 'balanced', {0: 0.5, 1: 4}]
accuracies = []
for cw in class_weight_values:
    lr = LogisticRegression(class_weight=cw, random_state=42)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"class_weight={cw}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
class_weight=None, Accuracy: 0.995
class_weight=balanced, Accuracy: 0.965
class_weight={0: 0.5, 1: 4}, Accuracy: 0.965
```
The key steps in this example are:

- Generate a synthetic binary classification dataset with an imbalanced class distribution.
- Split the data into train and test sets.
- Train `LogisticRegression` models with different `class_weight` values.
- Evaluate the accuracy of each model on the test set.
Some tips and heuristics for setting `class_weight`:

- Use `class_weight='balanced'` to automatically adjust weights inversely proportional to class frequencies.
- Manually specify weights if you have domain knowledge about the importance of each class.
- Be cautious of overcompensating for imbalance, which can lead to overfitting.
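Note also that plain accuracy can look high even when the minority class is poorly served, so it is worth checking per-class metrics alongside it. The sketch below reuses the dataset from this example and compares minority-class recall with and without weighting (the exact numbers depend on the random split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, n_clusters_per_class=1,
                           weights=[0.9, 0.1], flip_y=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for cw in [None, 'balanced']:
    lr = LogisticRegression(class_weight=cw, random_state=42, max_iter=1000)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    # Recall on the minority class (label 1) reveals what overall accuracy hides
    print(f"class_weight={cw}, minority recall: {recall_score(y_test, y_pred):.3f}")
```

A model can lose a little overall accuracy under weighting while gaining recall on the class you actually care about, which is often the right trade for imbalanced problems.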
Issues to consider:
- The optimal `class_weight` settings depend on the degree of imbalance and the specific problem.
- Using `class_weight='balanced'` is a good starting point for most imbalanced datasets.
- Manually tuning weights might be necessary for highly imbalanced datasets or specific use cases.
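One way to approach manual tuning is to treat `class_weight` as a hyperparameter and search over candidate values with cross-validation. The grid below is illustrative rather than tuned, and it is scored with F1 so the search reflects minority-class performance instead of raw accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, n_informative=2,
                           n_redundant=10, n_clusters_per_class=1,
                           weights=[0.9, 0.1], flip_y=0, random_state=42)

# Candidate weightings: the specific dicts here are illustrative, not recommendations
param_grid = {'class_weight': [None, 'balanced',
                               {0: 1, 1: 2}, {0: 1, 1: 5}, {0: 1, 1: 10}]}

grid = GridSearchCV(LogisticRegression(random_state=42, max_iter=1000),
                    param_grid, scoring='f1', cv=5)
grid.fit(X, y)
print(grid.best_params_, f"F1: {grid.best_score_:.3f}")
```

Swapping in a different `scoring` metric (e.g. `'recall'` or `'f1_macro'`) changes which weighting wins, so pick the metric that matches the cost of errors in your application.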