Logistic regression is a widely used algorithm for binary classification tasks. In scikit-learn, the LogisticRegression class provides an implementation of this algorithm. However, to achieve optimal performance, the model's hyperparameters often need to be tuned. This is where LogisticRegressionCV comes in, offering the same functionality as LogisticRegression but with built-in cross-validation for automated hyperparameter tuning.
LogisticRegression has several key hyperparameters, including C (inverse of regularization strength), penalty (type of regularization), and solver (optimization algorithm). Tuning these manually can be time-consuming and requires domain knowledge.
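For example, each of these is set explicitly when constructing the estimator. A minimal sketch using scikit-learn's documented parameters; the specific values here are illustrative, not recommendations:

from sklearn.linear_model import LogisticRegression

# Manually chosen hyperparameters: every combination must be validated by hand
model = LogisticRegression(
    C=0.1,           # inverse of regularization strength (smaller = stronger regularization)
    penalty="l2",    # type of regularization; must be supported by the chosen solver
    solver="lbfgs",  # optimization algorithm
    max_iter=1000,   # allow extra iterations so the solver can converge
)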
On the other hand, LogisticRegressionCV automates the hyperparameter search using cross-validation. Its key hyperparameters include Cs (the list of C values to try), cv (the number of folds for cross-validation), and scoring (the metric to optimize).
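As a sketch, the search space is declared up front and the estimator evaluates every candidate for you (again, the values below are illustrative):

from sklearn.linear_model import LogisticRegressionCV

# Cross-validated search over candidate C values
model = LogisticRegressionCV(
    Cs=[0.01, 0.1, 1.0, 10.0],  # candidate C values (passing an int generates a log-spaced grid)
    cv=5,                        # number of cross-validation folds
    scoring="f1",                # metric to optimize (accuracy if left unset)
    max_iter=1000,
)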
The main difference is that LogisticRegressionCV automates the hyperparameter tuning process, while LogisticRegression requires manual tuning. However, this automation comes at a computational cost, as LogisticRegressionCV trains multiple models during cross-validation.
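To see where the cost comes from, count the fits: cross-validation trains roughly one model per candidate C per fold, plus a final refit on the full training set. A back-of-the-envelope sketch assuming the default grid of 10 C values and 5 folds (warm-starting along the regularization path makes individual fits cheaper, so treat this as an upper-bound estimate):

n_Cs, n_folds = 10, 5
# Roughly one fit per (C, fold) pair during the search, plus one refit with the selected C
total_fits = n_Cs * n_folds + 1
print(total_fits)  # 51, versus a single fit for plain LogisticRegression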
LogisticRegression is ideal for quick prototyping or when you have prior knowledge of good hyperparameter values. LogisticRegressionCV is preferred when you need to tune hyperparameters and perform model selection, especially with new datasets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score, f1_score
# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.8, 0.2], random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and evaluate LogisticRegression with default hyperparameters
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print(f"LogisticRegression accuracy: {accuracy_score(y_test, y_pred_lr):.3f}")
print(f"LogisticRegression F1 score: {f1_score(y_test, y_pred_lr):.3f}")
# Fit and evaluate LogisticRegressionCV with cross-validation
lrcv = LogisticRegressionCV(cv=5, random_state=42)
lrcv.fit(X_train, y_train)
y_pred_lrcv = lrcv.predict(X_test)
print(f"\nLogisticRegressionCV accuracy: {accuracy_score(y_test, y_pred_lrcv):.3f}")
print(f"LogisticRegressionCV F1 score: {f1_score(y_test, y_pred_lrcv):.3f}")
print(f"Best hyperparameters: {lrcv.get_params()}")
Running the example gives an output like:
LogisticRegression accuracy: 0.850
LogisticRegression F1 score: 0.595
LogisticRegressionCV accuracy: 0.845
LogisticRegressionCV F1 score: 0.587
LogisticRegressionCV parameters: {'Cs': 10, 'class_weight': None, 'cv': 5, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1.0, 'l1_ratios': None, 'max_iter': 100, 'multi_class': 'deprecated', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'refit': True, 'scoring': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0}
The steps are as follows:
- Generate a synthetic imbalanced binary classification dataset using make_classification.
- Split the data into training and test sets using train_test_split.
- Instantiate LogisticRegression with default hyperparameters, fit it on the training data, and evaluate its performance on the test set.
- Instantiate LogisticRegressionCV with 5-fold cross-validation, fit it on the training data, and evaluate its performance on the test set.
- Compare the test set performance (accuracy and F1 score) of both models and print the configuration of LogisticRegressionCV; the C value it actually selected is exposed as a fitted attribute, as shown in the sketch after this list.
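To inspect what cross-validation actually selected, the fitted estimator exposes the chosen C and the per-fold score grid. A minimal self-contained sketch; the attribute names C_ and scores_ are from scikit-learn's LogisticRegressionCV:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.8, 0.2], random_state=42)
lrcv = LogisticRegressionCV(cv=5, random_state=42).fit(X, y)

print(lrcv.C_)       # selected C per class; a single entry for binary problems
print(lrcv.scores_)  # dict mapping class label to an (n_folds, n_Cs) grid of scores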