Logistic regression is a widely used algorithm for binary classification tasks. In scikit-learn, the LogisticRegression class provides an implementation of this algorithm. However, to achieve optimal performance, the model's hyperparameters often need to be tuned. This is where LogisticRegressionCV comes in, offering the same functionality as LogisticRegression but with built-in cross-validation for automated hyperparameter tuning.
LogisticRegression has several key hyperparameters, including C (inverse of regularization strength), penalty (type of regularization), and solver (optimization algorithm). Tuning these manually can be time-consuming and requires domain knowledge.
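For example, each of these is set explicitly when constructing the estimator. A minimal sketch using scikit-learn's documented parameters; the specific values here are illustrative, not recommendations:

from sklearn.linear_model import LogisticRegression

# Manually chosen hyperparameters: every combination must be validated by hand
model = LogisticRegression(
    C=0.1,           # inverse of regularization strength (smaller = stronger regularization)
    penalty="l2",    # type of regularization; must be supported by the chosen solver
    solver="lbfgs",  # optimization algorithm
    max_iter=1000,   # allow extra iterations so the solver can converge
)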
On the other hand, LogisticRegressionCV automates the hyperparameter search using cross-validation. Its key hyperparameters include Cs (the list of C values to try), cv (the number of folds for cross-validation), and scoring (the metric to optimize).
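As a sketch, the search space is declared up front and the estimator evaluates every candidate for you (again, the values below are illustrative):

from sklearn.linear_model import LogisticRegressionCV

# Cross-validated search over candidate C values
model = LogisticRegressionCV(
    Cs=[0.01, 0.1, 1.0, 10.0],  # candidate C values (passing an int generates a log-spaced grid)
    cv=5,                        # number of cross-validation folds
    scoring="f1",                # metric to optimize (accuracy if left unset)
    max_iter=1000,
)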
The main difference is that LogisticRegressionCV automates the hyperparameter tuning process, while LogisticRegression requires manual tuning. However, this automation comes at a computational cost, as LogisticRegressionCV trains multiple models during cross-validation.
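To see where the cost comes from, count the fits: cross-validation trains roughly one model per candidate C per fold, plus a final refit on the full training set. A back-of-the-envelope sketch assuming the default grid of 10 C values and 5 folds (warm-starting along the regularization path makes individual fits cheaper, so treat this as an upper-bound estimate):

n_Cs, n_folds = 10, 5
# Roughly one fit per (C, fold) pair during the search, plus one refit with the selected C
total_fits = n_Cs * n_folds + 1
print(total_fits)  # 51, versus a single fit for plain LogisticRegression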
LogisticRegression is ideal for quick prototyping or when you have prior knowledge of good hyperparameter values. LogisticRegressionCV is preferred when you need to tune hyperparameters and perform model selection, especially with new datasets.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import accuracy_score, f1_score
# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.8, 0.2], random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and evaluate LogisticRegression with default hyperparameters
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print(f"LogisticRegression accuracy: {accuracy_score(y_test, y_pred_lr):.3f}")
print(f"LogisticRegression F1 score: {f1_score(y_test, y_pred_lr):.3f}")
# Fit and evaluate LogisticRegressionCV with cross-validation
lrcv = LogisticRegressionCV(cv=5, random_state=42)
lrcv.fit(X_train, y_train)
y_pred_lrcv = lrcv.predict(X_test)
print(f"\nLogisticRegressionCV accuracy: {accuracy_score(y_test, y_pred_lrcv):.3f}")
print(f"LogisticRegressionCV F1 score: {f1_score(y_test, y_pred_lrcv):.3f}")
print(f"Best hyperparameters: {lrcv.get_params()}")
Running the example gives an output like:
LogisticRegression accuracy: 0.850
LogisticRegression F1 score: 0.595
LogisticRegressionCV accuracy: 0.845
LogisticRegressionCV F1 score: 0.587
LogisticRegressionCV parameters: {'Cs': 10, 'class_weight': None, 'cv': 5, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1.0, 'l1_ratios': None, 'max_iter': 100, 'multi_class': 'deprecated', 'n_jobs': None, 'penalty': 'l2', 'random_state': 42, 'refit': True, 'scoring': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0}
The steps are as follows:
- Generate a synthetic imbalanced binary classification dataset using make_classification.
- Split the data into training and test sets using train_test_split.
- Instantiate LogisticRegression with default hyperparameters, fit it on the training data, and evaluate its performance on the test set.
- Instantiate LogisticRegressionCV with 5-fold cross-validation, fit it on the training data, and evaluate its performance on the test set.
- Compare the test set performance (accuracy and F1 score) of both models and print the configuration of LogisticRegressionCV; the C value it actually selected is exposed as a fitted attribute, as shown in the sketch after this list.
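To inspect what cross-validation actually selected, the fitted estimator exposes the chosen C and the per-fold score grid. A minimal self-contained sketch; the attribute names C_ and scores_ are from scikit-learn's LogisticRegressionCV:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.8, 0.2], random_state=42)
lrcv = LogisticRegressionCV(cv=5, random_state=42).fit(X, y)

print(lrcv.C_)       # selected C per class; a single entry for binary problems
print(lrcv.scores_)  # dict mapping class label to an (n_folds, n_Cs) grid of scores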