The C parameter in scikit-learn's LogisticRegression controls the inverse of the regularization strength.

Logistic Regression is a linear model for binary classification that estimates class probabilities using the logistic function. The C parameter determines the strength of the L2 regularization (the default penalty) applied to the model coefficients.

Smaller values of C specify stronger regularization, which can help prevent overfitting but may cause underfitting if set too small. Conversely, larger values of C mean less regularization, potentially allowing a better fit to the training data but risking overfitting.

The default value for C is 1.0. In practice, C values ranging from 0.01 to 100 are commonly tested to find the optimal regularization strength for a given dataset.
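The effect is visible on the fitted coefficients themselves. The short sketch below (assuming the default L2 penalty and solver; the dataset parameters are arbitrary illustrative choices) fits the same data with a small and a large C and prints the coefficient norm, which shrinks as C decreases:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small illustrative dataset (parameters are arbitrary choices for this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

for c in [0.01, 100]:
    model = LogisticRegression(C=c, random_state=42).fit(X_demo, y_demo)
    # The L2 norm of the coefficient vector shrinks as C decreases
    print(f"C={c}, coefficient norm: {np.linalg.norm(model.coef_):.3f}")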
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Generate synthetic dataset with some class overlap
X, y = make_classification(n_samples=1000, n_classes=2, n_features=10,
                           n_informative=5, n_redundant=0, n_clusters_per_class=2,
                           class_sep=1.5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different C values
C_values = [0.01, 0.1, 1, 10, 100]
accuracies = []
for c in C_values:
    lr = LogisticRegression(C=c, random_state=42)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"C={c}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
C=0.01, Accuracy: 0.865
C=0.1, Accuracy: 0.860
C=1, Accuracy: 0.860
C=10, Accuracy: 0.855
C=100, Accuracy: 0.855
The key steps in this example are:

- Generate a synthetic binary classification dataset with some class overlap
- Split the data into train and test sets
- Train LogisticRegression models with different C values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting C:

- Smaller C means more regularization, which can help prevent overfitting
- Larger C means less regularization, potentially leading to a better fit but with a risk of overfitting
- Try a range of values from small to large and pick the one that performs best on a validation set (see the sketch after this list)
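As a concrete illustration of that last tip, here is a minimal sketch using GridSearchCV (cross-validation in place of a single validation split; the C grid is the same illustrative range as above) to select C on the training data from the example:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validation over the candidate C values, reusing X_train/y_train from above
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(f"Best C: {grid.best_params_['C']}, CV accuracy: {grid.best_score_:.3f}")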
Issues to consider:

- Very small C can cause underfitting
- Very large C provides little to no regularization
- The optimal C depends on the dataset; there is no universally best value
- C should be tuned using cross-validation or a separate validation set (see the sketch below)
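For the last point, scikit-learn also provides LogisticRegressionCV, which tunes C by cross-validation internally; a brief sketch (reusing X_train and y_train from the example above, with an assumed 5-fold split):

from sklearn.linear_model import LogisticRegressionCV

# Cross-validates over the given C values and refits with the best one
lr_cv = LogisticRegressionCV(Cs=[0.01, 0.1, 1, 10, 100], cv=5, random_state=42)
lr_cv.fit(X_train, y_train)
print(f"Selected C: {lr_cv.C_[0]}")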