The C parameter in scikit-learn's LogisticRegression controls the inverse of the regularization strength.

Logistic Regression is a linear model for binary classification that estimates class probabilities using the logistic function. The C parameter determines the strength of the L2 regularization (the default penalty) applied to the model coefficients.

Smaller values of C specify stronger regularization, which can help prevent overfitting but may cause underfitting if set too small. Conversely, larger values of C mean less regularization, potentially allowing a better fit to the training data but risking overfitting.

The default value for C is 1.0. In practice, C values ranging from 0.01 to 100 are commonly tested to find the optimal regularization strength for a given dataset.
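The effect is visible on the fitted coefficients themselves. The short sketch below (assuming the default L2 penalty and solver; the dataset parameters are arbitrary illustrative choices) fits the same data with a small and a large C and prints the coefficient norm, which shrinks as C decreases:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small illustrative dataset (parameters are arbitrary choices for this sketch)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

for c in [0.01, 100]:
    model = LogisticRegression(C=c, random_state=42).fit(X_demo, y_demo)
    # The L2 norm of the coefficient vector shrinks as C decreases
    print(f"C={c}, coefficient norm: {np.linalg.norm(model.coef_):.3f}")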
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Generate synthetic dataset with some class overlap
X, y = make_classification(n_samples=1000, n_classes=2, n_features=10,
                           n_informative=5, n_redundant=0, n_clusters_per_class=2,
                           class_sep=1.5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different C values
C_values = [0.01, 0.1, 1, 10, 100]
accuracies = []
for c in C_values:
    lr = LogisticRegression(C=c, random_state=42)
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"C={c}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
C=0.01, Accuracy: 0.865
C=0.1, Accuracy: 0.860
C=1, Accuracy: 0.860
C=10, Accuracy: 0.855
C=100, Accuracy: 0.855
The key steps in this example are:

- Generate a synthetic binary classification dataset with some class overlap
- Split the data into train and test sets
- Train LogisticRegression models with different C values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting C:

- Smaller C means more regularization, which can help prevent overfitting
- Larger C means less regularization, potentially leading to a better fit but with a risk of overfitting
- Try a range of values from small to large and pick the one that performs best on a validation set (see the sketch after this list)
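As a concrete illustration of that last tip, here is a minimal sketch using GridSearchCV (cross-validation in place of a single validation split; the C grid is the same illustrative range as above) to select C on the training data from the example:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# 5-fold cross-validation over the candidate C values, reusing X_train/y_train from above
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)
print(f"Best C: {grid.best_params_['C']}, CV accuracy: {grid.best_score_:.3f}")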
Issues to consider:

- Very small C can cause underfitting
- Very large C provides little to no regularization
- The optimal C depends on the dataset; there is no universally best value
- C should be tuned using cross-validation or a separate validation set (see the sketch below)
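For the last point, scikit-learn also provides LogisticRegressionCV, which tunes C by cross-validation internally; a brief sketch (reusing X_train and y_train from the example above, with an assumed 5-fold split):

from sklearn.linear_model import LogisticRegressionCV

# Cross-validates over the given C values and refits with the best one
lr_cv = LogisticRegressionCV(Cs=[0.01, 0.1, 1, 10, 100], cv=5, random_state=42)
lr_cv.fit(X_train, y_train)
print(f"Selected C: {lr_cv.C_[0]}")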