The C parameter in scikit-learn's SVC class controls the regularization strength. Note that C is inversely proportional to regularization: smaller values mean stronger regularization. It governs the balance between achieving a clean separation of classes and maximizing the margin.
The Support Vector Machine (SVM) is a powerful algorithm for classification tasks. It finds the hyperplane that best separates the classes while maximizing the margin to the closest points from each class.
The C parameter determines the penalty for misclassified samples. A smaller C allows more misclassifications but may generalize better to unseen data. A larger C heavily penalizes misclassifications, potentially leading to overfitting.
The default value for C is 1.0.
In practice, C values are typically chosen from a logarithmic scale, such as 0.1, 1, 10, 100. The example below trains SVC models with several C values on a synthetic dataset and compares their test accuracy:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=10,
                           n_informative=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different C values
C_values = [0.1, 1, 10, 100]
accuracies = []
for c in C_values:
    svc = SVC(C=c, random_state=42)
    svc.fit(X_train, y_train)
    y_pred = svc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"C={c}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
C=0.1, Accuracy: 0.920
C=1, Accuracy: 0.920
C=10, Accuracy: 0.945
C=100, Accuracy: 0.960
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative features
- Split the data into train and test sets
- Train SVC models with different C values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting C:
- Try a range of values on a logarithmic scale (e.g., 0.1, 1, 10, 100); a cross-validated search is sketched after this list
- Smaller C values provide more regularization, which can help with overfitting
- Larger C values may fit the training data better but might not generalize well
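One common way to apply the first tip is a cross-validated grid search. Below is a minimal sketch using scikit-learn's GridSearchCV on the same synthetic dataset as above; the grid values and cv=5 are illustrative choices, not part of the original example.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Same synthetic dataset as the example above
X, y = make_classification(n_samples=1000, n_classes=2, n_features=10,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Search over C on a logarithmic scale with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10, 100]}
grid = GridSearchCV(SVC(random_state=42), param_grid, cv=5)
grid.fit(X_train, y_train)

print(f"Best C: {grid.best_params_['C']}")
print(f"Cross-validation accuracy: {grid.best_score_:.3f}")
print(f"Test accuracy: {grid.score(X_test, y_test):.3f}")

By default, GridSearchCV refits the best model on the full training set, so grid.score evaluates the best C found.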
Issues to consider:
- The optimal C value depends on the dataset and problem
- Extremely large C values can cause overfitting to the training data
- Very small C values may underfit and have poor performance on both train and test data (see the sketch after this list)
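A quick way to check for these failure modes is to compare training and test accuracy at extreme C values. The sketch below reuses the synthetic dataset from the example above; the values 0.001 and 1000 are illustrative extremes, not recommendations.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_classes=2, n_features=10,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Compare train vs. test accuracy at extreme C values
for c in [0.001, 1, 1000]:
    svc = SVC(C=c, random_state=42).fit(X_train, y_train)
    train_acc = svc.score(X_train, y_train)
    test_acc = svc.score(X_test, y_test)
    print(f"C={c}: train={train_acc:.3f}, test={test_acc:.3f}")

A large gap between training and test accuracy suggests overfitting, while low accuracy on both suggests underfitting.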