The degree parameter in scikit-learn’s SVC class controls the complexity of the decision boundary when using a polynomial kernel.
Support Vector Machines (SVMs) are powerful algorithms for classification and regression tasks. The SVC class in scikit-learn implements Support Vector Classification, which can handle non-linearly separable data by using kernel functions that implicitly map the inputs into a higher-dimensional feature space.
The degree parameter is specific to the polynomial kernel (kernel='poly') and is ignored by all other kernels. It sets the degree of the polynomial kernel function, which is what allows the model to learn non-linear decision boundaries.
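To make the role of degree concrete, here is a minimal sketch of the polynomial kernel as scikit-learn documents it, K(x, x') = (gamma * <x, x'> + coef0) ** degree. It computes the kernel matrix by hand and compares it with sklearn.metrics.pairwise.polynomial_kernel. The random data and the values chosen for gamma and coef0 are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

# Small random matrix: 5 samples, 3 features (illustrative data only)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))

# Illustrative kernel settings; degree is the exponent of the kernel
degree, gamma, coef0 = 3, 0.5, 1.0

# Manual computation: (gamma * <x, x'> + coef0) ** degree for every pair
K_manual = (gamma * (X @ X.T) + coef0) ** degree

# The same kernel computed by scikit-learn's pairwise helper
K_sklearn = polynomial_kernel(X, X, degree=degree, gamma=gamma, coef0=coef0)

kernels_match = np.allclose(K_manual, K_sklearn)
print(f"Manual and scikit-learn kernels match: {kernels_match}")
```

Raising degree increases the exponent applied to every pairwise similarity, which is why larger values give the model more room to bend the decision boundary.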
The default value for degree is 3.
In practice, values between 2 and 5 are commonly used depending on the complexity of the dataset.
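As a quick sanity check, the sketch below reads the default from a freshly constructed SVC and confirms that degree has no effect when the kernel is not polynomial. The small synthetic dataset and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Read the default straight from an unconfigured estimator
default_degree = SVC().degree
print(f"Default degree: {default_degree}")

# Small illustrative dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# With an RBF kernel, degree is ignored, so both models should
# produce identical predictions regardless of its value
rbf_low = SVC(kernel="rbf", degree=2).fit(X, y)
rbf_high = SVC(kernel="rbf", degree=5).fit(X, y)

same_predictions = bool((rbf_low.predict(X) == rbf_high.predict(X)).all())
print(f"RBF predictions identical for degree=2 and degree=5: {same_predictions}")
```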
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different degree values
degree_values = [2, 3, 4, 5]
accuracies = []

for d in degree_values:
    # Fit a polynomial-kernel SVC with the current degree
    svc = SVC(kernel='poly', degree=d, random_state=42)
    svc.fit(X_train, y_train)
    # Evaluate on the held-out test set
    y_pred = svc.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"degree={d}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
degree=2, Accuracy: 0.960
degree=3, Accuracy: 0.970
degree=4, Accuracy: 0.935
degree=5, Accuracy: 0.965
The key steps in this example are:
- Generate a synthetic binary classification dataset with 10 features, of which 5 are informative and none are redundant
- Split the data into train and test sets
- Train SVC models with different degree values using a polynomial kernel
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting degree:
- Start with the default value of 3 and try values from 2 to 5
- Higher degree leads to more complex decision boundaries, which can capture intricate patterns
- Use cross-validation to select the optimal degree value for your dataset
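The cross-validation tip can be sketched with GridSearchCV, as below. The dataset mirrors the one used in the example above; the 5-fold setting and the candidate grid are assumptions chosen for illustration, so adjust both for your own data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Same synthetic dataset as in the main example
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# Candidate degree values to compare (illustrative grid)
param_grid = {"degree": [2, 3, 4, 5]}

# 5-fold cross-validated search over degree for a polynomial-kernel SVC
search = GridSearchCV(SVC(kernel="poly"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_degree = search.best_params_["degree"]
print(f"Best degree: {best_degree}, mean CV accuracy: {search.best_score_:.3f}")
```

Because each candidate is scored on several folds, this is less sensitive to a lucky or unlucky split than comparing accuracies on a single test set.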
Issues to consider:
- Setting the degree too high can lead to overfitting, especially on small datasets
- Training time tends to increase with higher degree values
- The polynomial kernel may not be suitable for very high-dimensional data
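One way to look for the overfitting risk mentioned above is to compare training and validation accuracy across degree values with validation_curve. This is a rough sketch on a deliberately small synthetic dataset; the sample size, the degree range, and the 5-fold setting are all assumptions, and the exact numbers will differ on other data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Deliberately small dataset, where high degrees are more likely to overfit
X, y = make_classification(n_samples=150, n_features=10, n_informative=5,
                           n_redundant=0, random_state=0)

# Illustrative range of degree values
degrees = [1, 2, 3, 5, 8]

# Rows correspond to degree values, columns to cross-validation folds
train_scores, val_scores = validation_curve(
    SVC(kernel="poly"), X, y,
    param_name="degree", param_range=degrees, cv=5, scoring="accuracy",
)

# Report mean train and validation accuracy, plus the gap between them
for d, tr, va in zip(degrees, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"degree={d}: train={tr:.3f}, validation={va:.3f}, gap={tr - va:.3f}")
```

A gap between training and validation accuracy that widens as degree grows would suggest the model is fitting noise rather than structure.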