The `p` parameter in scikit-learn's `KNeighborsClassifier` controls the distance metric used to find the nearest neighbors.
K-Nearest Neighbors (k-NN) is a non-parametric classification algorithm that predicts the class of a data point based on the majority class of its k nearest neighbors.
The `p` parameter selects a special case of the Minkowski distance:

- `p=1` corresponds to Manhattan distance
- `p=2` (the default) corresponds to Euclidean distance
- Other values of `p` give the general Minkowski distance
The choice of `p` affects how distances in the feature space are measured and can impact the model's performance, particularly in high-dimensional spaces.
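To see how `p` changes the measured distance, here is a minimal sketch that computes the Minkowski distance between two points for several values of `p`. The `minkowski` helper is written out for illustration; `scipy.spatial.distance.minkowski` provides the same computation.

```python
import numpy as np

# Two points in 2-D feature space
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

def minkowski(u, v, p):
    """Minkowski distance: (sum(|u_i - v_i|^p))^(1/p)."""
    return np.sum(np.abs(u - v) ** p) ** (1 / p)

print(minkowski(a, b, p=1))  # Manhattan: 7.0
print(minkowski(a, b, p=2))  # Euclidean: 5.0
print(minkowski(a, b, p=3))  # ≈ 4.498
```

Note that the same pair of points is 7.0 apart under Manhattan distance but only 5.0 apart under Euclidean distance, so different `p` values can change which neighbors count as "nearest".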
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different p values
p_values = [1, 2, 3]
accuracies = []
for p in p_values:
    knn = KNeighborsClassifier(n_neighbors=5, p=p)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"p={p}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:

```
p=1, Accuracy: 0.855
p=2, Accuracy: 0.800
p=3, Accuracy: 0.810
```
The key steps in this example are:

- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train `KNeighborsClassifier` models with different `p` values
- Evaluate the accuracy of each model on the test set
Tips and heuristics for setting `p`:

- Manhattan distance (`p=1`) is less affected by outliers than Euclidean distance (`p=2`)
- Higher values of `p` give more weight to larger differences in individual features
- The optimal `p` value depends on the geometry of the feature space
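The outlier effect in the first tip can be seen directly: as `p` grows, a single large feature difference contributes an ever larger share of the total distance. A small sketch (the point values are arbitrary, chosen so one feature difference is outlier-sized):

```python
import numpy as np

u = np.array([0.0, 0.0, 0.0])
v = np.array([1.0, 1.0, 10.0])  # third feature differs by an outlier-sized amount

diffs = np.abs(u - v)
for p in (1, 2, 3):
    d = np.sum(diffs ** p) ** (1 / p)
    # Fraction of sum(|diff|^p) contributed by the outlier feature
    share = diffs[-1] ** p / np.sum(diffs ** p)
    print(f"p={p}: distance={d:.3f}, outlier share={share:.1%}")
```

With `p=1` the outlier feature accounts for about 83% of the distance; by `p=3` it dominates almost entirely, which is why larger `p` makes the metric more outlier-sensitive.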
Issues to consider:

- Higher `p` values are more computationally expensive
- The effect of `p` interacts with the `n_neighbors` parameter
- The impact of `p` is more pronounced in high-dimensional spaces
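Because `p` and `n_neighbors` interact, one reasonable approach is to tune them jointly rather than one at a time. A sketch using `GridSearchCV` (the grid values and dataset settings here are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=5, n_redundant=0, random_state=42)

# Search p jointly with n_neighbors, since their effects interact
param_grid = {"n_neighbors": [3, 5, 7], "p": [1, 2, 3]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)
print("Best params:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")
```

Cross-validation picks the best combination for this particular feature-space geometry, rather than assuming a `p` value that worked with one `n_neighbors` setting will work with another.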