The n_neighbors parameter in scikit-learn's KNeighborsClassifier controls the number of nearest neighbors used to make predictions.
K-Nearest Neighbors (KNN) is a non-parametric algorithm that classifies new data points based on their similarity to the training examples. It predicts the class of a new point by finding the k closest training examples and taking a majority vote.
The n_neighbors parameter sets the value of k, i.e., how many neighbors are considered when making predictions. It has a significant impact on the model's behavior and performance.
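To make the mechanics concrete, here is a minimal sketch of the idea: classify one query point by a majority vote among its k nearest training points. This is plain NumPy for illustration (the knn_predict helper is ours), not scikit-learn's actual implementation, which uses optimized neighbor searches:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two points of class 0, three of class 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([1.0, 0.9]), k=3))  # prints 1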
The default value for n_neighbors is 5. In practice, values between 3 and 15 are commonly used, depending on the characteristics of the dataset.
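As a quick check, the default can be read back from an unconfigured estimator:

from sklearn.neighbors import KNeighborsClassifier
print(KNeighborsClassifier().get_params()["n_neighbors"])  # prints 5

The example below sweeps a range of values around this default on a synthetic dataset.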
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different n_neighbors values
n_neighbors_values = [1, 3, 5, 10, 20]
accuracies = []
for k in n_neighbors_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"n_neighbors={k}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
n_neighbors=1, Accuracy: 0.775
n_neighbors=3, Accuracy: 0.925
n_neighbors=5, Accuracy: 0.875
n_neighbors=10, Accuracy: 0.850
n_neighbors=20, Accuracy: 0.875
The key steps in this example are:
- Generate a synthetic binary classification dataset with two informative features
- Split the data into train and test sets
- Train KNeighborsClassifier models with different n_neighbors values
- Evaluate the accuracy of each model on the test set
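To see why accuracy varies with k, it helps to visualize the decision boundaries. The sketch below (assuming matplotlib is installed, and reusing X, X_train, y_train, and the KNeighborsClassifier import from the example above) plots the decision regions for a low and a high value of k:

import numpy as np
import matplotlib.pyplot as plt

# Grid of points covering the 2D feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, [1, 20]):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    Z = knn.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)  # shaded decision regions
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor="k", s=20)
    ax.set_title(f"n_neighbors={k}")
plt.show()

Typically, with n_neighbors=1 the boundary is jagged and traces individual training points, while with n_neighbors=20 it is much smoother.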
Some tips and heuristics for setting n_neighbors:
- Odd numbers prevent ties in binary classification
- Start with the default (5) and adjust up or down based on performance (a cross-validated search, sketched after this list, automates this)
- Lower values fit more complex decision boundaries but risk overfitting noise
- Higher values smooth the decision boundary but risk underfitting the data
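Rather than adjusting k by hand, a small cross-validated grid search is a common way to pick it. A minimal sketch, reusing X_train and y_train from the example above and searching the 3-15 range mentioned earlier:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": list(range(3, 16))}  # the commonly used 3-15 range
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)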
Issues to consider:
- There is no universally optimal value; it depends on the dataset
- Very low values (e.g., 1) are prone to overfitting noise in the training data
- Very high values lose predictive power and underfit the data
- The impact of n_neighbors tends to diminish in higher-dimensional spaces, where distances between points become less informative (a quick check is sketched below)
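As a rough check of the last point, the sketch below repeats the sweep from the main example on a 50-dimensional version of the same synthetic problem. Exact numbers will vary, but the accuracy differences across k values are typically flatter than in the 2D case:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Same setup as the main example, but with 50 features instead of 2
X_hd, y_hd = make_classification(n_samples=200, n_features=50, n_informative=2,
                                 n_redundant=0, n_clusters_per_class=1,
                                 random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_hd, y_hd, test_size=0.2, random_state=42)
for k in [1, 3, 5, 10, 20]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"n_neighbors={k}, Accuracy: {accuracy_score(y_te, knn.predict(X_te)):.3f}")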