The n_neighbors parameter in scikit-learn's KNeighborsClassifier controls the number of nearest neighbors used to make predictions.
K-Nearest Neighbors (KNN) is a non-parametric algorithm that classifies new data points based on their similarity to the training examples. It predicts the class of a new point by finding the k closest training examples and taking a majority vote.
The n_neighbors parameter sets the value of k, i.e., how many neighbors are considered when making predictions. It has a significant impact on the model's behavior and performance.
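To make the mechanics concrete, here is a minimal sketch of the idea: classify one query point by a majority vote among its k nearest training points. This is plain NumPy for illustration (the knn_predict helper is ours), not scikit-learn's actual implementation, which uses optimized neighbor searches:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query point to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data: two points of class 0, three of class 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([1.0, 0.9]), k=3))  # prints 1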
The default value for n_neighbors is 5. In practice, values between 3 and 15 are commonly used, depending on the characteristics of the dataset.
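As a quick check, the default can be read back from an unconfigured estimator:

from sklearn.neighbors import KNeighborsClassifier
print(KNeighborsClassifier().get_params()["n_neighbors"])  # prints 5

The example below sweeps a range of values around this default on a synthetic dataset.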
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different n_neighbors values
n_neighbors_values = [1, 3, 5, 10, 20]
accuracies = []
for k in n_neighbors_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"n_neighbors={k}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
n_neighbors=1, Accuracy: 0.775
n_neighbors=3, Accuracy: 0.925
n_neighbors=5, Accuracy: 0.875
n_neighbors=10, Accuracy: 0.850
n_neighbors=20, Accuracy: 0.875
The key steps in this example are:
- Generate a synthetic binary classification dataset with two informative features
- Split the data into train and test sets
- Train KNeighborsClassifier models with different n_neighbors values
- Evaluate the accuracy of each model on the test set
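To see why accuracy varies with k, it helps to visualize the decision boundaries. The sketch below (assuming matplotlib is installed, and reusing X, X_train, y_train, and the KNeighborsClassifier import from the example above) plots the decision regions for a low and a high value of k:

import numpy as np
import matplotlib.pyplot as plt

# Grid of points covering the 2D feature space
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, [1, 20]):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    Z = knn.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)  # shaded decision regions
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor="k", s=20)
    ax.set_title(f"n_neighbors={k}")
plt.show()

Typically, with n_neighbors=1 the boundary is jagged and traces individual training points, while with n_neighbors=20 it is much smoother.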
Some tips and heuristics for setting n_neighbors:
- Odd numbers prevent ties in binary classification
- Start with the default (5) and adjust up or down based on performance (a cross-validated search, sketched after this list, automates this)
- Lower values fit more complex decision boundaries but risk overfitting noise
- Higher values smooth the decision boundary but risk underfitting the data
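Rather than adjusting k by hand, a small cross-validated grid search is a common way to pick it. A minimal sketch, reusing X_train and y_train from the example above and searching the 3-15 range mentioned earlier:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": list(range(3, 16))}  # the commonly used 3-15 range
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)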
Issues to consider:
- There is no universally optimal value; it depends on the dataset
- Very low values (e.g., 1) are prone to overfitting noise in the training data
- Very high values lose predictive power and underfit the data
- The impact of n_neighbors tends to diminish in higher-dimensional spaces, where distances between points become less informative (a quick check is sketched below)
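As a rough check of the last point, the sketch below repeats the sweep from the main example on a 50-dimensional version of the same synthetic problem. Exact numbers will vary, but the accuracy differences across k values are typically flatter than in the 2D case:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Same setup as the main example, but with 50 features instead of 2
X_hd, y_hd = make_classification(n_samples=200, n_features=50, n_informative=2,
                                 n_redundant=0, n_clusters_per_class=1,
                                 random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_hd, y_hd, test_size=0.2, random_state=42)
for k in [1, 3, 5, 10, 20]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"n_neighbors={k}, Accuracy: {accuracy_score(y_te, knn.predict(X_te)):.3f}")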