Tuning leaf_size in scikit-learn's KNeighborsClassifier
The leaf_size parameter in scikit-learn's KNeighborsClassifier controls the leaf size of the ball tree or KD tree used for efficient neighbor search.

K-Nearest Neighbors (KNN) is a non-parametric algorithm that classifies new data points based on the majority class among the K nearest training examples. The leaf_size parameter determines the maximum number of data points in each leaf node of the tree structure used to speed up neighbor search.
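To see that leaf_size belongs to the search data structure rather than the model itself, here is a minimal sketch using sklearn.neighbors.KDTree directly (the random data is purely illustrative):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)

# Build KD trees with different leaf sizes; the tree shapes differ,
# but both return the same nearest neighbors.
tree_small = KDTree(X, leaf_size=5)
tree_large = KDTree(X, leaf_size=100)

dist_s, ind_s = tree_small.query(X[:10], k=3)
dist_l, ind_l = tree_large.query(X[:10], k=3)

print(np.array_equal(ind_s, ind_l))  # True: same neighbors either way
```

The same leaf_size semantics apply to BallTree, which KNeighborsClassifier may choose instead of a KD tree.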
Smaller values of leaf_size lead to deeper trees with more leaves, allowing finer-grained pruning during a search but increasing memory usage and build time. Larger values create shallower trees with fewer leaves, reducing memory consumption but forcing a brute-force scan over more points within each leaf, which can slow queries. Importantly, leaf_size affects only speed and memory, not which neighbors are found.
The default value for leaf_size is 30. In practice, values between 10 and 100 are commonly used, depending on the size and dimensionality of the dataset.
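The default can be confirmed quickly via get_params, which is available on every scikit-learn estimator:

```python
from sklearn.neighbors import KNeighborsClassifier

# The documented default for leaf_size is 30
print(KNeighborsClassifier().get_params()["leaf_size"])  # 30
```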
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different leaf_size values
leaf_size_values = [10, 30, 50, 100]
accuracies = []
for ls in leaf_size_values:
    knn = KNeighborsClassifier(n_neighbors=5, leaf_size=ls)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"leaf_size={ls}, Accuracy: {accuracy:.3f}")
```
Running the example gives an output like:
```
leaf_size=10, Accuracy: 0.905
leaf_size=30, Accuracy: 0.905
leaf_size=50, Accuracy: 0.905
leaf_size=100, Accuracy: 0.905
```
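The identical accuracies are no coincidence: leaf_size changes only how the neighbor search is organized, not which neighbors are returned. A short sketch on the same kind of synthetic data makes this explicit:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

preds = []
for ls in (10, 100):
    knn = KNeighborsClassifier(n_neighbors=5, leaf_size=ls).fit(X_train, y_train)
    preds.append(knn.predict(X_test))

# leaf_size only changes how neighbors are searched, not which ones are found
print(np.array_equal(preds[0], preds[1]))  # True
```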
The key steps in this example are:

- Generate a synthetic binary classification dataset with informative and redundant features
- Split the data into train and test sets
- Train KNeighborsClassifier models with different leaf_size values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting leaf_size:

- The default value of 30 works well for many datasets, but it's worth tuning when neighbor queries are a bottleneck
- Smaller values create deeper trees, which can speed up searches but consume more memory
- Larger values create shallower trees, reducing memory usage but potentially slowing queries, since more points must be scanned by brute force within each leaf
Issues to consider:

- The optimal leaf_size depends on the size, dimensionality, and structure of the dataset
- Very small values can lead to overly deep trees and excessive memory usage
- Very large values can result in under-optimized trees and slower neighbor searches
- Because leaf_size does not change which neighbors are found, its impact on accuracy is negligible compared with parameters like n_neighbors
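Since n_neighbors usually matters far more for accuracy, one reasonable approach is to tune it with cross-validation and treat leaf_size as a secondary, speed-oriented knob. A minimal sketch with GridSearchCV (grid values chosen for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Search over n_neighbors (affects predictions) and leaf_size (affects speed)
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 11], "leaf_size": [10, 30, 100]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

Expect ties across leaf_size values here, since it cannot change the cross-validated score.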