The n_jobs parameter in scikit-learn’s KNeighborsClassifier controls how many parallel jobs, and so in practice how many CPU cores, are used for the neighbor search.
KNeighborsClassifier is a simple, instance-based learning algorithm used for classification tasks. It classifies a data point based on the majority class among its k-nearest neighbors in the feature space.
The n_jobs parameter specifies the number of parallel jobs to run for the neighbor search, which happens in calls such as predict and kneighbors; it does not affect fit. It accepts integer values, where -1 uses all available cores.
The default value for n_jobs is None, which means a single job (one core) unless the call runs inside a joblib.parallel_backend context.
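As a quick check of how these values resolve on a given machine, the following sketch asks joblib (the library scikit-learn relies on for parallelism) how many workers each setting maps to. It is an illustration only and assumes it runs outside any joblib backend context, so the printed counts depend on the machine.

```python
import os

from joblib import effective_n_jobs

# Ask joblib how many workers each n_jobs setting resolves to here.
# None falls back to the active backend's default (typically 1),
# and -1 resolves to the number of cores joblib considers available.
for requested in (None, 1, 2, -1):
    resolved = effective_n_jobs(requested)
    print(f"n_jobs={requested!r} -> {resolved} worker(s)")

print(f"Logical cores reported by the OS: {os.cpu_count()}")
```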
In practice, common values for n_jobs include 1, 2, and -1 (using all cores), depending on the computational resources and the size of the dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values
n_jobs_values = [1, 2, -1]
results = []

for n in n_jobs_values:
    knn = KNeighborsClassifier(n_jobs=n)

    # Time the fit call (n_jobs parallelizes the neighbor search, not fit)
    start_time = time.time()
    knn.fit(X_train, y_train)
    end_time = time.time()

    # Evaluate on the held-out test set
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    elapsed_time = end_time - start_time
    results.append((n, accuracy, elapsed_time))
    print(f"n_jobs={n}, Accuracy: {accuracy:.3f}, Training Time: {elapsed_time:.2f} seconds")
Running the example gives an output like:
n_jobs=1, Accuracy: 0.905, Training Time: 0.00 seconds
n_jobs=2, Accuracy: 0.905, Training Time: 0.00 seconds
n_jobs=-1, Accuracy: 0.905, Training Time: 0.00 seconds
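The training times above are all close to zero because fit does little work for this model and is not parallelized by n_jobs. A variant that times predict, where the neighbor search actually runs, might look like the sketch below. The larger dataset size and the variable names are assumptions chosen for illustration, and the timings will vary by machine, so no expected output is shown.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Larger synthetic dataset so that predict does measurable work
# (the sizes here are illustrative assumptions)
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

predictions = {}
predict_times = {}

for n in [1, -1]:
    knn = KNeighborsClassifier(n_jobs=n)
    knn.fit(X_train, y_train)

    # Time only the prediction step, where the neighbor search happens
    start = time.perf_counter()
    predictions[n] = knn.predict(X_test)
    predict_times[n] = time.perf_counter() - start
    print(f"n_jobs={n}, Prediction Time: {predict_times[n]:.4f} seconds")

# n_jobs changes how the work is scheduled, not the predicted labels
print("Same predictions:", np.array_equal(predictions[1], predictions[-1]))
```

On small inputs the overhead of starting parallel workers can outweigh any gain, so a speedup from n_jobs=-1 is not guaranteed.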
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and redundant features.
- Split the data into train and test sets.
- Train KNeighborsClassifier models with different n_jobs values.
- Measure the training time of each model and evaluate its accuracy on the test set.
Some tips and heuristics for setting n_jobs:
- Set n_jobs to -1 to use all available CPU cores, which can speed up prediction on large datasets.
- For smaller datasets or when computational resources are limited, use n_jobs=1, since the overhead of parallel workers can outweigh the gain.
- Monitor the system’s CPU usage to confirm that the extra cores are actually being used.
Issues to consider:
- The benefit of parallel processing depends on the size and complexity of the dataset.
- Using all available cores (n_jobs=-1) can lead to resource contention in multi-user environments.
- Some environments (e.g., certain cloud services) may limit the number of cores available for parallel processing.
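One way to reduce contention on a shared machine is to leave some cores free instead of using n_jobs=-1. The sketch below uses a hypothetical helper, capped_n_jobs, which is not part of scikit-learn, to compute an explicit worker count. It relies on os.cpu_count(), which reports the machine's logical cores and may overstate what a container or cloud instance is actually allowed to use.

```python
import os

from sklearn.neighbors import KNeighborsClassifier


def capped_n_jobs(reserve=1):
    """Hypothetical helper: worker count that leaves `reserve` cores free."""
    total = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, total - reserve)


# Leave one core free for other users or processes
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=capped_n_jobs(reserve=1))
print(f"Using n_jobs={knn.n_jobs} of {os.cpu_count()} logical cores")
```

scikit-learn also accepts negative values below -1 with a similar effect: n_jobs=-2 means all cores but one.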