Nystroem is an effective method for approximating a kernel map using a subset of the training data, which keeps the computational cost manageable on large datasets. It is commonly used in scikit-learn to speed up kernel methods by replacing the implicit kernel feature space with an explicit, lower-dimensional approximation.
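To make the idea of "approximating a kernel map" concrete, the sketch below rebuilds an RBF kernel matrix from the inner products of the Nystroem features and compares it with the exact matrix. The dataset, the gamma value, and the helper name relative_error are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

# small synthetic dataset (illustrative choice)
X, _ = make_classification(n_samples=100, n_features=20, random_state=1)

# exact RBF kernel matrix, shape (100, 100)
K_exact = rbf_kernel(X, gamma=0.2)

def relative_error(n_components):
    # hypothetical helper: map the data to n_components features,
    # then rebuild the kernel matrix from their inner products
    feature_map = Nystroem(kernel='rbf', gamma=0.2, n_components=n_components, random_state=1)
    Z = feature_map.fit_transform(X)
    K_approx = Z @ Z.T
    return np.linalg.norm(K_exact - K_approx) / np.linalg.norm(K_exact)

err_small = relative_error(10)   # subset of 10 samples
err_full = relative_error(100)   # every sample used as a basis point
print('Relative error, 10 components: %.3f' % err_small)
print('Relative error, 100 components: %.3f' % err_full)
```

Using more components should bring the approximation closer to the exact kernel, at the cost of a larger feature matrix.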
The key hyperparameters of Nystroem include kernel (type of kernel), n_components (number of components), and gamma (kernel coefficient).
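One way these hyperparameters might be tuned is with a grid search over a pipeline. The following is a minimal sketch, not a recommended configuration: the step names feature_map and logistic, the candidate values, and the max_iter setting are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# synthetic binary classification dataset (illustrative)
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)

pipeline = Pipeline([
    ('feature_map', Nystroem(kernel='rbf', random_state=1)),
    ('logistic', LogisticRegression(max_iter=1000)),
])

# the 'feature_map__' prefix routes each value to the Nystroem step;
# n_components stays below the 80 samples in each training fold
param_grid = {
    'feature_map__n_components': [10, 40],
    'feature_map__gamma': [0.01, 0.1],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)
print('Best params:', search.best_params_)
print('Best CV accuracy: %.3f' % search.best_score_)
```

Searching inside the pipeline means the kernel approximation is refit on each training fold, so the cross-validation score reflects the whole chain rather than the classifier alone.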
This technique is appropriate for both classification and regression tasks.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
# generate binary classification dataset
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# create Nystroem kernel approximation
feature_map_nystroem = Nystroem(kernel='rbf', n_components=10, gamma=0.2)
# create logistic regression model
model = LogisticRegression()
# create a pipeline that first transforms the data then fits the model
pipeline = Pipeline([
    ('feature_map', feature_map_nystroem),
    ('logistic', model)
])
# fit the pipeline on the training data
pipeline.fit(X_train, y_train)
# evaluate the model
yhat = pipeline.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
# make a prediction
row = [[0.5, 0.3, 1.0, -0.2, 0.1, 1.2, 0.8, -1.5, -0.3, 0.2, 0.4, 0.6, -0.1, 1.3, -0.7, 0.8, -0.6, -0.8, 0.5, 1.0]]
yhat = pipeline.predict(row)
print('Predicted: %d' % yhat[0])
Running the example gives an output like:
Accuracy: 0.400
Predicted: 1
The steps are as follows:
1. First, a synthetic binary classification dataset is generated using make_classification(). This creates a dataset with 100 samples, 20 features, and a fixed random seed for reproducibility. The dataset is split into training and test sets using train_test_split().
2. Next, a Nystroem object is instantiated to create a kernel approximation using the RBF kernel with 10 components and a gamma value of 0.2.
3. A LogisticRegression model is created to perform the classification.
4. A Pipeline is constructed, chaining the Nystroem transformer and the LogisticRegression model.
5. The pipeline is then fit on the training data.
6. The performance of the model is evaluated by predicting the test set and computing the accuracy score.
7. Finally, a single prediction is made by passing a new data sample to the pipeline.predict() method.
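Because Nystroem builds its feature map from a randomly sampled subset of the training data, results can differ between runs unless a seed is fixed. The sketch below, which assumes the same dataset settings as the example and adds a random_state argument that the example does not use, shows the shape of the features the classifier receives and checks that a fixed seed gives repeatable output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem

X, _ = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)

# same seed means the same sampled subset, and so the same features
Z_a = Nystroem(kernel='rbf', n_components=10, gamma=0.2, random_state=1).fit_transform(X)
Z_b = Nystroem(kernel='rbf', n_components=10, gamma=0.2, random_state=1).fit_transform(X)

same = np.allclose(Z_a, Z_b)
print('Feature shape:', Z_a.shape)
print('Identical with fixed seed:', same)
```

The transformed data has one column per component, so the downstream model is fit on 10 features rather than the original 20.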
This example demonstrates how to use Nystroem for kernel approximation to speed up kernel methods in scikit-learn, showcasing the integration of Nystroem with a logistic regression model in a pipeline for binary classification tasks.
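The same pipeline pattern carries over to regression. The following is a minimal sketch under assumed settings: the make_regression dataset, the choice of Ridge as the linear model, and the n_components, gamma, and alpha values are illustrative and not taken from the example above.

```python
from sklearn.datasets import make_regression
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# synthetic regression dataset (illustrative)
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# approximate RBF features followed by a linear regressor
pipeline = Pipeline([
    ('feature_map', Nystroem(kernel='rbf', n_components=50, gamma=0.05, random_state=1)),
    ('ridge', Ridge(alpha=1.0)),
])
pipeline.fit(X_train, y_train)

# evaluate on the held-out data
yhat = pipeline.predict(X_test)
print('R^2: %.3f' % r2_score(y_test, yhat))
```

Only the final estimator and the metric change; the Nystroem step is used exactly as in the classification pipeline.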