Scikit-Learn Nystroem for Data Preparation

Nystroem approximates a kernel feature map using a random subset of the training data as basis points, which keeps kernel methods computationally tractable on large datasets. In scikit-learn it is commonly used to speed up kernel methods by replacing the exact kernel with an explicit, lower-dimensional feature representation that a linear model can consume.

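The key hyperparameters of Nystroem include kernel (type of kernel, such as 'rbf'), n_components (number of sampled basis points, which is also the number of output features), and gamma (kernel coefficient used by the RBF and similar kernels).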
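
To make the role of n_components concrete, the following minimal sketch transforms a small random matrix and inspects the result. The data, the variable names, and the random_state value are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem

# hypothetical random data: 50 samples, 20 features
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 20))

# map the data using 10 sampled basis points
feature_map = Nystroem(kernel='rbf', gamma=0.2, n_components=10, random_state=0)
X_mapped = feature_map.fit_transform(X_demo)

# the transformed data has one column per component
print(X_mapped.shape)
# the sampled basis points keep the original feature dimension
print(feature_map.components_.shape)
```

The transformed matrix has n_components columns regardless of how many input features there are, which is what lets a linear model downstream work on a fixed-size representation.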

This technique is appropriate for both classification and regression tasks.
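
For the regression case, a minimal sketch might pair Nystroem with a linear regressor in the same way. The choice of Ridge, the synthetic dataset from make_regression(), and all parameter values here are assumptions for illustration, not a recommended configuration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error

# generate a synthetic regression dataset (illustrative values)
X, y = make_regression(n_samples=100, n_features=20, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# chain the kernel approximation with a linear regressor
reg_pipeline = Pipeline([
    ('feature_map', Nystroem(kernel='rbf', n_components=10, gamma=0.2, random_state=1)),
    ('ridge', Ridge())
])

# fit on the training data and evaluate on the held-out data
reg_pipeline.fit(X_train, y_train)
yhat = reg_pipeline.predict(X_test)
print('MAE: %.3f' % mean_absolute_error(y_test, yhat))
```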

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# generate binary classification dataset
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# create Nystroem kernel approximation
feature_map_nystroem = Nystroem(kernel='rbf', n_components=10, gamma=0.2)

# create logistic regression model
model = LogisticRegression()

# create a pipeline that first transforms the data then fits the model
pipeline = Pipeline([
    ('feature_map', feature_map_nystroem),
    ('logistic', model)
])

# fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# evaluate the model
yhat = pipeline.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

# make a prediction
row = [[0.5, 0.3, 1.0, -0.2, 0.1, 1.2, 0.8, -1.5, -0.3, 0.2, 0.4, 0.6, -0.1, 1.3, -0.7, 0.8, -0.6, -0.8, 0.5, 1.0]]
yhat = pipeline.predict(row)
print('Predicted: %d' % yhat[0])

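Running the example gives an output like the following. Exact values may vary between runs, because Nystroem samples its basis points at random and no random_state is set: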

Accuracy: 0.400
Predicted: 1
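
If repeatable numbers are needed, Nystroem accepts a random_state argument that fixes which training samples are drawn as basis points. The sketch below reuses the dataset and settings from the example; the helper fit_and_score() and the seed value are hypothetical names introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# same dataset and split as in the example
X, y = make_classification(n_samples=100, n_features=20, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

def fit_and_score(seed):
    # hypothetical helper: build, fit, and score a pipeline with a seeded Nystroem
    pipeline = Pipeline([
        ('feature_map', Nystroem(kernel='rbf', n_components=10, gamma=0.2, random_state=seed)),
        ('logistic', LogisticRegression())
    ])
    pipeline.fit(X_train, y_train)
    return accuracy_score(y_test, pipeline.predict(X_test))

# two runs with the same seed sample the same basis points
acc_a = fit_and_score(1)
acc_b = fit_and_score(1)
print('Same seed, same accuracy: %s' % (acc_a == acc_b))
```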

The steps are as follows:

First, a synthetic binary classification dataset is generated using make_classification(). This creates a dataset with 100 samples, 20 features, and a fixed random seed for reproducibility. The dataset is split into training and test sets using train_test_split().
Next, a Nystroem object is instantiated to create a kernel approximation using the RBF kernel with 10 components and a gamma value of 0.2.
A LogisticRegression model is created to perform the classification.
A Pipeline is constructed, chaining the Nystroem transformer and the LogisticRegression model.
The pipeline is then fit on the training data.
The performance of the model is evaluated by predicting the test set and computing the accuracy score.
Finally, a single prediction is made by passing a new data sample to the pipeline.predict() method.

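This example demonstrates how to use Nystroem for kernel approximation in scikit-learn by combining it with a logistic regression model in a pipeline for a binary classification task. The dataset here is small, so the example illustrates the workflow rather than a speedup; the computational benefit over exact kernel methods appears on large datasets.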
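
One way to see what the approximation does is to compare the inner products of the transformed features against the exact RBF kernel matrix. The following is a rough sketch on hypothetical random data; the sample size, the n_components values tried, and the error measure are all assumptions made for illustration.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.metrics.pairwise import rbf_kernel

# hypothetical random data: 50 samples, 20 features
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 20))

# exact RBF kernel matrix for reference
gamma = 0.2
K_exact = rbf_kernel(X_demo, gamma=gamma)

# approximate the kernel with an increasing number of components
errors = {}
for n in (5, 25, 50):
    Z = Nystroem(kernel='rbf', gamma=gamma, n_components=n, random_state=0).fit_transform(X_demo)
    # inner products of the mapped features approximate the kernel matrix
    errors[n] = np.abs(Z @ Z.T - K_exact).mean()
    print('n_components=%d, mean abs error=%.6f' % (n, errors[n]))
```

When n_components equals the number of samples, every sample serves as a basis point and the approximation matches the exact kernel up to numerical precision. Smaller values trade accuracy for a cheaper feature map.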

See Also