Scikit-Learn RBFSampler for Data Preparation

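RBFSampler approximates the feature map of the Radial Basis Function (RBF) kernel using random Fourier features. It maps the data into a new feature space, typically of higher dimension than the input (n_components features, 100 by default), so that linear algorithms fitted on the transformed data can model non-linear problems.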
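To make the idea of "approximating the kernel" concrete, the following minimal sketch compares the exact RBF kernel matrix with the inner products of the features produced by RBFSampler. The random data, gamma=0.1, and n_components=2000 are illustrative assumptions chosen for this check, not values from the example that follows.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

# small random dataset (illustrative assumption)
rng = np.random.RandomState(1)
X = rng.normal(size=(50, 5))
gamma = 0.1

# exact RBF kernel matrix: exp(-gamma * ||x - y||^2)
K_exact = rbf_kernel(X, X, gamma=gamma)

# approximate feature map; more components generally give a closer approximation
sampler = RBFSampler(gamma=gamma, n_components=2000, random_state=1)
Z = sampler.fit_transform(X)

# inner products of the mapped features approximate the kernel values
K_approx = Z @ Z.T
mean_err = np.abs(K_exact - K_approx).mean()

print('Transformed shape:', Z.shape)
print('Mean absolute error: %.4f' % mean_err)
```

The error should shrink as n_components grows, at the cost of a wider feature matrix for the downstream linear model.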

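The primary hyperparameter of RBFSampler is gamma, the coefficient of the RBF kernel exp(-gamma * ||x - y||^2). Larger values of gamma give a narrower kernel, so only very close points are treated as similar. A second hyperparameter, n_components, sets the number of random features and therefore the quality of the approximation. This technique is useful for preprocessing in both classification and regression tasks where non-linear relationships are present.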
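As a rough illustration of what gamma controls, the sketch below evaluates the exact RBF kernel between two fixed points for several gamma values. The two points and the list of gamma values are assumptions picked for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# two points whose squared Euclidean distance is 2 (illustrative assumption)
x = np.array([[0.0, 0.0]])
y = np.array([[1.0, 1.0]])

# kernel value k(x, y) = exp(-gamma * ||x - y||^2) for each gamma
values = {}
for gamma in (0.01, 0.1, 1.0, 10.0):
    values[gamma] = rbf_kernel(x, y, gamma=gamma)[0, 0]
    print('gamma=%.2f k(x, y)=%.6f' % (gamma, values[gamma]))
```

The kernel value falls as gamma increases: with a large gamma the same pair of points is treated as almost unrelated, while with a small gamma they are treated as nearly identical.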

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# generate a synthetic dataset
X, y = make_classification(n_samples=100, n_features=20, random_state=1)

# split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# create an RBFSampler with a specific gamma value
rbf_sampler = RBFSampler(gamma=1.0, random_state=1)

# create a linear classifier pipeline with RBFSampler
model = make_pipeline(rbf_sampler, SGDClassifier(random_state=1))

# fit the model on the training data
model.fit(X_train, y_train)

# evaluate the model
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

# make a prediction
row = [X_test[0]]
yhat = model.predict(row)
print('Predicted: %d' % yhat[0])

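Running the example gives an output like:

Accuracy: 0.500
Predicted: 0

An accuracy of 0.500 is roughly chance level for a balanced two-class dataset, which suggests that gamma=1.0 is poorly matched to this data. In practice, gamma usually needs to be tuned.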

The steps are as follows:

1. Generate a synthetic classification dataset using make_classification(). This creates a dataset with a specified number of samples (n_samples) and features (n_features) with a fixed random seed (random_state) for reproducibility.
2. Split the dataset into training and test sets using train_test_split().
3. Create an RBFSampler instance with a specified gamma value. This approximates the RBF kernel by mapping the input data into a higher-dimensional space.
4. Create a pipeline using make_pipeline(), combining RBFSampler and SGDClassifier, which is a linear classifier.
5. Fit the pipeline on the training data using the fit() method.
6. Evaluate the model by predicting on the test set and calculating the accuracy score with accuracy_score().
7. Make a single prediction using the trained model by passing a sample from the test set to the predict() method.

This example illustrates how to use RBFSampler for kernel approximation in scikit-learn, enabling the use of linear classifiers on non-linear data by mapping it to a higher-dimensional space.
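Because the example fixes gamma=1.0 and reports an accuracy of 0.500, it may be worth searching over gamma rather than choosing it by hand. The following is a minimal sketch, assuming the same synthetic dataset and pipeline as above; the candidate gamma values and the 5-fold cross-validation are illustrative assumptions, not recommendations from the original example.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

# same synthetic dataset and split as the main example
X, y = make_classification(n_samples=100, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# make_pipeline names each step after its lowercased class name,
# so the sampler's gamma is addressed as "rbfsampler__gamma"
pipeline = make_pipeline(RBFSampler(random_state=1), SGDClassifier(random_state=1))

# candidate gamma values (illustrative assumption)
param_grid = {'rbfsampler__gamma': [0.001, 0.01, 0.1, 1.0]}

# cross-validated search over gamma on the training data only
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

# evaluate the refitted best pipeline on the held-out test set
best_gamma = search.best_params_['rbfsampler__gamma']
test_acc = search.score(X_test, y_test)
print('Best gamma:', best_gamma)
print('Test accuracy: %.3f' % test_acc)
```

The search is fitted on the training split only, so the test set still gives an untouched estimate of performance. With a dataset this small the scores are noisy, so the selected gamma should be read as indicative rather than definitive.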

See Also