Scikit-Learn AdditiveChi2Sampler for Data Preparation

AdditiveChi2Sampler is a kernel approximation method in scikit-learn. It maps non-negative input features into a higher-dimensional space where a linear classifier can approximate a model trained with the additive chi-squared kernel.
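
To make the idea of approximation concrete, the sketch below compares dot products of the transformed features against the additive chi-squared kernel computed directly as k(x, y) = sum_i 2 * x_i * y_i / (x_i + y_i). The random histogram-like data, the variable names, and the choice of sample_steps=3 are assumptions made for illustration and are not part of the example that follows.

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler

rng = np.random.RandomState(0)
# illustrative data: strictly positive rows, normalized to sum to 1 like histograms
A = rng.uniform(0.1, 1.0, size=(5, 8))
A /= A.sum(axis=1, keepdims=True)

# exact additive chi-squared kernel between every pair of rows
A_i = A[:, np.newaxis, :]
A_j = A[np.newaxis, :, :]
exact = (2 * A_i * A_j / (A_i + A_j)).sum(axis=2)

# approximate kernel: plain dot products between the transformed rows
sampler = AdditiveChi2Sampler(sample_steps=3)
A_mapped = sampler.fit_transform(A)
approx = A_mapped @ A_mapped.T

max_abs_error = np.abs(exact - approx).max()
print('Max absolute error: %.3f' % max_abs_error)
```

The two matrices should agree only roughly, since the transform uses a small, fixed number of sampling points.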

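Key parameters include sample_steps (the number of sampling points used in the approximation, 2 by default) and sample_interval (the spacing between those points, which must be set explicitly when sample_steps is not 1, 2, or 3).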
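
As a rough illustration of what sample_steps controls, the sketch below transforms the same array with different settings and prints the resulting number of features, which grows as n_features * (2 * sample_steps - 1). The array X_demo and the sample_interval value of 0.3 are arbitrary choices made for this sketch.

```python
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler

rng = np.random.RandomState(0)
# illustrative non-negative data: 10 rows, 20 features
X_demo = rng.uniform(0.0, 1.0, size=(10, 20))

shapes = {}
for steps in (1, 2, 3):
    X_new = AdditiveChi2Sampler(sample_steps=steps).fit_transform(X_demo)
    shapes[steps] = X_new.shape
    print('sample_steps=%d -> %d features' % (steps, X_new.shape[1]))

# beyond 3 steps an explicit sample_interval is required; 0.3 is an arbitrary value
X_wide = AdditiveChi2Sampler(sample_steps=4, sample_interval=0.3).fit_transform(X_demo)
print('sample_steps=4 -> %d features' % X_wide.shape[1])
```

More steps give a closer approximation at the cost of a wider feature matrix for the downstream linear model.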

The algorithm is appropriate for classification problems whose features are non-negative counts or histograms, such as bag-of-words representations of text and histogram features extracted from images.
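
For text, one possible setup is to feed word counts straight into the sampler, since counts are non-negative by construction. The sketch below assumes a tiny hand-written corpus and labels (both hypothetical) and uses CountVectorizer to produce the counts; it only shows how the pieces connect and is not a realistic benchmark.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical toy corpus with two topics
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market fell today",
    "investors sold shares and bonds",
]
labels = [0, 0, 1, 1]

# word counts -> chi-squared feature map -> linear classifier
text_model = make_pipeline(
    CountVectorizer(),
    AdditiveChi2Sampler(sample_steps=2),
    LogisticRegression(),
)
text_model.fit(docs, labels)

# inspect the features seen by the classifier (all steps except the last)
mapped = text_model[:-1].transform(docs)
vocab_size = len(text_model.named_steps['countvectorizer'].vocabulary_)
print('Vocabulary size: %d, mapped features: %d' % (vocab_size, mapped.shape[1]))
```

The sampler accepts the sparse count matrix directly, so no dense conversion is needed in this pipeline.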

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score
import numpy as np

# generate synthetic dataset
X, y = make_classification(n_samples=100, n_features=20, random_state=1)
# AdditiveChi2Sampler requires non-negative input, so take absolute values
X = np.abs(X)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# create AdditiveChi2Sampler
chi2_feature = AdditiveChi2Sampler(sample_steps=2)

# create a pipeline with AdditiveChi2Sampler and LogisticRegression
model = make_pipeline(chi2_feature, LogisticRegression())

# fit model
model.fit(X_train, y_train)

# evaluate model
yhat = model.predict(X_test)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

# make a prediction
row = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]]
yhat = model.predict(row)
print('Predicted: %d' % yhat[0])

Running the example gives an output like:

Accuracy: 0.600
Predicted: 0

The steps are as follows:

1. Generate a synthetic dataset using make_classification() with 100 samples and 20 features, then take absolute values with np.abs() because AdditiveChi2Sampler rejects negative inputs. This simulates a classification problem with random data.
2. Split the dataset into training and test sets using train_test_split().
3. Create an AdditiveChi2Sampler instance with sample_steps=2 to approximate the chi-squared kernel.
4. Use make_pipeline() to create a pipeline that includes the AdditiveChi2Sampler and a LogisticRegression model.
5. Fit the pipeline model on the training data.
6. Evaluate the model by predicting the test data and calculating the accuracy score.
7. Make a prediction with the fitted model using a new data sample.

This example demonstrates how to use AdditiveChi2Sampler to transform data for efficient processing with a linear classifier, such as LogisticRegression. The pipeline simplifies the workflow, integrating both feature transformation and model training.
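
Taking absolute values folds negative feature values onto positive ones, which changes the data. One alternative, sketched below as an assumption rather than as part of the original example, is to rescale the features to the range [0, 1] with MinMaxScaler inside the same pipeline. The clip=True option keeps unseen test values inside that range so the sampler never receives negative input. The variable names here are chosen to avoid clashing with the example above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# same synthetic dataset as above, but without np.abs()
X_raw, y_raw = make_classification(n_samples=100, n_features=20, random_state=1)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=1)

# scale to [0, 1] first; clip=True keeps unseen values within that range
scaled_model = make_pipeline(
    MinMaxScaler(clip=True),
    AdditiveChi2Sampler(sample_steps=2),
    LogisticRegression(),
)
scaled_model.fit(Xr_train, yr_train)

scaled_acc = scaled_model.score(Xr_test, yr_test)
print('Accuracy: %.3f' % scaled_acc)
```

Because the scaler is fitted inside the pipeline, its minimum and maximum come from the training split only.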
