Scikit-Learn PolynomialFeatures for Data Preprocessing

Generate polynomial and interaction features to improve model performance on complex datasets.

PolynomialFeatures generates new features representing all polynomial combinations of the original features up to a specified degree.

The key hyperparameters of PolynomialFeatures are degree (the maximum degree of the generated polynomial terms), interaction_only (if True, only products of distinct features are generated, with no powers of a single feature), and include_bias (if True, a constant column of ones is added).

This technique is appropriate for feature engineering in regression and classification problems where the target depends on polynomial terms or on interactions between features, which a linear model cannot capture from the raw inputs alone.
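
As a rough illustration of when the transform helps, the sketch below builds a hypothetical dataset whose target depends on a squared term and an interaction, then compares a plain linear model with one fit on degree-2 features. The data, the variable names, and the use of make_pipeline are assumptions for this illustration, and scores are computed on the training data only to keep it short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# hypothetical data: the target is x0^2 + x0*x1, with no linear term
rng = np.random.default_rng(0)
X_nl = rng.uniform(-3, 3, size=(200, 2))
y_nl = X_nl[:, 0] ** 2 + X_nl[:, 0] * X_nl[:, 1]

# linear model on the raw features
linear_only = LinearRegression().fit(X_nl, y_nl)

# the same model on degree-2 polynomial features
poly_linear = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
poly_linear.fit(X_nl, y_nl)

# compare R^2 on the training data
r2_linear = linear_only.score(X_nl, y_nl)
r2_poly = poly_linear.score(X_nl, y_nl)
print('R^2 raw features: %.3f' % r2_linear)
print('R^2 polynomial features: %.3f' % r2_poly)
```

The polynomial model can represent this target exactly because x0^2 and x0*x1 are among the generated columns.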

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

# generate regression dataset
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=1)

# create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=1)

# create model
model = LinearRegression()

# fit model
model.fit(X_train, y_train)

# evaluate model
yhat = model.predict(X_test)
mse = mean_squared_error(y_test, yhat)
print('Mean Squared Error: %.3f' % mse)

# make a prediction
row = [[-0.6, 0.7]]
row_poly = poly.transform(row)
yhat = model.predict(row_poly)
print('Predicted: %.3f' % yhat[0])

Running the example gives an output like:

Mean Squared Error: 0.017
Predicted: 41.150

The steps are as follows:

1. Generate a synthetic regression dataset using make_regression(). This creates a dataset with a specified number of samples (n_samples) and features (n_features), a small amount of Gaussian noise (noise), and a fixed random seed (random_state) for reproducibility.
2. Create polynomial features up to the second degree using PolynomialFeatures with include_bias=False. This transformation adds squared and interaction terms to the feature set, which can capture more complex relationships in the data.
3. Split the dataset into training and test sets using train_test_split(). This allows the model to be evaluated on unseen data.
4. Instantiate a LinearRegression model and fit it on the transformed training data. The fit() method trains the model.
5. Evaluate model performance using mean squared error on the test set. This metric is the average squared difference between the actual and predicted values.
6. Make a prediction with the fitted model on a new data sample. The sample must first pass through the same fitted PolynomialFeatures transform via transform().
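
The same steps can also be arranged as a scikit-learn Pipeline, so that the expansion is applied automatically during both fitting and prediction. This is an alternative sketch, not the listing above; the names pipe, mse_pipe, and pred are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# generate the same synthetic dataset as the main example
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=1)

# split the raw features first; the pipeline handles the expansion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# chain the transform and the model into one estimator
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
pipe.fit(X_train, y_train)

# evaluate on the held-out raw features
mse_pipe = mean_squared_error(y_test, pipe.predict(X_test))
print('Mean Squared Error: %.3f' % mse_pipe)

# raw rows can be passed directly; no manual transform() call is needed
pred = pipe.predict([[-0.6, 0.7]])
print('Predicted: %.3f' % pred[0])
```

Bundling the transform with the model removes the risk of forgetting to transform new rows, and it lets the whole chain be used with tools such as cross_val_score or GridSearchCV.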

This example shows how to extend a feature set with polynomial terms using PolynomialFeatures, which can improve the performance of linear models when the relationship between the inputs and the target is nonlinear.

See Also