Scikit-Learn OneHotEncoder for Data Preprocessing

OneHotEncoder is a preprocessing tool in scikit-learn that converts categorical data into a format suitable for machine learning algorithms by encoding categorical features as a one-hot numeric array.

The key parameters of OneHotEncoder include categories (the categories for each feature, inferred from the data by default), drop (whether to drop one category per feature, for example to avoid perfectly collinear columns), and sparse_output (whether the output is a SciPy sparse matrix, which is the default, or a dense NumPy array).

This encoder is appropriate for transforming categorical data in both classification and regression problems.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data
data = np.array([
    ['red', 'small'],
    ['blue', 'large'],
    ['green', 'medium'],
    ['blue', 'small'],
    ['red', 'large']
])

# Create the OneHotEncoder
encoder = OneHotEncoder()

# Fit and transform the data
encoded_data = encoder.fit_transform(data)

# Convert the sparse matrix to a dense array for better readability
encoded_data_array = encoded_data.toarray()

print(encoded_data_array)

Running the example gives an output like:

[[0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 1. 0. 0.]]

The steps are as follows:

1. A sample categorical dataset is created using NumPy. It contains two categorical features (color and size), each with three categories.
2. A OneHotEncoder instance is created with default parameters.
3. The encoder is fit on the sample data and transforms it into a one-hot encoded format in a single call to the fit_transform() method. By default, the result is a SciPy sparse matrix.
4. The sparse output is converted to a dense NumPy array with toarray() for better readability, showing the one-hot encoded representation of the original categorical data.
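
As a quick sanity check on these steps, the encoding can be reversed with inverse_transform(). This is a minimal sketch that reuses the same sample data and assumes default encoder settings.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array([
    ['red', 'small'],
    ['blue', 'large'],
    ['green', 'medium'],
    ['blue', 'small'],
    ['red', 'large']
])

encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(data)

# Map the one-hot rows back to the original category labels
decoded = encoder.inverse_transform(encoded_data)
print(decoded)

# The round trip should reproduce the input exactly
round_trip_ok = bool((decoded == data).all())
print(round_trip_ok)
```

A successful round trip confirms that no information is lost by the encoding when no categories are dropped.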

See Also