Scikit-Learn OrdinalEncoder for Data Preprocessing

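OrdinalEncoder transforms categorical features into integer codes, one code per category. By default the codes follow the sorted (alphabetical) order of the categories, so a meaningful ordinal relationship is preserved only when the category order is specified explicitly. This kind of encoding is useful because most machine learning models require numerical input.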

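The key parameter of OrdinalEncoder is categories, which defines the order of the categories to be encoded. If set to 'auto' (the default), the encoder determines the categories from the training data and sorts them. To impose a specific order, pass a list of category lists, one list per feature.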
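As a minimal sketch of setting categories explicitly, the snippet below assumes three labels with the intended order low < medium < high. The variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# One categorical feature with three possible labels
data = np.array([['low'], ['medium'], ['high'], ['medium'], ['low']])

# Pass the desired order explicitly: one inner list per feature column
ordered_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordered_codes = ordered_encoder.fit_transform(data)

# low -> 0, medium -> 1, high -> 2
print(ordered_codes.ravel())  # [0. 1. 2. 1. 0.]
```

With the order given this way, larger codes correspond to higher levels, which is what a model that treats the feature as ordered would expect.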

OrdinalEncoder is suitable for preprocessing steps in various machine learning tasks, including classification and regression.

from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# Sample data with categorical features
data = np.array([['low'], ['medium'], ['high'], ['medium'], ['low']])

# Initialize OrdinalEncoder
encoder = OrdinalEncoder()

# Fit and transform the data
encoded_data = encoder.fit_transform(data)

print("Encoded Data:\n", encoded_data)

# Inverse transform to check the original values
original_data = encoder.inverse_transform(encoded_data)
print("Original Data:\n", original_data)

Running the example gives an output like:

Encoded Data:
 [[1.]
 [2.]
 [0.]
 [2.]
 [1.]]
Original Data:
 [['low']
 ['medium']
 ['high']
 ['medium']
 ['low']]

The steps are as follows:

A synthetic dataset with a single categorical feature is created using numpy. This dataset includes the ordinal categories 'low', 'medium', and 'high'.
The OrdinalEncoder class is instantiated. No parameters are specified, so it determines the categories from the data and sorts them alphabetically.
The fit_transform() method converts the categorical values into numerical values. Each unique category is assigned a unique integer code, returned as a float by default: here 'high' becomes 0, 'low' becomes 1, and 'medium' becomes 2.
The inverse_transform() method maps the encoded values back to their original categories to verify the encoding.
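To see which integer each category received, the fitted encoder's categories_ attribute can be inspected. The following is a small sketch using the same sample data; the position of a category in the array is its code.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

data = np.array([['low'], ['medium'], ['high'], ['medium'], ['low']])

# Default settings: categories are learned from the data and sorted
default_encoder = OrdinalEncoder()
default_encoder.fit(data)

# categories_ holds one array per feature; index in the array == assigned code
learned = default_encoder.categories_[0]
mapping = {str(category): code for code, category in enumerate(learned)}
print(mapping)
```

Here the mapping reflects alphabetical order ('high', 'low', 'medium'), not the semantic order of the labels.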

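This example demonstrates how to use OrdinalEncoder to convert categorical features into numerical values. Note that with the default settings the codes follow alphabetical order ('high' < 'low' < 'medium'), not the natural order of the labels. When the order matters to the model, specify it through the categories parameter so that the encoded values reflect it.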

See Also