Scikit-Learn DictVectorizer for Feature Extraction

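DictVectorizer converts lists of feature dictionaries (mappings from feature names to values) into numeric feature matrices: a SciPy sparse matrix by default, or a NumPy array when sparse=False is set.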

It is particularly useful for preparing data with categorical features for machine learning algorithms.

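The vectorizer handles categorical and numerical data in the same dictionary: string values are one-hot encoded into one column per feature/value pair, while numeric values are passed through unchanged. This makes it a versatile preprocessing step for problems such as classification and regression.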

from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# create a dataset with categorical and numerical features
data = [
    {'feature1': 'cat', 'feature2': 1},
    {'feature1': 'dog', 'feature2': 2},
    {'feature1': 'cat', 'feature2': 3},
    {'feature1': 'dog', 'feature2': 4}
]
labels = [0, 1, 0, 1]

# split into train and test sets
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size=0.5, random_state=1)

# create a DictVectorizer
vectorizer = DictVectorizer(sparse=False)

# fit the vectorizer on the training data
vectorizer.fit(data_train)

# transform both the training and test sets
data_train_transformed = vectorizer.transform(data_train)
data_test_transformed = vectorizer.transform(data_test)

# show the transformed data
print(data_train_transformed)
print(data_test_transformed)

Running the example gives an output like:

[[1. 0. 1.]
 [0. 1. 2.]]
[[0. 1. 4.]
 [1. 0. 3.]]

The steps of the example above are as follows:

  1. First, a small dataset containing one categorical and one numerical feature is defined by hand. Each sample is represented as a dictionary. The dataset is then split into training and test sets using train_test_split().

  2. Next, a DictVectorizer instance is created to handle the feature transformation. The vectorizer is fit on the training data using the fit() method, which learns the feature mappings.

  3. The training and test datasets are transformed into numerical arrays using the transform() method of the fitted vectorizer. The categorical feature is one-hot encoded into one column per category seen during fitting, and the numerical feature is copied through as is.

  4. The transformed training and test arrays are printed to illustrate the outcome of using DictVectorizer.
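
Because the mapping is learned only from the training data, it is worth knowing what happens when transform() sees a category that was not present during fitting. The sketch below illustrates this with a hypothetical 'bird' value that is not part of the original dataset. DictVectorizer silently ignores features it did not encounter during fit, so the one-hot columns for that row stay at zero.

```python
from sklearn.feature_extraction import DictVectorizer

train_rows = [
    {'feature1': 'cat', 'feature2': 1},
    {'feature1': 'dog', 'feature2': 2},
]

vectorizer = DictVectorizer(sparse=False)
vectorizer.fit(train_rows)

# 'bird' is a hypothetical category that was never seen during fit
new_rows = [{'feature1': 'bird', 'feature2': 5}]
X_new = vectorizer.transform(new_rows)

# columns are feature1=cat, feature1=dog, feature2
print(X_new)
```

The row comes out with zeros in both feature1 columns and 5 in the feature2 column. No error is raised, so unseen categories may need to be checked for separately if they matter for the model.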

This example demonstrates how to use DictVectorizer to convert a dataset with mixed feature types into a numeric matrix. It is especially useful when the samples are already stored as dictionaries and include categorical variables that need to be encoded numerically.
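
The example above passes sparse=False to get a NumPy array that prints nicely. As a rough sketch of the default behaviour, the snippet below leaves that argument out, in which case the vectorizer returns a SciPy sparse matrix that can be converted with toarray() when a dense array is needed. The variable names are illustrative.

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

records = [
    {'feature1': 'cat', 'feature2': 1},
    {'feature1': 'dog', 'feature2': 2},
]

# sparse=True is the default, so the result is a SciPy sparse matrix
sparse_vectorizer = DictVectorizer()
X_sparse = sparse_vectorizer.fit_transform(records)

print(sparse.issparse(X_sparse))
print(X_sparse.shape)

# convert to a dense NumPy array only when it is actually needed
X_dense = X_sparse.toarray()
print(X_dense)
```

Sparse output tends to be the better choice when a categorical feature has many distinct values, since most entries of the one-hot columns are zero.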


