RandomizedSearchCV is a useful tool in scikit-learn for performing hyperparameter optimization by sampling from distributions of hyperparameter values.
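The feature_names_in_ attribute of a fitted RandomizedSearchCV object contains the names of the features seen during fitting, stored as a NumPy array of strings. It is only defined when the input carries string feature names (for example, a pandas DataFrame), refit is enabled so that best_estimator_ exists, and that best estimator itself exposes feature_names_in_. This attribute is useful for keeping track of the input features, especially when working with datasets that have many features or when feature selection techniques are applied.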
Accessing feature_names_in_ allows you to retrieve the feature names for further analysis, feature importance calculations, or for creating visualizations and reports that require feature names.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
import pandas as pd
# Generate a random classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_redundant=2, random_state=42)
# Create a DataFrame with named features
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)
# Set up a RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
# Define hyperparameter distributions to sample from
param_dist = {
    'n_estimators': randint(5, 10),
    'max_depth': [3, 5, 10, None],
    'min_samples_split': randint(2, 10),
}
# Run random search with 5-fold cross-validation
random_search = RandomizedSearchCV(rf, param_distributions=param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X, y)
# Access feature_names_in_ attribute
feature_names_in = random_search.feature_names_in_
# Print feature names
print("Feature names:")
print(feature_names_in)
Running the example gives an output like:
Feature names:
['feature_0' 'feature_1' 'feature_2' 'feature_3' 'feature_4' 'feature_5'
 'feature_6' 'feature_7' 'feature_8' 'feature_9']
The example can be broken down into the following steps:
- Generate a synthetic classification dataset using make_classification and create a pandas DataFrame with named features.
- Set up a RandomForestClassifier and define distributions to sample hyperparameters from.
- Run RandomizedSearchCV with the classifier, hyperparameter distributions, 10 iterations, and 5-fold cross-validation.
- After fitting, access the feature_names_in_ attribute from the random_search object.
- Print the feature_names_in_ attribute to display the feature names used during fitting.
By retrieving the feature_names_in_ attribute, you can access the names of the features that RandomizedSearchCV saw during fitting, which helps with further analysis and interpretation of the results.