The RandomForestClassifier in scikit-learn is an ensemble learning algorithm that combines multiple decision trees to make predictions. It builds a forest of decision trees, each trained on a bootstrap sample of the rows and considering a random subset of the features at each split, and aggregates their predictions by averaging the trees' class probabilities to make the final classification.
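To make the aggregation concrete, here is a minimal sketch (the variable name mean_of_trees is ours, not scikit-learn's) showing that the forest's predict_proba equals the average of the per-tree probability estimates exposed through the estimators_ attribute:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# Fit a small forest on a toy problem
X, y = make_classification(n_samples=200, random_state=0)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
# The forest's class probabilities are the mean of the per-tree probabilities
mean_of_trees = np.mean([tree.predict_proba(X[:5]) for tree in rf.estimators_], axis=0)
print(np.allclose(mean_of_trees, rf.predict_proba(X[:5])))  # True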
The feature_names_in_ attribute of a fitted RandomForestClassifier is a NumPy array holding the names of the input features seen during fit. The attribute is defined only when the training data carries string feature names, most commonly a pandas DataFrame with named columns, and it lets you map the importances returned by the feature_importances_ attribute back to the original feature names.
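As a quick sketch of this behavior (the column names age and income are invented for illustration), the attribute exists only when the estimator was fitted on data with string column names:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
X = pd.DataFrame({'age': [25, 32, 47, 51], 'income': [40, 60, 80, 90]})
y = [0, 0, 1, 1]
# Fitting on a DataFrame captures the column names as a NumPy array
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(rf.feature_names_in_)  # ['age' 'income']
# Fitting on a plain NumPy array does not set the attribute at all
rf_np = RandomForestClassifier(n_estimators=10, random_state=0).fit(X.to_numpy(), y)
print(hasattr(rf_np, 'feature_names_in_'))  # False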
Accessing the feature_names_in_ attribute is helpful for interpreting and communicating the results of a trained RandomForestClassifier. By pairing the feature names with their corresponding importances, you can see which features have the greatest influence on the model’s predictions, which is valuable for feature selection, model interpretation, and explaining the model’s behavior to stakeholders. The example below walks through this end to end:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
# Generate a synthetic binary classification dataset with named features
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0,
                           random_state=42, shuffle=False)
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
# Convert the dataset to a pandas DataFrame with named columns
X_df = pd.DataFrame(X, columns=feature_names)
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_df, y, test_size=0.2, random_state=42)
# Initialize and fit a RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Access the feature_names_in_ attribute and print the feature names
print(f"Feature names: {rf.feature_names_in_}")
# Get the feature importances and combine them with the feature names
importances = rf.feature_importances_
feature_importances = dict(zip(rf.feature_names_in_, importances))
print(f"Feature importances: {feature_importances}")
Running the example gives an output like:
Feature names: ['feature_0' 'feature_1' 'feature_2' 'feature_3']
Feature importances: {'feature_0': 0.22625340621203915, 'feature_1': 0.6545988761919419, 'feature_2': 0.061570172038100064, 'feature_3': 0.057577545557918754}
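Because feature_names_in_ is a NumPy array, it pairs naturally with pandas; for instance, continuing from the fitted rf above, you could rank the features by importance (the name importance_series is just illustrative):
import pandas as pd
# Pair each importance with its captured name and sort from most to least important
importance_series = pd.Series(rf.feature_importances_, index=rf.feature_names_in_)
print(importance_series.sort_values(ascending=False))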
The key steps in this example are:
- Generate a synthetic binary classification dataset using make_classification and create a list of feature names.
- Convert the dataset to a pandas DataFrame with the specified feature names.
- Split the data into train and test sets.
- Initialize a RandomForestClassifier and fit it on the training data. Since the input is a DataFrame with named columns, there’s no need to pass the feature names explicitly.
- Access the feature_names_in_ attribute to get the array of feature names seen during fit and print them.
- Get the feature importances using the feature_importances_ attribute and combine them with the feature names in a dictionary for easy interpretation.
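Finally, picking up the feature-selection use case mentioned earlier, the importances can drive automated selection with scikit-learn's SelectFromModel. The sketch below continues from X_train and y_train above and uses the default threshold, which keeps features whose importance is at or above the mean importance:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
# Fit a fresh forest inside the selector and keep above-mean-importance features
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
selector.fit(X_train, y_train)
print(selector.get_feature_names_out())  # names of the retained features
print(selector.transform(X_train).shape)  # reduced training matrix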