The max_features parameter in scikit-learn’s HistGradientBoostingRegressor controls the fraction of features considered when looking for the best split at each node.
HistGradientBoostingRegressor is a gradient boosting algorithm built on histogram-based decision trees. It is designed for efficiency and can handle large datasets with many features.
The max_features parameter sets the proportion of features randomly sampled when splitting a node. It is expressed as a float between 0 and 1, representing the fraction of the total number of features.
The default value for max_features is 1.0, which means all features are considered at every split. Common values range from 0.5 to 1.0, depending on the dataset’s characteristics.
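As a minimal sketch of where the parameter is set (the 0.7 value here is arbitrary and only for illustration):

```python
from sklearn.ensemble import HistGradientBoostingRegressor

# max_features defaults to 1.0 (all features considered at every split);
# here roughly 70% of the features are sampled at each split instead.
model = HistGradientBoostingRegressor(max_features=0.7, random_state=42)
```

The complete example below compares several max_features values end to end: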
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_features values
max_features_values = [0.5, 0.7, 0.9, 1.0]
mse_scores = []

for mf in max_features_values:
    hgbr = HistGradientBoostingRegressor(max_features=mf, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_features={mf}, MSE: {mse:.3f}")
```
Running the example gives an output like:
```
max_features=0.5, MSE: 2629.550
max_features=0.7, MSE: 2984.833
max_features=0.9, MSE: 3168.874
max_features=1.0, MSE: 3073.589
```
The key steps in this example are:
- Generate a synthetic regression dataset with 20 features
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different max_features values
- Evaluate the Mean Squared Error (MSE) of each model on the test set
Some tips and heuristics for setting max_features:
- Start with the default value of 1.0 and decrease it if the model overfits; a cross-validation sketch for tuning the value follows this list
- Lower values inject more randomness into tree construction and can help prevent overfitting
- Higher values let each split consider more features, which often improves fit on the training data but raises the risk of overfitting
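One way to apply these tips is to tune max_features with cross-validation rather than fixing it by hand. A minimal sketch using GridSearchCV on the same synthetic data; the candidate values in the grid are illustrative, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Candidate fractions of features to consider at each split (illustrative values)
param_grid = {"max_features": [0.3, 0.5, 0.7, 0.9, 1.0]}

search = GridSearchCV(
    HistGradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print("Best max_features:", search.best_params_["max_features"])
print("Best CV MSE:", -search.best_score_)
```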
Issues to consider:
- The optimal value depends on the dataset’s characteristics and how predictive signal is spread across the features
- Lower values mean fewer candidate features are evaluated at each split, so training is typically no slower and often slightly faster
- Very low values might lead to underfitting if important features are frequently excluded; comparing train and test error across a range of values, as sketched after this list, helps spot this
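To check whether a given max_features value is underfitting or overfitting, one simple diagnostic is to compare training and test error side by side. A minimal sketch reusing the same synthetic split as above; the three fractions are chosen only to illustrate the extremes:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for mf in [0.1, 0.5, 1.0]:  # very low, moderate, and default fractions
    model = HistGradientBoostingRegressor(max_features=mf, random_state=42)
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # A large gap between train and test MSE suggests overfitting;
    # high error on both suggests underfitting.
    print(f"max_features={mf}: train MSE={train_mse:.1f}, test MSE={test_mse:.1f}")
```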