The max_features parameter in scikit-learn’s HistGradientBoostingRegressor controls the fraction of features considered when looking for the best split at each node.
HistGradientBoostingRegressor is a gradient boosting algorithm built on histogram-based decision trees. It is designed for efficiency and can handle large datasets with many features.
The max_features parameter sets the proportion of features randomly sampled when splitting a node. It is expressed as a float between 0 and 1, representing the fraction of the total number of features.
The default value for max_features is 1.0, which means all features are considered at every split. Common values range from 0.5 to 1.0, depending on the dataset’s characteristics.
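As a minimal sketch of where the parameter is set (the 0.7 value here is arbitrary and only for illustration):

```python
from sklearn.ensemble import HistGradientBoostingRegressor

# max_features defaults to 1.0 (all features considered at every split);
# here roughly 70% of the features are sampled at each split instead.
model = HistGradientBoostingRegressor(max_features=0.7, random_state=42)
```

The complete example below compares several max_features values end to end: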
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different max_features values
max_features_values = [0.5, 0.7, 0.9, 1.0]
mse_scores = []

for mf in max_features_values:
    hgbr = HistGradientBoostingRegressor(max_features=mf, random_state=42)
    hgbr.fit(X_train, y_train)
    y_pred = hgbr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"max_features={mf}, MSE: {mse:.3f}")
```
Running the example gives an output like:
```
max_features=0.5, MSE: 2629.550
max_features=0.7, MSE: 2984.833
max_features=0.9, MSE: 3168.874
max_features=1.0, MSE: 3073.589
```
The key steps in this example are:
- Generate a synthetic regression dataset with 20 features
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different max_features values
- Evaluate the Mean Squared Error (MSE) of each model on the test set
Some tips and heuristics for setting max_features:
- Start with the default value of 1.0 and decrease it if the model overfits; a cross-validation sketch for tuning the value follows this list
- Lower values inject more randomness into tree construction and can help prevent overfitting
- Higher values let each split consider more features, which often improves fit on the training data but raises the risk of overfitting
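One way to apply these tips is to tune max_features with cross-validation rather than fixing it by hand. A minimal sketch using GridSearchCV on the same synthetic data; the candidate values in the grid are illustrative, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Candidate fractions of features to consider at each split (illustrative values)
param_grid = {"max_features": [0.3, 0.5, 0.7, 0.9, 1.0]}

search = GridSearchCV(
    HistGradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

print("Best max_features:", search.best_params_["max_features"])
print("Best CV MSE:", -search.best_score_)
```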
Issues to consider:
- The optimal value depends on the dataset’s characteristics and how predictive signal is spread across the features
- Lower values mean fewer candidate features are evaluated at each split, so training is typically no slower and often slightly faster
- Very low values might lead to underfitting if important features are frequently excluded; comparing train and test error across a range of values, as sketched after this list, helps spot this
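To check whether a given max_features value is underfitting or overfitting, one simple diagnostic is to compare training and test error side by side. A minimal sketch reusing the same synthetic split as above; the three fractions are chosen only to illustrate the extremes:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for mf in [0.1, 0.5, 1.0]:  # very low, moderate, and default fractions
    model = HistGradientBoostingRegressor(max_features=mf, random_state=42)
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # A large gap between train and test MSE suggests overfitting;
    # high error on both suggests underfitting.
    print(f"max_features={mf}: train MSE={train_mse:.1f}, test MSE={test_mse:.1f}")
```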