The interaction_cst parameter in scikit-learn’s HistGradientBoostingRegressor allows you to control which features are allowed to interact in the trees.
Histogram-based Gradient Boosting is an efficient implementation of gradient boosting that uses binning to reduce training time and memory usage. It builds an ensemble of decision trees in a sequential manner, with each tree correcting the errors of the previous ones.
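As a quick, minimal sketch of the binning idea (the dataset and the max_bins value below are illustrative assumptions, not part of the worked example that follows), you can cap the number of histogram bins used per feature:

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

# Illustrative only: max_bins controls how many bins each feature is
# discretized into before split points are searched (default 255)
X, y = make_regression(n_samples=500, n_features=5, random_state=0)
model = HistGradientBoostingRegressor(max_bins=64, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data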
The interaction_cst parameter defines interaction constraints between features: it specifies which features may be used together when splitting nodes in the decision trees.
By default, interaction_cst is set to None, which means all features can interact. You can specify constraints as a list of lists, where each sublist contains feature indices that are allowed to interact.
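As a hedged illustration, these are the kinds of values the parameter accepts; the list-of-lists form matches the example below, while the string shortcuts ("pairwise" and "no_interactions") assume a recent scikit-learn release:

from sklearn.ensemble import HistGradientBoostingRegressor

# Default: all features may interact with each other
HistGradientBoostingRegressor(interaction_cst=None)

# Features 0 and 1 form one interaction group; features 2, 3 and 4 another
HistGradientBoostingRegressor(interaction_cst=[[0, 1], [2, 3, 4]])

# String shortcuts in recent scikit-learn releases
HistGradientBoostingRegressor(interaction_cst="pairwise")         # only pairwise interactions
HistGradientBoostingRegressor(interaction_cst="no_interactions")  # each feature splits alone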
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset with interacting features
X, y = make_regression(n_samples=1000, n_features=5, n_informative=3,
                       noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different interaction_cst values
interaction_cst_values = [None, [[0, 1], [2, 3, 4]]]
mse_scores = []
for cst in interaction_cst_values:
    model = HistGradientBoostingRegressor(interaction_cst=cst, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"interaction_cst={cst}, MSE: {mse:.3f}")
Running the example gives an output like:
interaction_cst=None, MSE: 50.119
interaction_cst=[[0, 1], [2, 3, 4]], MSE: 56.076
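The constrained model scores slightly worse here, which is not surprising: the grouping was chosen arbitrarily, so whether constraints help or hurt depends on how well they match the true relationships in the data.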
The key steps in this example are:
- Generate a synthetic regression dataset with potential feature interactions
- Split the data into train and test sets
- Train HistGradientBoostingRegressor models with different interaction_cst values
- Evaluate the mean squared error of each model on the test set
Tips for setting interaction_cst:
- Use domain knowledge to determine which features should interact
- Start with no constraints and gradually add them to see the impact on model performance (a cross-validation sketch follows this list)
- Consider the trade-off between model flexibility and interpretability
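One way to follow that incremental approach is to compare candidate constraint sets with cross-validation rather than a single train/test split. The candidate groupings below are illustrative assumptions, and the sketch regenerates the same synthetic data as the example above:

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=5, n_informative=3,
                       noise=0.1, random_state=42)

# Candidate constraint sets, from least to most restrictive (illustrative)
candidates = [None, [[0, 1], [2, 3, 4]], [[0], [1], [2], [3], [4]]]
for cst in candidates:
    model = HistGradientBoostingRegressor(interaction_cst=cst, random_state=42)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"interaction_cst={cst}, CV MSE: {-scores.mean():.3f}")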
Issues to consider:
- Overly restrictive constraints may lead to underfitting (see the sketch after this list)
- Interaction constraints can impact model performance and training time
- The effectiveness of constraints depends on the true underlying relationships in the data
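To make the underfitting risk concrete, here is a small sketch on a target built from a pure interaction term (the data-generating function is an assumption chosen to force an interaction). Forbidding features 0 and 1 from interacting should leave the constrained model with a markedly higher error, since the target cannot be approximated additively:

import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# The target is the product of the two features: a pure interaction
rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(1000, 2))
y = X[:, 0] * X[:, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Unconstrained vs. each feature confined to its own group
for cst in [None, [[0], [1]]]:
    model = HistGradientBoostingRegressor(interaction_cst=cst, random_state=42)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"interaction_cst={cst}, MSE: {mse:.4f}")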