
Scikit-Learn "GradientBoostingRegressor" versus "HistGradientBoostingRegressor"

Gradient boosting is a powerful technique for regression tasks. In scikit-learn, GradientBoostingRegressor and HistGradientBoostingRegressor offer robust implementations of this technique, each with distinct advantages.

GradientBoostingRegressor implements the classic gradient boosting algorithm, building regression trees sequentially on the exact feature values. Key hyperparameters include n_estimators (number of boosting stages), learning_rate (shrinkage factor), and max_depth (maximum depth of the individual regression estimators). This model is well-suited to smaller datasets, where detailed parameter tuning can yield significant performance gains.
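As a quick illustration of such tuning, the sketch below runs GridSearchCV over the three hyperparameters named above on a small synthetic dataset. The grid values and dataset size are illustrative assumptions, not recommendations.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset where an exhaustive search is affordable
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

# Grid over the three key hyperparameters (values are illustrative)
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid, cv=3, scoring="r2")
search.fit(X, y)
print(f"Best params: {search.best_params_}, CV R^2: {search.best_score_:.3f}")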

On the other hand, HistGradientBoostingRegressor employs histogram-based boosting: continuous input features are binned into integer-valued histograms before each tree is grown, which greatly reduces the cost of finding splits. Key hyperparameters include max_iter (number of boosting iterations), learning_rate (shrinkage factor), and max_depth (maximum depth of the individual regression estimators). This model excels on larger datasets due to its faster computation.
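The minimal sketch below configures these hyperparameters explicitly; the specific values are illustrative. Note that max_iter plays the role that n_estimators plays in GradientBoostingRegressor.

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=10_000, n_features=20, noise=0.1, random_state=42)

hgb = HistGradientBoostingRegressor(
    max_iter=200,        # number of boosting iterations (illustrative value)
    learning_rate=0.1,   # shrinkage factor
    max_depth=3,         # maximum tree depth (default is None, i.e. unlimited)
    random_state=42,
)
hgb.fit(X, y)
print(f"Training R^2: {hgb.score(X, y):.3f}")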

The primary difference between the two is computational efficiency. HistGradientBoostingRegressor is optimized for speed and scales to datasets with tens of thousands of samples or more, while GradientBoostingRegressor remains a solid choice on smaller datasets, where training is fast enough that careful parameter tuning is practical.
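A simple way to see this difference is to time each model's fit call on a larger dataset. The sketch below uses time.perf_counter for wall-clock timing; the dataset size is an illustrative assumption, and exact timings depend on your machine, but the histogram-based model is typically much faster at this scale.

import time
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor

# A larger synthetic dataset where the speed difference becomes visible
X, y = make_regression(n_samples=20_000, n_features=20, noise=0.1, random_state=42)

for model in (GradientBoostingRegressor(random_state=42),
              HistGradientBoostingRegressor(random_state=42)):
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start
    print(f"{type(model).__name__}: {elapsed:.2f}s to fit")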

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.metrics import r2_score

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate GradientBoostingRegressor with default hyperparameters
gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train, y_train)
y_pred_gbr = gbr.predict(X_test)
print(f"GradientBoostingRegressor R^2 score: {r2_score(y_test, y_pred_gbr):.3f}")

# Fit and evaluate HistGradientBoostingRegressor with default hyperparameters
hgb = HistGradientBoostingRegressor(random_state=42)
hgb.fit(X_train, y_train)
y_pred_hgb = hgb.predict(X_test)
print(f"\nHistGradientBoostingRegressor R^2 score: {r2_score(y_test, y_pred_hgb):.3f}")

Running the example gives an output like:

GradientBoostingRegressor R^2 score: 0.921

HistGradientBoostingRegressor R^2 score: 0.921

The steps are as follows:

  1. Generate a synthetic regression dataset using make_regression.
  2. Split the data into training and test sets using train_test_split.
  3. Instantiate GradientBoostingRegressor with default hyperparameters, fit it on the training data, and evaluate its performance on the test set.
  4. Instantiate HistGradientBoostingRegressor with default hyperparameters, fit it on the training data, and evaluate its performance on the test set.
  5. Compare the test set performance (R^2 score) of both models and discuss their computational efficiency; a cross-validated version of this comparison is sketched after this list.
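As a more robust version of step 5, the sketch below uses cross_validate to compare both the R^2 score and the fit time of each model across five folds. The fold count is an illustrative choice, and the exact numbers will vary from run to run and machine to machine.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

for model in (GradientBoostingRegressor(random_state=42),
              HistGradientBoostingRegressor(random_state=42)):
    cv = cross_validate(model, X, y, cv=5, scoring="r2")
    print(f"{type(model).__name__}: "
          f"R^2 = {cv['test_score'].mean():.3f} +/- {cv['test_score'].std():.3f}, "
          f"mean fit time = {cv['fit_time'].mean():.2f}s")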


See Also