
Scikit-Learn "GradientBoostingClassifier" versus "HistGradientBoostingClassifier"

GradientBoostingClassifier is a powerful algorithm for classification tasks: it builds an ensemble of shallow decision trees sequentially, with each tree correcting the errors of the ones before it. Key hyperparameters include n_estimators (number of boosting stages), learning_rate (step-size shrinkage applied to each tree's contribution), and max_depth (maximum depth of the individual trees).
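
As a quick sketch, these hyperparameters are set directly in the constructor (the values below are arbitrary illustrations, not tuned recommendations):

from sklearn.ensemble import GradientBoostingClassifier

# Illustrative, untuned values
gbc = GradientBoostingClassifier(
    n_estimators=200,    # number of boosting stages
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    max_depth=3,         # depth of each individual tree
    random_state=42,
)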

HistGradientBoostingClassifier is a more recent addition to scikit-learn, inspired by LightGBM and designed to be far more efficient on large datasets. Key hyperparameters include max_iter (number of boosting iterations, the counterpart of n_estimators), learning_rate, and max_depth.
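
The equivalent sketch for HistGradientBoostingClassifier (again with arbitrary, untuned values) looks almost identical, with max_iter playing the role of n_estimators:

from sklearn.ensemble import HistGradientBoostingClassifier

# Illustrative, untuned values
hgbc = HistGradientBoostingClassifier(
    max_iter=200,        # number of boosting iterations
    learning_rate=0.05,  # shrinkage applied at each iteration
    max_depth=3,         # maximum tree depth (default is None, i.e. unlimited)
    random_state=42,
)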

The main difference is that HistGradientBoostingClassifier bins continuous input features into integer-valued histograms (at most 255 bins per feature) before searching for split points, which drastically cuts the number of candidate splits to evaluate. This speeds up training and reduces memory usage, making it well suited to datasets with tens of thousands of samples or more.
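
The granularity of the binning is controlled by the max_bins parameter (default and maximum is 255). A minimal sketch of adjusting it, with an illustrative value:

from sklearn.ensemble import HistGradientBoostingClassifier

# Coarser binning: fewer candidate split points per feature, so faster
# training, at a possible small cost in accuracy (128 is illustrative)
hgbc = HistGradientBoostingClassifier(max_bins=128, random_state=42)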

GradientBoostingClassifier can be slower and more memory-intensive on large datasets, but it is well understood, widely used, and remains a reasonable default for small datasets. HistGradientBoostingClassifier trades exact split finding for binned approximations, which in practice rarely hurts accuracy, and it additionally offers native support for missing values.
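
To see the speed difference for yourself, here is a rough timing sketch on a larger synthetic dataset; the exact numbers will depend on your hardware and scikit-learn version:

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier

# Larger dataset, where the histogram-based approach pays off
X, y = make_classification(n_samples=50_000, n_features=20, random_state=42)

for name, model in [
    ("GradientBoostingClassifier", GradientBoostingClassifier(random_state=42)),
    ("HistGradientBoostingClassifier", HistGradientBoostingClassifier(random_state=42)),
]:
    start = time.perf_counter()
    model.fit(X, y)
    print(f"{name}: {time.perf_counter() - start:.1f}s to fit")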

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score

# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.7, 0.3], random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate GradientBoostingClassifier with default hyperparameters
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train, y_train)
y_pred_gbc = gbc.predict(X_test)
print(f"GradientBoostingClassifier accuracy: {accuracy_score(y_test, y_pred_gbc):.3f}")
print(f"GradientBoostingClassifier F1 score: {f1_score(y_test, y_pred_gbc):.3f}")

# Fit and evaluate HistGradientBoostingClassifier with default hyperparameters
hgbc = HistGradientBoostingClassifier(random_state=42)
hgbc.fit(X_train, y_train)
y_pred_hgbc = hgbc.predict(X_test)
print(f"\nHistGradientBoostingClassifier accuracy: {accuracy_score(y_test, y_pred_hgbc):.3f}")
print(f"HistGradientBoostingClassifier F1 score: {f1_score(y_test, y_pred_hgbc):.3f}")
GradientBoostingClassifier accuracy: 0.910
GradientBoostingClassifier F1 score: 0.836

HistGradientBoostingClassifier accuracy: 0.910
HistGradientBoostingClassifier F1 score: 0.833

The steps are as follows:

  1. Generate a synthetic binary classification dataset using make_classification.
  2. Split the data into training and test sets using train_test_split.
  3. Instantiate GradientBoostingClassifier with default hyperparameters, fit it on the training data, and evaluate its performance on the test set.
  4. Instantiate HistGradientBoostingClassifier with default hyperparameters, fit it on the training data, and evaluate its performance on the test set.
  5. Compare the test set performance (accuracy and F1 score) of both models; a cross-validated variant of this comparison is sketched below.
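
Because a single train/test split can be noisy, a sketch using 5-fold cross-validation (cross_val_score with the F1 scorer) gives a more stable comparison:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Same synthetic dataset as above
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.7, 0.3], random_state=42)

for name, model in [
    ("GradientBoostingClassifier", GradientBoostingClassifier(random_state=42)),
    ("HistGradientBoostingClassifier", HistGradientBoostingClassifier(random_state=42)),
]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")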

