GradientBoostingClassifier is a powerful algorithm for classification tasks. Key hyperparameters include n_estimators (number of boosting stages), learning_rate (step size shrinkage), and max_depth (maximum depth of the individual trees).
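As a minimal sketch of setting these hyperparameters explicitly (the values here are illustrative defaults, not tuned recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic dataset just to demonstrate the hyperparameters
X, y = make_classification(n_samples=200, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=100,   # number of boosting stages
    learning_rate=0.1,  # step size shrinkage
    max_depth=3,        # maximum depth of the individual trees
    random_state=0,
)
clf.fit(X, y)
print(f"Training accuracy: {clf.score(X, y):.3f}")
```

In practice these three hyperparameters interact: a smaller learning_rate usually needs more boosting stages to reach the same fit.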
HistGradientBoostingClassifier is a more recent addition to scikit-learn, designed to be more efficient on large datasets. Key hyperparameters include max_iter (number of boosting iterations), learning_rate, and max_depth.
The main difference is that HistGradientBoostingClassifier uses histogram-based techniques to speed up training and reduce memory usage, making it suitable for larger datasets.
GradientBoostingClassifier can be slower and more memory-intensive, but it is well-understood and widely used. HistGradientBoostingClassifier offers speed and memory efficiency on larger datasets, at the cost of a slightly different hyperparameter interface (for example, max_iter in place of n_estimators).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score
# Generate synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.7, 0.3], random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and evaluate GradientBoostingClassifier with default hyperparameters
gbc = GradientBoostingClassifier(random_state=42)
gbc.fit(X_train, y_train)
y_pred_gbc = gbc.predict(X_test)
print(f"GradientBoostingClassifier accuracy: {accuracy_score(y_test, y_pred_gbc):.3f}")
print(f"GradientBoostingClassifier F1 score: {f1_score(y_test, y_pred_gbc):.3f}")
# Fit and evaluate HistGradientBoostingClassifier with default hyperparameters
hgbc = HistGradientBoostingClassifier(random_state=42)
hgbc.fit(X_train, y_train)
y_pred_hgbc = hgbc.predict(X_test)
print(f"\nHistGradientBoostingClassifier accuracy: {accuracy_score(y_test, y_pred_hgbc):.3f}")
print(f"HistGradientBoostingClassifier F1 score: {f1_score(y_test, y_pred_hgbc):.3f}")
GradientBoostingClassifier accuracy: 0.910
GradientBoostingClassifier F1 score: 0.836
HistGradientBoostingClassifier accuracy: 0.910
HistGradientBoostingClassifier F1 score: 0.833
The steps are as follows:
- Generate a synthetic binary classification dataset using make_classification.
- Split the data into training and test sets using train_test_split.
- Instantiate GradientBoostingClassifier with default hyperparameters, fit it on the training data, and evaluate its performance on the test set.
- Instantiate HistGradientBoostingClassifier with default hyperparameters, fit it on the training data, and evaluate its performance on the test set.
- Compare the test-set performance (accuracy and F1 score) of both models.