The algorithm parameter in scikit-learn's AdaBoostClassifier determines which boosting algorithm to use.
AdaBoost (Adaptive Boosting) is an ensemble method that combines weak learners sequentially, giving more weight to misclassified samples in each iteration.
The algorithm parameter allows choosing between 'SAMME' (discrete boosting, which uses class labels) and 'SAMME.R' (real boosting, which uses class probability estimates). 'SAMME.R' generally performs better but requires base estimators to provide probability estimates.
The default value for algorithm is 'SAMME.R'. (Note that in newer scikit-learn releases this has changed: base_estimator was renamed to estimator in version 1.2, and 'SAMME.R' is deprecated from version 1.4 in favor of 'SAMME'.)
In practice, 'SAMME.R' is often preferred when using decision trees with max_depth > 1, while 'SAMME' can be useful with very shallow trees or other base estimators.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
import time
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Base estimator
base_estimator = DecisionTreeClassifier(max_depth=1)
# Train with different algorithm values
algorithms = ['SAMME', 'SAMME.R']
for alg in algorithms:
    start_time = time.time()
    ada = AdaBoostClassifier(base_estimator=base_estimator, algorithm=alg, random_state=42)
    ada.fit(X_train, y_train)
    train_time = time.time() - start_time
    y_pred = ada.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    auc = roc_auc_score(y_test, ada.predict_proba(X_test)[:, 1])
    print(f"Algorithm: {alg}")
    print(f"Accuracy: {accuracy:.3f}")
    print(f"ROC AUC: {auc:.3f}")
    print(f"Training time: {train_time:.3f} seconds\n")
Running the example gives an output like:
Algorithm: SAMME
Accuracy: 0.825
ROC AUC: 0.904
Training time: 0.141 seconds
Algorithm: SAMME.R
Accuracy: 0.825
ROC AUC: 0.888
Training time: 0.145 seconds
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train AdaBoostClassifier models with both 'SAMME' and 'SAMME.R' algorithms
- Evaluate the accuracy, ROC AUC score, and training time for each model
Some tips and heuristics for choosing the algorithm:
- Use ‘SAMME.R’ when base estimators can provide probability estimates
- Consider ‘SAMME’ for very shallow trees (stumps) or non-tree base estimators
- ‘SAMME.R’ often converges faster and may require fewer estimators
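The first tip can be checked programmatically: 'SAMME.R' needs the base estimator to expose predict_proba, while 'SAMME' only needs predict. A minimal sketch (LinearSVC is used here purely as an example of an estimator without probability estimates):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

# 'SAMME.R' requires predict_proba on the base estimator; 'SAMME' does not
for est in [DecisionTreeClassifier(max_depth=1), LinearSVC()]:
    has_proba = hasattr(est, "predict_proba")
    choice = "SAMME.R" if has_proba else "SAMME"
    print(f"{type(est).__name__}: predict_proba={has_proba} -> use {choice}")
```

Decision trees report probabilities from leaf class frequencies, so both algorithms work with them; an estimator like LinearSVC forces the 'SAMME' choice.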
Issues to consider:
- ‘SAMME.R’ requires probability estimates, which not all base estimators provide
- The performance difference between algorithms can vary based on the dataset and base estimator
- ‘SAMME’ might be more robust in some cases, especially with noisy data or weak base estimators
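Convergence behavior can be inspected with staged_score, which reports the ensemble's accuracy after each boosting round. A minimal sketch, using default constructor arguments so it runs across scikit-learn versions (where the algorithm parameter and its default have changed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_tr, y_tr)

# staged_score yields the test accuracy after each boosting round,
# showing how quickly the chosen algorithm converges
scores = list(ada.staged_score(X_te, y_te))
print(f"round 10: {scores[9]:.3f}, round 50: {scores[-1]:.3f}")
```

Plotting these per-round scores for each algorithm value on your own dataset is a quick way to see whether one option converges faster or plateaus at a higher accuracy.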