Configure GradientBoostingClassifier "random_state" Parameter

The random_state parameter in scikit-learn’s GradientBoostingClassifier is used to control the randomness of the model for reproducibility purposes.

Gradient boosting is an ensemble method that trains weak learners (decision trees) sequentially, each one correcting the errors of those before it. The random_state parameter seeds the random number generator behind the model’s random operations: row subsampling when subsample is below 1.0, feature sampling when max_features limits the features considered at each split, and the validation split used for early stopping when n_iter_no_change is set.
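As a minimal sketch of where the randomness enters, the snippet below turns on row subsampling and feature sampling explicitly. The dataset size, n_estimators=50, and the seed value 0 are arbitrary choices made for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic dataset (sizes chosen only to keep the sketch fast)
X, y = make_classification(n_samples=500, n_classes=2, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=50,
    subsample=0.8,        # each tree is fit on a random 80% of the rows
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,       # seeds both sources of randomness above
)
gb.fit(X, y)

train_accuracy = gb.score(X, y)
print(f"Trees fitted: {len(gb.estimators_)}, train accuracy: {train_accuracy:.3f}")
```

With subsample and max_features left at their defaults, the model has fewer random operations to seed, which is why different seeds often change the results only slightly.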

Setting random_state to a fixed integer makes every training run use the same sequence of random operations, so fitting on the same data gives identical results across runs. The parameter is not a tuning knob: different seeds can shift the scores slightly, but no seed makes the model systematically better or worse.
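The reproducibility claim can be checked directly. The sketch below fits two models with the same seed on the same data and compares their predicted probabilities. The helper name fit_and_predict_proba and the seed value 7 are hypothetical, chosen only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_classes=2, random_state=0)

def fit_and_predict_proba(seed):
    """Fit a fresh model with the given seed and return its class probabilities."""
    model = GradientBoostingClassifier(
        n_estimators=50, subsample=0.8, random_state=seed
    )
    model.fit(X, y)
    return model.predict_proba(X)

# Two independent fits with the same seed on the same data
proba_a = fit_and_predict_proba(7)
proba_b = fit_and_predict_proba(7)

same_seed_identical = np.array_equal(proba_a, proba_b)
print(f"Same seed gives identical probabilities: {same_seed_identical}")
```

Comparing probabilities rather than class labels is a stricter check, since two models can agree on every label while differing in their underlying scores.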

The default value for random_state is None. In that case the estimator draws from NumPy’s global random state, so the seed is not fixed and results may vary slightly across runs.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different random_state values
random_state_values = [None, 42, 123]

for rs in random_state_values:
    gb = GradientBoostingClassifier(random_state=rs)
    gb.fit(X_train, y_train)
    y_pred = gb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"random_state={rs}, Accuracy: {accuracy:.3f}")
    print(f"First few predictions: {y_pred[:5]}\n")

Running the example gives an output like the following (the random_state=None result may differ from run to run):

random_state=None, Accuracy: 0.910
First few predictions: [1 1 0 1 1]

random_state=42, Accuracy: 0.915
First few predictions: [1 1 0 1 1]

random_state=123, Accuracy: 0.910
First few predictions: [1 1 0 1 1]

The key steps in this example are:

Generate a synthetic binary classification dataset
Split the data into train and test sets
Train GradientBoostingClassifier models with different random_state values
Evaluate the accuracy of each model on the test set and compare the first few predictions

Some tips and heuristics for setting random_state:

Set random_state to a fixed value for reproducibility across runs
Using the same random_state value on the same data will give identical results
random_state is not a hyperparameter to tune; changing it alters the sequence of random operations and may shift scores slightly

Issues to consider:

Not setting random_state means results may vary slightly across runs due to randomness
Different random_state values will produce different specific predictions but similar overall performance
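One way to see how much the seed matters on a given dataset is to train with several seeds and look at the spread of test accuracy. This is a rough sketch: the five seed values and subsample=0.8 are assumptions chosen for illustration, and the spread will differ on other data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

seeds = [0, 1, 2, 3, 4]
scores = []
for seed in seeds:
    # subsample < 1.0 makes the seed affect which rows each tree sees
    gb = GradientBoostingClassifier(subsample=0.8, random_state=seed)
    gb.fit(X_train, y_train)
    scores.append(gb.score(X_test, y_test))

scores = np.array(scores)
print(f"Accuracy per seed: {np.round(scores, 3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

A small standard deviation relative to the mean suggests the reported accuracy is not an artifact of one lucky seed.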

See Also