The random_state parameter in scikit-learn's HistGradientBoostingClassifier controls the randomness of the model's training process.
Histogram-based gradient boosting is an efficient implementation of gradient boosting that uses binning to speed up the training process. It builds an ensemble of decision trees sequentially, with each tree correcting the errors of the previous ones.
The random_state parameter ensures reproducibility of the model's results. When set to a specific integer, it guarantees that the same sequence of random numbers is generated, leading to consistent model behavior across different runs.
By default, random_state is None, which means the estimator falls back on NumPy's global random number generator, so results can vary between runs. Common practice is to set it to a fixed integer for reproducibility.
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train models with different random_state values
random_states = [None, 42, 100, 200]
for rs in random_states:
    model = HistGradientBoostingClassifier(random_state=rs, max_iter=100)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"random_state={rs}, Accuracy: {accuracy:.4f}")

# Train multiple models with the same random_state
print("\nTraining multiple models with random_state=42:")
for _ in range(3):
    model = HistGradientBoostingClassifier(random_state=42, max_iter=100)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")
```
Running this example produces output similar to:
```
random_state=None, Accuracy: 0.9150
random_state=42, Accuracy: 0.9150
random_state=100, Accuracy: 0.9150
random_state=200, Accuracy: 0.9150

Training multiple models with random_state=42:
Accuracy: 0.9150
Accuracy: 0.9150
Accuracy: 0.9150
```
The key steps in this example are:
- Generate a synthetic binary classification dataset
- Split the data into train and test sets
- Train HistGradientBoostingClassifier models with different random_state values
- Evaluate the accuracy of each model on the test set
- Demonstrate reproducibility by training multiple models with the same random_state
Tips for using random_state:
- Set random_state to a fixed integer for reproducible results
- Use different random_state values to assess model stability
- Keep the same random_state across your entire machine learning pipeline for full reproducibility
Considerations:
- Different random_state values may lead to slightly different model performances. Note, however, that in HistGradientBoostingClassifier randomness only enters training through binning subsampling (on datasets larger than 10,000 samples) and the early-stopping validation split; with small datasets and early stopping disabled, the model is deterministic, which is why the accuracies above are identical across seeds.
- Reproducibility is crucial for debugging, result verification, and scientific experiments
- In production environments, you might want to use different random_state values for model diversity