The `random_state` parameter in scikit-learn's `HistGradientBoostingRegressor` controls the randomness of the model, ensuring reproducibility of results.

`HistGradientBoostingRegressor` is a gradient boosting algorithm that uses histogram-based decision trees. It's designed for efficiency and performance, particularly on large datasets.
The `random_state` parameter seeds the pseudo-random number generator used by the model. In `HistGradientBoostingRegressor` this affects the randomized parts of the algorithm: the subsampling used when building histogram bins on very large datasets, and the train/validation split used when early stopping is enabled.
By default, `random_state` is set to `None`, which means the model will use a different random seed each time it's run. Setting it to an integer value ensures consistent results across multiple runs.
Common values for `random_state` are arbitrary integers, such as 42, 0, or 123. The specific value doesn't matter as long as it's kept consistent across runs for reproducibility.
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different random_state values
random_state_values = [None, 0, 42, 123]
mse_scores = []
for rs in random_state_values:
    model = HistGradientBoostingRegressor(random_state=rs)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"random_state={rs}, MSE: {mse:.4f}")

# Train multiple times with random_state=None
mse_scores_none = []
for _ in range(3):
    model = HistGradientBoostingRegressor(random_state=None)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores_none.append(mse)
    print(f"random_state=None (repeated), MSE: {mse:.4f}")
```
Running the example gives an output like:

```
random_state=None, MSE: 3073.5886
random_state=0, MSE: 3073.5886
random_state=42, MSE: 3073.5886
random_state=123, MSE: 3073.5886
random_state=None (repeated), MSE: 3073.5886
random_state=None (repeated), MSE: 3073.5886
random_state=None (repeated), MSE: 3073.5886
```
The key steps in this example are:
- Generate a synthetic regression dataset
- Split the data into train and test sets
- Train `HistGradientBoostingRegressor` models with different `random_state` values
- Evaluate the mean squared error (MSE) of each model on the test set
- Check whether results vary across repeated runs with `random_state=None`
Some tips and heuristics for setting `random_state`:
- Use a fixed integer value for reproducibility in research or production environments
- Set `random_state=None` for randomized behavior, which can be useful in ensemble methods
- Keep the `random_state` consistent across all randomized components in your pipeline
Issues to consider:
- Different `random_state` values may lead to slightly different model performance
- Using a fixed `random_state` might hide the model's sensitivity to initial conditions
- In production, consider using different `random_state` values for cross-validation splits and the final model