Configure DecisionTreeRegressor "random_state" Parameter

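The random_state parameter in scikit-learn’s DecisionTreeRegressor seeds the random number generator used while the tree is built. Features are randomly permuted at each split, so the seed decides which split is chosen when several candidates improve the criterion equally, which features are considered when max_features is smaller than the number of features, and which thresholds are drawn when splitter="random".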

Setting random_state to a fixed integer produces the same tree each time the model is trained on the same data, which makes results reproducible.
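
As a minimal sketch of this guarantee, the snippet below fits two trees with the same seed and checks that their predictions match. It uses splitter="random" (not the default) so that the seed actually influences how the tree is built; the dataset sizes and the seed value 0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Small synthetic dataset; sizes and seed are illustrative assumptions
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# splitter="random" draws split thresholds at random, so the seed matters
tree_a = DecisionTreeRegressor(splitter="random", random_state=0).fit(X, y)
tree_b = DecisionTreeRegressor(splitter="random", random_state=0).fit(X, y)

# Same seed, same data: the two trees make identical predictions
same_predictions = np.array_equal(tree_a.predict(X), tree_b.predict(X))
print(f"Identical predictions: {same_predictions}")
```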

The random_state parameter is not a tuning knob: no seed is systematically better than another. Different seeds can still produce slightly different trees and scores when the fit involves a random choice.

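If random_state is left at its default of None, results can differ from one fit to the next, even on the same data, whenever tree building involves a random choice. When no such choice arises, unseeded runs give identical results.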
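
To see where the randomness can come in, the sketch below fits several unseeded trees that each consider only one randomly chosen feature per split (max_features=1, a non-default setting assumed here for illustration) and records which feature ends up at the root. The recorded indices may vary between fits and between runs of the script, so no particular output is shown.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Five-feature synthetic dataset; sizes and seed are illustrative assumptions
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

root_features = []
for _ in range(10):
    # random_state=None draws from NumPy's global random state, unseeded here
    tree = DecisionTreeRegressor(max_features=1, random_state=None).fit(X, y)
    # tree_.feature[0] is the index of the feature used at the root split
    root_features.append(int(tree.tree_.feature[0]))

print(f"Root split feature per fit: {root_features}")
```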

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare an unseeded model with two models that share the same seed
random_state_values = [None, 42, 42]
mse_values = []

for rs in random_state_values:
    dt = DecisionTreeRegressor(random_state=rs)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_values.append(mse)
    print(f"random_state={rs}, MSE: {mse:.3f}")

Running the example gives an output like:

random_state=None, MSE: 881.838
random_state=42, MSE: 881.838
random_state=42, MSE: 881.838

Here the unseeded model matches the seeded ones: with a single feature and the default splitter="best", there is no feature subset to sample and no random threshold to draw, so the seed has little or nothing to change.

The key steps in this example are:

Generate a synthetic regression dataset
Split the data into train and test sets
Train DecisionTreeRegressor models with random_state unset and with a fixed seed
Evaluate the mean squared error (MSE) of each model on the test set

Some tips for setting random_state:

Set it to a fixed value for reproducibility, especially when debugging or comparing models
If not set, the model can produce different results from run to run, even on the same data, when tree building involves a random choice

Issues to consider:

Setting random_state is good practice whenever results need to be reproduced later or by someone else
The seed is not a hyperparameter to tune; it changes which tree is built when random choices arise, not how good trees are on average
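
One way to gauge how much the seed matters for a given configuration is to repeat the fit over several seeds and look at the spread of test scores. The sketch below does this for a tree with max_features=1 (an illustrative, non-default choice); the seed range and dataset sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Five-feature synthetic dataset with a fixed split so only the tree seed varies
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

seed_scores = []
for seed in range(5):
    # max_features=1 samples one candidate feature per split, so the seed matters
    tree = DecisionTreeRegressor(max_features=1, random_state=seed)
    tree.fit(X_train, y_train)
    seed_scores.append(mean_squared_error(y_test, tree.predict(X_test)))

# A small spread suggests conclusions do not hinge on the particular seed
print(f"MSE mean: {np.mean(seed_scores):.3f}, std: {np.std(seed_scores):.3f}")
```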
