Configure DecisionTreeClassifier "random_state" Parameter

The random_state parameter in scikit-learn’s DecisionTreeClassifier seeds the random number generator used while the tree is built, and so controls the randomness of the training process.

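Training a decision tree in scikit-learn involves random choices. The features are randomly permuted at each split, even with the default splitter="best". When max_features is smaller than the number of features, a random subset of features is considered at each split, and when several candidate splits give the same improvement, one of them is picked at random. Setting random_state to a fixed integer ensures that the same random choices are made each time the model is trained, leading to reproducible results.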
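As a minimal sketch of this behaviour, the snippet below fits two trees with the same seed and checks that both pick the same split features and thresholds. The dataset and the max_features="sqrt" setting are assumptions made for illustration; the latter makes each split consider a random subset of features, so the seed has a visible effect.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data (an assumption for this sketch)
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)

# max_features="sqrt" makes each split consider a random subset of features
tree_a = DecisionTreeClassifier(max_features="sqrt", random_state=42).fit(X, y)
tree_b = DecisionTreeClassifier(max_features="sqrt", random_state=42).fit(X, y)

# Same seed, same data, same parameters: the fitted trees should match node by node
same_features = np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature)
same_thresholds = np.array_equal(tree_a.tree_.threshold, tree_b.tree_.threshold)
print(f"Same split features: {same_features}, same thresholds: {same_thresholds}")
```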

If random_state is not set (or set to None), the random choices can differ from run to run, so repeated training may produce slightly different models even with the same training data and parameters.
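To see this variability, one option is to fit several unseeded trees and count how many distinct structures come out. This is a rough sketch: the dataset, the number of fits, and max_features="sqrt" (used here to make the randomness more pronounced) are illustrative assumptions, and the count will vary between runs.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data (an assumption for this sketch)
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)

n_fits = 5
structures = set()
for _ in range(n_fits):
    # random_state is left at its default, so each fit draws fresh random choices
    tree = DecisionTreeClassifier(max_features="sqrt").fit(X, y)
    # Record which feature each node splits on (leaf nodes are marked with -2)
    structures.add(tuple(tree.tree_.feature.tolist()))

n_distinct = len(structures)
print(f"{n_distinct} distinct tree structures out of {n_fits} unseeded fits")
```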

The default value for random_state is None.
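The default can be checked directly on a freshly constructed estimator using get_params:

```python
from sklearn.tree import DecisionTreeClassifier

# Inspect the constructor defaults without fitting anything
default_random_state = DecisionTreeClassifier().get_params()["random_state"]
print(default_random_state)  # None
```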

In practice, it’s common to set random_state to an arbitrary fixed integer (e.g., 42). The training procedure still makes its random choices, but it makes the same ones on every run, so results are reproducible.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_classes=2, n_features=10,
                           n_informative=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different random_state values
random_state_values = [None, 42, 123, 456]
accuracies = []

for rs in random_state_values:
    dt = DecisionTreeClassifier(random_state=rs)
    dt.fit(X_train, y_train)
    y_pred = dt.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"random_state={rs}, Accuracy: {accuracy:.3f}")

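The output will look similar to the following. Exact values may differ with your scikit-learn version, and the random_state=None line can change from run to run: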

random_state=None, Accuracy: 0.890
random_state=42, Accuracy: 0.895
random_state=123, Accuracy: 0.895
random_state=456, Accuracy: 0.900

The key steps in this example are:

Generate a synthetic binary classification dataset
Split the data into train and test sets
Train DecisionTreeClassifier models with different random_state values
Evaluate the accuracy of each model on the test set

Tips and heuristics for setting random_state:

Use a fixed value for random_state to ensure reproducibility of results
The specific value doesn’t matter as long as it’s fixed, but using a memorable number like 42 is common
If you want the random choices to differ on each run, set random_state to None or don’t specify it

Issues to consider:

Setting different random_state values can result in slightly different models, even with the same data and parameters
Not setting random_state (or setting it to None) will cause different results each time the code is run, making it harder to reproduce findings
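One way to gauge how much the seed matters for a given dataset is to train over several seeds and summarize the spread of test accuracy. The sketch below assumes the same synthetic dataset as the example above; the choice of ten seeds is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same synthetic setup as the main example
X, y = make_classification(n_samples=1000, n_classes=2, n_features=10,
                           n_informative=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Arbitrary illustrative seeds; only the tree's seed changes between fits
seeds = range(10)
scores = np.array([
    DecisionTreeClassifier(random_state=seed).fit(X_train, y_train).score(X_test, y_test)
    for seed in seeds
])

print(f"Accuracy over {len(scores)} seeds: "
      f"mean={scores.mean():.3f}, std={scores.std():.3f}, "
      f"min={scores.min():.3f}, max={scores.max():.3f}")
```

A small spread suggests the reported accuracy does not depend much on the chosen seed. A large spread suggests that a single-seed result should be interpreted with caution.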
