The random_state parameter in scikit-learn’s Ridge class is used to control the pseudo-random number generation for reproducibility of results across multiple runs.
Ridge Regression is a linear regression technique that adds L2 regularization to ordinary least squares. The random_state parameter sets the seed of the pseudo-random number generator used when shuffling the data, which only happens with the stochastic solvers (solver="sag" or solver="saga").
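By default, random_state is set to None, which means the global random state from numpy.random is used. With a stochastic solver, this can cause slightly different results each time the model is run. Solvers other than sag and saga do not use random_state at all, so their results do not depend on it.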
To ensure reproducibility, random_state should be set to an integer value. This will guarantee that the same results are generated each time the model is run with the same data and parameters.
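The difference between None and an integer seed can be illustrated with scikit-learn's check_random_state utility, which estimators commonly use to turn a random_state argument into a generator. This is a minimal sketch of the seeding behavior, not a trace of what Ridge does internally; the variable names are chosen for illustration only.

```python
import numpy as np
from sklearn.utils import check_random_state

# None resolves to NumPy's global RandomState, so the values drawn from it
# depend on whatever else has already consumed that shared generator.
rng_default = check_random_state(None)

# An integer creates a fresh generator seeded with that value, so two
# generators built from the same integer produce the same sequence.
rng_a = check_random_state(42)
rng_b = check_random_state(42)

# Shuffle the indices 0..9 with each seeded generator
order_a = rng_a.permutation(10)
order_b = rng_b.permutation(10)

print("same shuffle order:", np.array_equal(order_a, order_b))
```

The same reasoning applies to any shuffling an estimator performs: a fixed integer seed fixes the order in which samples are visited.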
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
# Generate synthetic dataset
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different random_state values
random_state_values = [None, 42, 42, 128]
r2_scores = []
for rs in random_state_values:
    # Fit a Ridge model with the current random_state (default solver)
    ridge = Ridge(alpha=1.0, random_state=rs)
    ridge.fit(X_train, y_train)
    # Evaluate on the held-out test set
    y_pred = ridge.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    r2_scores.append(r2)
    print(f"random_state={rs}, R-squared: {r2:.3f}, Coefficients: {ridge.coef_}")
Running the example gives an output like:
random_state=None, R-squared: 0.800, Coefficients: [46.04348947]
random_state=42, R-squared: 0.800, Coefficients: [46.04348947]
random_state=42, R-squared: 0.800, Coefficients: [46.04348947]
random_state=128, R-squared: 0.800, Coefficients: [46.04348947]
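The identical scores and coefficients above are consistent with the default solver not consuming random_state on this dataset. To see the parameter take effect, a stochastic solver has to be selected explicitly. The following is a minimal sketch, assuming solver="sag"; the helper fit_coef is a hypothetical name introduced for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Same synthetic dataset as in the example above
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)

def fit_coef(seed):
    """Fit Ridge with the stochastic 'sag' solver and return its coefficients."""
    model = Ridge(alpha=1.0, solver="sag", random_state=seed)
    model.fit(X, y)
    return model.coef_

coef_a = fit_coef(42)
coef_b = fit_coef(42)   # same seed as coef_a
coef_c = fit_coef(128)  # different seed

# Same seed, same data, same parameters: the fits should match exactly
print("same seed identical:", np.array_equal(coef_a, coef_b))

# Different seeds visit the samples in a different order, so any gap here
# is expected to be small (on the order of the solver tolerance)
print("max abs difference across seeds:", np.max(np.abs(coef_a - coef_c)))
```

Any difference between seeds should shrink further as the solver converges, since all runs are minimizing the same objective.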
The key steps in this example are:
- Generate a synthetic regression dataset with noise
- Split the data into train and test sets
- Train Ridge models with different random_state values
- Evaluate the R-squared of each model on the test set
- Compare the model coefficients and scores
Some tips for setting random_state:
- Use an integer value for reproducibility across runs
- Models trained with the same random state, data, and parameters will be identical
Issues to consider:
- The default None value can produce slightly different results each time when a stochastic solver (sag or saga) is used
- Consistent seeding is important when comparing models or for reproducibility