The beta_2 parameter in scikit-learn's MLPRegressor controls the exponential decay rate for the second moment estimate in the Adam optimizer.
Adam (Adaptive Moment Estimation) is an optimization algorithm used for updating network weights. The beta_2 parameter specifically affects how the optimizer estimates the second moment (uncentered variance) of the gradients.
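As a rough sketch of the idea (not scikit-learn's internal code, and the helper name here is made up for illustration), the second moment estimate is an exponential moving average of the squared gradients, with beta_2 as the decay factor:
import numpy as np

def update_second_moment(v_prev, grad, beta_2=0.999, t=1):
    # Decay the previous estimate and mix in the current squared gradient
    v = beta_2 * v_prev + (1.0 - beta_2) * grad ** 2
    # Bias correction counteracts the zero initialization of v
    v_hat = v / (1.0 - beta_2 ** t)
    return v, v_hat

v, v_hat = update_second_moment(v_prev=np.zeros(2), grad=np.array([0.5, -2.0]))
print(v, v_hat)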
A higher beta_2 value results in a slower decay of the second moment estimate, which can help smooth out the learning process in the presence of noisy gradients. Conversely, a lower value allows for quicker adaptation to changes in the gradient.
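A rough rule of thumb (illustrative, not part of the scikit-learn API) is that the moving average effectively covers about 1 / (1 - beta_2) recent gradients, which makes the difference between values concrete:
for beta_2 in [0.9, 0.99, 0.999]:
    # Approximate number of past gradients that influence the estimate
    print(f"beta_2={beta_2}: averages over roughly {1 / (1 - beta_2):.0f} past gradients")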
The default value for beta_2 is 0.999. In practice, values between 0.9 and 0.999 are commonly used, with 0.999 being a popular choice for many problems.
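The default can be confirmed directly on a fresh estimator; note that beta_2 only takes effect when solver='adam', which is also the default solver:
from sklearn.neural_network import MLPRegressor

model = MLPRegressor()
print(model.solver)  # 'adam'
print(model.beta_2)  # 0.999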
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different beta_2 values
beta_2_values = [0.9, 0.99, 0.999, 0.9999]
mse_scores = []
for beta_2 in beta_2_values:
    mlp = MLPRegressor(hidden_layer_sizes=(100,), max_iter=500, random_state=42, beta_2=beta_2)
    mlp.fit(X_train, y_train)
    y_pred = mlp.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)
    print(f"beta_2={beta_2}, MSE: {mse:.4f}")
Running the example gives an output like:
beta_2=0.9, MSE: 3.3350
beta_2=0.99, MSE: 21.6741
beta_2=0.999, MSE: 139.3109
beta_2=0.9999, MSE: 152.1716
The key steps in this example are:
- Generate a synthetic regression dataset with multiple features
- Split the data into train and test sets
- Train MLPRegressor models with different beta_2 values
- Evaluate the mean squared error of each model on the test set (an optional plotting sketch follows this list)
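As an optional follow-up, a quick plot of the collected scores can make the trend easier to see (this assumes matplotlib is installed and reuses the beta_2_values and mse_scores lists from the example above):
import matplotlib.pyplot as plt

plt.plot([str(b) for b in beta_2_values], mse_scores, marker="o")
plt.xlabel("beta_2")
plt.ylabel("Test MSE")
plt.title("Effect of beta_2 on MLPRegressor test error")
plt.show()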
Some tips and heuristics for setting beta_2:
- Start with the default value of 0.999 and adjust if needed (see the tuning sketch after this list)
- Use higher values (closer to 1) for problems with sparse gradients
- Lower values may work better for problems with rapidly changing gradients
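One way to follow the first tip is to treat beta_2 like any other hyperparameter and search over a few candidates. This sketch reuses X_train and y_train from the example above; the candidate grid is illustrative rather than prescriptive:
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {"beta_2": [0.9, 0.99, 0.999]}
search = GridSearchCV(
    MLPRegressor(hidden_layer_sizes=(100,), max_iter=500, random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)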
Issues to consider:
- The optimal beta_2 value can depend on the specific problem and dataset
- Very high values (>0.999) might slow down convergence
- Very low values (<0.9) may lead to unstable training
- Consider the interplay between beta_2 and other optimizer parameters like the learning rate (a small sketch follows below)
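To probe that interplay, one option is a small grid over learning_rate_init and beta_2 together. The value pairs below are illustrative assumptions, and the data comes from the train/test split in the example above:
from itertools import product

from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

for lr, beta_2 in product([0.001, 0.01], [0.9, 0.999]):
    model = MLPRegressor(hidden_layer_sizes=(100,), max_iter=500,
                         random_state=42, learning_rate_init=lr, beta_2=beta_2)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"learning_rate_init={lr}, beta_2={beta_2}, MSE: {mse:.4f}")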