The beta_1 parameter in scikit-learn's MLPClassifier controls the exponential decay rate for the first moment estimates in the Adam optimizer, and it is only used when solver='adam' (the default solver).
Adam (Adaptive Moment Estimation) is an optimization algorithm used for training neural networks. The beta_1 parameter influences how quickly the optimizer adapts to changes in the gradient: a higher beta_1 gives more weight to past gradients, resulting in slower adaptation, while a lower value makes the optimizer more responsive to recent gradients.
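To make "weight on past gradients" concrete, here is a minimal, illustrative sketch of the first moment update Adam uses, m_t = beta_1 * m_(t-1) + (1 - beta_1) * g_t, applied to a noisy gradient sequence. This is not MLPClassifier's internal code, and Adam's bias correction is omitted for brevity; the values of beta_1 and the gradients are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
# A noisy sequence of gradients fluctuating around a "true" value of 1.0
grads = 1.0 + rng.normal(scale=0.5, size=10)

def first_moment(grads, beta_1):
    """Exponential moving average of gradients, as in Adam's first moment estimate."""
    m = 0.0
    history = []
    for g in grads:
        m = beta_1 * m + (1.0 - beta_1) * g
        history.append(round(m, 3))
    return history

for beta_1 in (0.8, 0.99):
    print(beta_1, first_moment(grads, beta_1))

With beta_1=0.99 the moving average changes very slowly (smoother but slower to adapt), while beta_1=0.8 tracks the most recent gradients much more closely.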
The default value for beta_1 is 0.9, and it must lie in the range [0, 1). In practice, values between 0.9 and 0.999 are commonly used, with the default of 0.9 being a good starting point for most applications.
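As a quick check, you can confirm the default on your installed scikit-learn version via get_params():

from sklearn.neural_network import MLPClassifier

# Inspect the default value of beta_1 on the installed scikit-learn version
print(MLPClassifier().get_params()["beta_1"])  # 0.9

The example below fits MLPClassifier with several beta_1 values on a synthetic multi-class dataset and compares test-set accuracy.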
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different beta_1 values and collect test accuracies
beta_1_values = [0.8, 0.9, 0.95, 0.99]
accuracies = []

for beta_1 in beta_1_values:
    mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42, beta_1=beta_1)
    mlp.fit(X_train, y_train)
    y_pred = mlp.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"beta_1={beta_1}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
beta_1=0.8, Accuracy: 0.870
beta_1=0.9, Accuracy: 0.870
beta_1=0.95, Accuracy: 0.885
beta_1=0.99, Accuracy: 0.900
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train MLPClassifier models with different beta_1 values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting beta_1:
- Start with the default value of 0.9 and adjust only if needed (a small grid search, sketched after this list, is one way to compare candidates)
- Lower values may help with sparse gradients or noisy data, since the optimizer reacts faster to recent gradients
- Higher values smooth the update direction more, which can stabilize training when gradients are consistent
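As an illustration of the "start with the default and adjust" advice, the sketch below runs a small grid search over beta_1 on the same synthetic data as the earlier example. The candidate values, cv=3, and scoring choice are arbitrary illustrative settings, not recommendations.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)

# Small, illustrative grid of candidate beta_1 values
param_grid = {"beta_1": [0.8, 0.9, 0.95, 0.99]}

grid = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)

print("Best beta_1:", grid.best_params_["beta_1"])
print("Best CV accuracy:", round(grid.best_score_, 3))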
Issues to consider:
- The optimal beta_1 value can depend on the specific dataset and problem
- Very high values (close to 1) may slow down convergence
- Very low values may make the optimizer too sensitive to recent gradients
- The effect of beta_1 can interact with other hyperparameters, such as the learning rate (see the sketch after this list)
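One simple way to probe that interaction is to evaluate a few combinations of learning_rate_init and beta_1 on the same train/test split used earlier. The specific values below are illustrative, not tuned recommendations.

from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative combinations of learning rate and beta_1
for lr, beta_1 in product([0.0005, 0.001, 0.005], [0.8, 0.9, 0.99]):
    mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42,
                        learning_rate_init=lr, beta_1=beta_1)
    mlp.fit(X_train, y_train)
    acc = accuracy_score(y_test, mlp.predict(X_test))
    print(f"learning_rate_init={lr}, beta_1={beta_1}, Accuracy: {acc:.3f}")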