The beta_1 parameter in scikit-learn's MLPClassifier controls the exponential decay rate for the first moment estimates in the Adam optimizer, and it is only used when solver='adam' (the default solver).
Adam (Adaptive Moment Estimation) is an optimization algorithm used for training neural networks. The beta_1 parameter influences how quickly the optimizer adapts to changes in the gradient: a higher beta_1 gives more weight to past gradients, resulting in slower adaptation, while a lower value makes the optimizer more responsive to recent gradients.
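To make "weight on past gradients" concrete, here is a minimal, illustrative sketch of the first moment update Adam uses, m_t = beta_1 * m_(t-1) + (1 - beta_1) * g_t, applied to a noisy gradient sequence. This is not MLPClassifier's internal code, and Adam's bias correction is omitted for brevity; the values of beta_1 and the gradients are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
# A noisy sequence of gradients fluctuating around a "true" value of 1.0
grads = 1.0 + rng.normal(scale=0.5, size=10)

def first_moment(grads, beta_1):
    """Exponential moving average of gradients, as in Adam's first moment estimate."""
    m = 0.0
    history = []
    for g in grads:
        m = beta_1 * m + (1.0 - beta_1) * g
        history.append(round(m, 3))
    return history

for beta_1 in (0.8, 0.99):
    print(beta_1, first_moment(grads, beta_1))

With beta_1=0.99 the moving average changes very slowly (smoother but slower to adapt), while beta_1=0.8 tracks the most recent gradients much more closely.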
The default value for beta_1 is 0.9, and it must lie in the range [0, 1). In practice, values between 0.9 and 0.999 are commonly used, with the default of 0.9 being a good starting point for most applications.
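As a quick check, you can confirm the default on your installed scikit-learn version via get_params():

from sklearn.neural_network import MLPClassifier

# Inspect the default value of beta_1 on the installed scikit-learn version
print(MLPClassifier().get_params()["beta_1"])  # 0.9

The example below fits MLPClassifier with several beta_1 values on a synthetic multi-class dataset and compares test-set accuracy.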
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different beta_1 values and collect test accuracies
beta_1_values = [0.8, 0.9, 0.95, 0.99]
accuracies = []

for beta_1 in beta_1_values:
    mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42, beta_1=beta_1)
    mlp.fit(X_train, y_train)
    y_pred = mlp.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"beta_1={beta_1}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
beta_1=0.8, Accuracy: 0.870
beta_1=0.9, Accuracy: 0.870
beta_1=0.95, Accuracy: 0.885
beta_1=0.99, Accuracy: 0.900
The key steps in this example are:
- Generate a synthetic multi-class classification dataset
- Split the data into train and test sets
- Train MLPClassifier models with different beta_1 values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting beta_1:
- Start with the default value of 0.9 and adjust only if needed (a small grid search, sketched after this list, is one way to compare candidates)
- Lower values may help with sparse gradients or noisy data, since the optimizer reacts faster to recent gradients
- Higher values smooth the update direction more, which can stabilize training when gradients are consistent
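As an illustration of the "start with the default and adjust" advice, the sketch below runs a small grid search over beta_1 on the same synthetic data as the earlier example. The candidate values, cv=3, and scoring choice are arbitrary illustrative settings, not recommendations.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)

# Small, illustrative grid of candidate beta_1 values
param_grid = {"beta_1": [0.8, 0.9, 0.95, 0.99]}

grid = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)

print("Best beta_1:", grid.best_params_["beta_1"])
print("Best CV accuracy:", round(grid.best_score_, 3))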
Issues to consider:
- The optimal beta_1 value can depend on the specific dataset and problem
- Very high values (close to 1) may slow down convergence
- Very low values may make the optimizer too sensitive to recent gradients
- The effect of beta_1 can interact with other hyperparameters, such as the learning rate (see the sketch after this list)
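One simple way to probe that interaction is to evaluate a few combinations of learning_rate_init and beta_1 on the same train/test split used earlier. The specific values below are illustrative, not tuned recommendations.

from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative combinations of learning rate and beta_1
for lr, beta_1 in product([0.0005, 0.001, 0.005], [0.8, 0.9, 0.99]):
    mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=42,
                        learning_rate_init=lr, beta_1=beta_1)
    mlp.fit(X_train, y_train)
    acc = accuracy_score(y_test, mlp.predict(X_test))
    print(f"learning_rate_init={lr}, beta_1={beta_1}, Accuracy: {acc:.3f}")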