Configure SGDClassifier "validation_fraction" Parameter

The validation_fraction parameter in scikit-learn’s SGDClassifier determines the proportion of training data to set aside as a validation set for early stopping. It is only used when early_stopping=True.

Stochastic Gradient Descent (SGD) is an optimization method used in various machine learning algorithms. Early stopping is a technique to prevent overfitting: the model’s score on a held-out validation set is monitored during training, and training ends once that score fails to improve by at least tol for n_iter_no_change consecutive epochs.

The validation_fraction parameter controls how much of the training data is used for early stopping validation. A larger value provides more data for validation but reduces the amount of data available for training.

The default value for validation_fraction is 0.1, meaning 10% of the training data is held out for validation. The value must lie strictly between 0 and 1. Common choices range from 0.1 to 0.3, depending on the dataset size and problem complexity.
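The sketch below shows how many samples each setting would hold out. It uses train_test_split with stratify as a stand-in for the hold-out step; that choice and the 8,000-sample training set are assumptions for illustration, not the estimator's internal code path.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical training set of 8,000 samples (illustrative only)
X_train, y_train = make_classification(n_samples=8000, n_features=20, random_state=42)

split_sizes = {}
for val_frac in [0.1, 0.2, 0.3]:
    # Stand-in for the hold-out: a stratified split of the training data
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train, y_train, test_size=val_frac, stratify=y_train, random_state=42
    )
    split_sizes[val_frac] = (len(X_fit), len(X_val))
    print(f"validation_fraction={val_frac}: "
          f"{len(X_fit)} samples for fitting, {len(X_val)} for validation")
```

With the default of 0.1, roughly one sample in ten is held out, and every increase in the fraction takes the same share away from the data used for fitting.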

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different validation_fraction values
val_fractions = [0.1, 0.2, 0.3]
accuracies = []

for val_frac in val_fractions:
    sgd = SGDClassifier(loss='log_loss', max_iter=1000, tol=1e-3,
                        validation_fraction=val_frac, n_iter_no_change=5,
                        random_state=42)
    sgd.fit(X_train, y_train)
    y_pred = sgd.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"validation_fraction={val_frac}, Accuracy: {accuracy:.3f}, "
          f"Iterations: {sgd.n_iter_}")

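Running the example gives an output like the one below. The three rows are identical because early_stopping is left at its default of False, so validation_fraction is ignored and no validation set is held out. Pass early_stopping=True to make the parameter take effect.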

validation_fraction=0.1, Accuracy: 0.795, Iterations: 76
validation_fraction=0.2, Accuracy: 0.795, Iterations: 76
validation_fraction=0.3, Accuracy: 0.795, Iterations: 76

The key steps in this example are:

  1. Generate a synthetic binary classification dataset
  2. Split the data into train and test sets
  3. Train SGDClassifier models with different validation_fraction values
  4. Evaluate the accuracy of each model on the test set
  5. Compare the number of iterations and final accuracy for each configuration

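Some tips and heuristics for setting validation_fraction:

  1. Enable early_stopping, and tune validation_fraction together with n_iter_no_change and tol, since these decide when a stalled validation score ends training.
  2. Start from the default and increase the value if the validation score is too noisy to give a stable stopping signal.
  3. On small datasets, keep the value low so that enough data remains for training.

Issues to consider:

  1. The held-out samples are not used to fit the model, so a larger value shrinks the effective training set.
  2. The validation split is drawn at random, so set random_state for reproducible results.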


