
Configure SGDClassifier "shuffle" Parameter

The shuffle parameter in scikit-learn’s SGDClassifier determines whether the training data is shuffled before each epoch.

Stochastic Gradient Descent (SGD) is an iterative optimization algorithm that updates model parameters based on individual training samples. The shuffle parameter controls whether these samples are presented in a random order.
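To make the idea of "presentation order" concrete, here is a minimal sketch of how the sample order per epoch changes with shuffling. It is an illustration only, not scikit-learn's internal implementation; the helper name epoch_orders and its parameters are hypothetical.

```python
import random

def epoch_orders(n_samples, n_epochs, shuffle=True, seed=0):
    """Yield the order in which sample indices are visited in each epoch."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_epochs):
        if shuffle:
            rng.shuffle(indices)  # draw a new random order for this epoch
        yield list(indices)

# Without shuffling, every epoch visits the samples in the original order
for order in epoch_orders(5, 2, shuffle=False):
    print(order)

# With shuffling, each epoch visits the same samples in a different order
for order in epoch_orders(5, 2, shuffle=True, seed=0):
    print(order)
```

In both cases every sample is visited exactly once per epoch; only the order differs.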

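Shuffling can improve convergence speed and final model performance by reducing the impact of data ordering on parameter updates. The effect tends to be largest when the rows are sorted, for example by class label or by time, because long runs of similar samples would otherwise push the weights in the same direction for many consecutive updates. It also helps prevent the model from learning spurious patterns that may exist in the original data order.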

The default value for shuffle is True. In practice, it’s often beneficial to keep shuffling enabled unless there’s a specific reason to maintain the original data order.
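As a quick check of the default, the parameter can be read back from an estimator with get_params. The variable names below are illustrative. Note that when shuffle=True, the random_state parameter seeds the shuffling, so fixing it makes runs repeatable.

```python
from sklearn.linear_model import SGDClassifier

# Default configuration: shuffling is enabled
clf_default = SGDClassifier()
print(clf_default.get_params()["shuffle"])  # True

# Shuffling disabled: samples are visited in the order they appear in X
clf_ordered = SGDClassifier(shuffle=False)
print(clf_ordered.get_params()["shuffle"])  # False

# Shuffling enabled with a fixed seed for a repeatable sample order
clf_seeded = SGDClassifier(shuffle=True, random_state=42)
```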

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with shuffle=True (the default, set explicitly here for clarity)
sgd_shuffle = SGDClassifier(shuffle=True, random_state=42, max_iter=100)
sgd_shuffle.fit(X_train, y_train)
y_pred_shuffle = sgd_shuffle.predict(X_test)
accuracy_shuffle = accuracy_score(y_test, y_pred_shuffle)

# Train with shuffle=False
sgd_no_shuffle = SGDClassifier(shuffle=False, random_state=42, max_iter=100)
sgd_no_shuffle.fit(X_train, y_train)
y_pred_no_shuffle = sgd_no_shuffle.predict(X_test)
accuracy_no_shuffle = accuracy_score(y_test, y_pred_no_shuffle)

print(f"Accuracy with shuffle=True: {accuracy_shuffle:.3f}")
print(f"Accuracy with shuffle=False: {accuracy_no_shuffle:.3f}")

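Running the example gives an output like the one below. Similar or identical scores are plausible here because make_classification and train_test_split both shuffle samples by default, so the training rows are already in random order before SGDClassifier sees them: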

Accuracy with shuffle=True: 0.770
Accuracy with shuffle=False: 0.770

The key steps in this example are:

  1. Generate a synthetic binary classification dataset
  2. Split the data into train and test sets
  3. Train SGDClassifier models with shuffle=True and shuffle=False
  4. Evaluate and compare the accuracy of each model on the test set
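The steps above use training data that is already in random order. A variation that may show the effect of shuffle more clearly is to sort the training rows by class label first, so that shuffle=False sees all samples of one class before any of the other. This is a sketch under that assumption; the sorting step and variable names are additions for illustration, and the resulting accuracies depend on the data and scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same synthetic dataset and split as in the main example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Sort the training rows by label so the data has a strong ordering
order = np.argsort(y_train, kind="stable")
X_sorted, y_sorted = X_train[order], y_train[order]

# Train once with and once without shuffling on the sorted data
results = {}
for shuffle in (True, False):
    clf = SGDClassifier(shuffle=shuffle, random_state=42, max_iter=100)
    clf.fit(X_sorted, y_sorted)
    results[shuffle] = accuracy_score(y_test, clf.predict(X_test))
    print(f"Accuracy on sorted data with shuffle={shuffle}: {results[shuffle]:.3f}")
```

Comparing the two printed scores gives a rough sense of how sensitive the model is to sample order on this dataset.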

Tips for using the shuffle parameter:

Issues to consider:


