The shrinkage parameter in scikit-learn’s LinearDiscriminantAnalysis controls the regularization of the covariance matrix estimate.
Linear Discriminant Analysis (LDA) is a dimensionality reduction and classification technique that projects data onto a lower-dimensional space while maximizing class separability. The shrinkage parameter helps address issues with small sample sizes or high-dimensional data.
Shrinkage reduces the variance of the covariance matrix estimate by shrinking it towards a scaled identity matrix. This can improve the stability and generalization of the LDA model, especially when the number of features is large compared to the number of samples.
The default value for shrinkage is None, which means no shrinkage is applied. When set to ‘auto’, scikit-learn automatically determines the shrinkage intensity using the Ledoit-Wolf method. In practice, values between 0 and 1 can be used, with 0 meaning no shrinkage (the empirical covariance) and 1 meaning full shrinkage to the scaled identity matrix. Note that shrinkage is only supported by the ‘lsqr’ and ‘eigen’ solvers, not the default ‘svd’ solver.
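To make the interpolation concrete, here is a minimal sketch of what a fixed shrinkage value does to the empirical covariance, using the shrunk_covariance and ledoit_wolf helpers from sklearn.covariance (the toy data here is arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.covariance import ledoit_wolf, shrunk_covariance

# Empirical covariance of some toy data
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 5))
emp_cov = np.cov(X, rowvar=False, bias=True)

# A fixed shrinkage a blends the empirical covariance with a scaled identity:
# shrunk = (1 - a) * emp_cov + a * mu * I, where mu = trace(emp_cov) / n_features
a = 0.5
mu = np.trace(emp_cov) / emp_cov.shape[0]
manual = (1 - a) * emp_cov + a * mu * np.eye(emp_cov.shape[0])
print(np.allclose(manual, shrunk_covariance(emp_cov, shrinkage=a)))  # True

# 'auto' instead estimates the optimal intensity with the Ledoit-Wolf method
lw_cov, lw_shrinkage = ledoit_wolf(X)
print(0.0 <= lw_shrinkage <= 1.0)  # True
```

At shrinkage=0 the blend is the raw empirical covariance; at shrinkage=1 it is the scaled identity alone, which is why large values flatten out the feature correlations.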
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Train with different shrinkage values
shrinkage_values = [None, 'auto', 0.1, 0.5, 0.9]
accuracies = []
for shrinkage in shrinkage_values:
    lda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=shrinkage)
    lda.fit(X_train, y_train)
    y_pred = lda.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"shrinkage={shrinkage}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
shrinkage=None, Accuracy: 0.967
shrinkage=auto, Accuracy: 0.967
shrinkage=0.1, Accuracy: 0.967
shrinkage=0.5, Accuracy: 0.967
shrinkage=0.9, Accuracy: 0.933
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and redundant features
- Split the data into train and test sets
- Train LinearDiscriminantAnalysis models with different shrinkage values
- Evaluate the accuracy of each model on the test set
Some tips and heuristics for setting shrinkage:
- Use ‘auto’ for automatic shrinkage estimation when unsure about the optimal value
- Try values between 0 and 1 to find the best performance for your specific dataset
- Consider using shrinkage when dealing with high-dimensional data or small sample sizes
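To see the last tip in action, here is a sketch comparing no shrinkage against ‘auto’ on a dataset where the number of features approaches the number of samples (the dataset dimensions are chosen only for illustration; actual scores will vary):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

# Deliberately few samples relative to the number of features, so the
# empirical covariance estimate is poorly conditioned
X, y = make_classification(n_samples=60, n_features=50, n_informative=10,
                           n_classes=2, random_state=42)

for shrinkage in [None, 'auto']:
    lda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=shrinkage)
    scores = cross_val_score(lda, X, y, cv=5)
    print(f"shrinkage={shrinkage}, CV accuracy: {scores.mean():.3f}")
```

In regimes like this, the Ledoit-Wolf estimate typically picks a nonzero intensity, stabilizing the covariance estimate that the unregularized model struggles with.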
Issues to consider:
- The optimal shrinkage value depends on the dataset characteristics and sample size
- Too little shrinkage may not sufficiently regularize the covariance estimation
- Too much shrinkage can oversimplify the model and lead to underfitting
- The effect of shrinkage may be more pronounced in high-dimensional or noisy datasets
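Since the optimal value is dataset-dependent, one practical approach is to cross-validate over a grid of fixed shrinkage intensities. Here is a sketch using GridSearchCV on the same kind of synthetic dataset as above (the grid of 11 values is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, n_features=20, n_informative=2,
                           n_redundant=2, n_classes=2, random_state=42)

# Search a grid of fixed shrinkage intensities between 0 and 1
param_grid = {'shrinkage': np.linspace(0.0, 1.0, 11)}
search = GridSearchCV(LinearDiscriminantAnalysis(solver='lsqr'),
                      param_grid, cv=5)
search.fit(X, y)
print(f"best shrinkage: {search.best_params_['shrinkage']:.1f}, "
      f"CV accuracy: {search.best_score_:.3f}")
```

Comparing the cross-validated winner against the ‘auto’ setting is a quick sanity check: if they disagree substantially, the dataset is probably small or noisy enough that the choice matters.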