Configure LinearDiscriminantAnalysis "store_covariance" Parameter

The store_covariance parameter in scikit-learn’s LinearDiscriminantAnalysis controls whether to explicitly compute and store the covariance matrix.

Linear Discriminant Analysis (LDA) is a method used for classification and dimensionality reduction. It projects the data onto a lower-dimensional space while maximizing the separability between classes.

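With the default solver="svd", LDA does not need the covariance matrix to fit the model, so the weighted within-class covariance matrix is computed only when store_covariance=True. It is then available as the covariance_ attribute after fitting, at the cost of some extra computation and memory. With the "lsqr" and "eigen" solvers, the matrix is always computed and stored, regardless of this setting.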

By default, store_covariance is set to False to save memory. Set it to True when you need to access the covariance matrix for further analysis. The matrix has n_features × n_features entries, so its cost depends on the number of features rather than the number of samples.
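To see how the solver interacts with this setting, the following minimal sketch fits LDA under each solver with both values and records whether covariance_ exists afterwards. The dataset sizes and the has_cov dictionary are illustrative assumptions, not part of the main example, and the behavior shown assumes a reasonably recent scikit-learn release.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Small illustrative dataset; n_redundant=0 avoids linearly dependent
# features, which could make the covariance matrix singular for "eigen"
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)

# Record whether covariance_ exists for each (solver, store_covariance) pair
has_cov = {}
for solver in ("svd", "lsqr", "eigen"):
    for store in (False, True):
        lda = LinearDiscriminantAnalysis(solver=solver, store_covariance=store)
        lda.fit(X, y)
        has_cov[(solver, store)] = hasattr(lda, "covariance_")
        print(f"solver={solver!r}, store_covariance={store}: "
              f"covariance_ available = {has_cov[(solver, store)]}")
```

Only the "svd" solver with store_covariance=False should report the attribute as unavailable.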

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train LDA models with different store_covariance values
lda_false = LinearDiscriminantAnalysis(store_covariance=False)
lda_true = LinearDiscriminantAnalysis(store_covariance=True)

lda_false.fit(X_train, y_train)
lda_true.fit(X_train, y_train)

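# Compare shallow object sizes. Note: __sizeof__() does not include the
# fitted NumPy arrays, so it cannot show the cost of storing covariance_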
size_false = lda_false.__sizeof__()
size_true = lda_true.__sizeof__()

# Evaluate models
y_pred_false = lda_false.predict(X_test)
y_pred_true = lda_true.predict(X_test)

acc_false = accuracy_score(y_test, y_pred_false)
acc_true = accuracy_score(y_test, y_pred_true)

print(f"store_covariance=False: Size={size_false} bytes, Accuracy={acc_false:.3f}")
print(f"store_covariance=True: Size={size_true} bytes, Accuracy={acc_true:.3f}")

# Access covariance matrix (only available when store_covariance=True)
try:
    cov_matrix = lda_false.covariance_
    print("Covariance matrix accessible for store_covariance=False")
except AttributeError:
    print("Covariance matrix not accessible for store_covariance=False")

try:
    cov_matrix = lda_true.covariance_
    print("Covariance matrix accessible for store_covariance=True")
    print("Covariance matrix shape:", cov_matrix.shape)
except AttributeError:
    print("Covariance matrix not accessible for store_covariance=True")

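Running the example gives output like the following. The two sizes match because __sizeof__() measures only the estimator object itself, not the arrays it references: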

store_covariance=False: Size=16 bytes, Accuracy=0.740
store_covariance=True: Size=16 bytes, Accuracy=0.740
Covariance matrix not accessible for store_covariance=False
Covariance matrix accessible for store_covariance=True
Covariance matrix shape: (20, 20)
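Because __sizeof__() is shallow, a more informative comparison sums the bytes held by the fitted arrays. The sketch below is one rough way to do that; fitted_array_bytes is a hypothetical helper written for this illustration, and it counts only NumPy arrays stored directly as attributes on the estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def fitted_array_bytes(model):
    """Sum the bytes of all NumPy arrays stored directly on the estimator."""
    return sum(value.nbytes for value in vars(model).values()
               if isinstance(value, np.ndarray))


X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)

lda_false = LinearDiscriminantAnalysis(store_covariance=False).fit(X, y)
lda_true = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)

bytes_false = fitted_array_bytes(lda_false)
bytes_true = fitted_array_bytes(lda_true)
extra_bytes = bytes_true - bytes_false

print(f"store_covariance=False: {bytes_false} bytes in fitted arrays")
print(f"store_covariance=True:  {bytes_true} bytes in fitted arrays")
print(f"Extra bytes: {extra_bytes} "
      f"(covariance_ holds {lda_true.covariance_.nbytes} bytes)")
```

The difference should equal the size of covariance_ itself. A 20 × 20 matrix of 64-bit floats takes 20 × 20 × 8 = 3200 bytes, so the overhead is small here and grows quadratically with the number of features.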

The key steps in this example are:

Generate a synthetic multi-class dataset suitable for LDA
Split the data into train and test sets
Create two LDA models with different store_covariance values
Fit both models and compare reported object size and accuracy
Attempt to access the covariance matrix for both models

Some tips and heuristics for setting store_covariance:

Use True when you need to access the covariance matrix for further analysis
Use False (default) to save memory, especially when the number of features is large
Consider available memory and the number of features when deciding, since the matrix has n_features × n_features entries
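As an illustration of "further analysis", the sketch below inspects what covariance_ contains. The scikit-learn documentation describes it as the prior-weighted sum of the per-class covariance matrices, each estimated with the biased (divide-by-n) estimator. The code rebuilds that sum by hand for comparison and then derives a correlation matrix. The names manual_cov and corr are illustrative, and the check assumes the default solver with no shrinkage.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)

# Rebuild the matrix by hand: prior-weighted sum of the biased
# (divide-by-n) covariance of each class
manual_cov = sum(
    prior * np.cov(X[y == label].T, bias=True)
    for label, prior in zip(lda.classes_, lda.priors_)
)
matches = np.allclose(manual_cov, lda.covariance_)
print("Matches manual within-class covariance:", matches)

# Example of further analysis: convert the covariance to a correlation matrix
std = np.sqrt(np.diag(lda.covariance_))
corr = lda.covariance_ / np.outer(std, std)
print("Correlation matrix shape:", corr.shape)
```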

Issues to consider:

There’s a trade-off between memory usage and access to the covariance matrix
The parameter doesn’t affect predictions or accuracy, only memory usage and whether covariance_ is available
With a very large number of features, storing the n_features × n_features covariance matrix may cause memory issues
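The claim that the setting leaves the fitted model unchanged can be checked directly. This minimal sketch reuses the dataset settings from the main example and compares predictions and coefficients of the two models; same_labels and same_coef are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

lda_false = LinearDiscriminantAnalysis(store_covariance=False).fit(X_train, y_train)
lda_true = LinearDiscriminantAnalysis(store_covariance=True).fit(X_train, y_train)

# Both models should produce the same labels and the same coefficients
same_labels = np.array_equal(lda_false.predict(X_test), lda_true.predict(X_test))
same_coef = np.allclose(lda_false.coef_, lda_true.coef_)

print("Identical predictions:", same_labels)
print("Identical coefficients:", same_coef)
```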
