The store_covariance parameter in scikit-learn’s LinearDiscriminantAnalysis controls whether the covariance matrix is explicitly computed and stored.
Linear Discriminant Analysis (LDA) is a method used for classification and dimensionality reduction. It projects the data onto a lower-dimensional space while maximizing the separability between classes.
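As a quick illustration of those two roles, here is a minimal sketch using the Iris dataset (an assumption for illustration; it is not the dataset used in the main example). With 3 classes, the projection has at most n_classes - 1 = 2 dimensions.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Iris: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis()

# Classification: fit the model, then predict class labels
lda.fit(X, y)
labels = lda.predict(X)

# Dimensionality reduction: project onto the discriminant axes
# (at most n_classes - 1 of them)
X_projected = lda.transform(X)

print("Original shape: ", X.shape)
print("Projected shape:", X_projected.shape)
```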
With the default svd solver, LDA does not need to form the covariance matrix explicitly. Setting store_covariance to True makes it compute the weighted within-class covariance matrix and expose it as the covariance_ attribute after fitting, at the cost of extra computation and memory.
By default, store_covariance is set to False to save memory. Set it to True when you need to access the covariance matrix for further analysis, or when the number of features is small enough that an n_features × n_features matrix isn’t a memory concern.
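To get a feel for the memory cost, the sketch below fits LDA on synthetic data with an increasing number of features and reports the size of covariance_. The helper name covariance_nbytes and the dataset settings are assumptions made for this illustration. Since the matrix holds n_features × n_features float64 values, its size should grow as 8 × n_features² bytes, regardless of how many samples there are.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def covariance_nbytes(n_features):
    """Fit LDA with store_covariance=True and return the size of covariance_ in bytes."""
    # Hypothetical dataset settings, chosen only for this illustration
    X, y = make_classification(
        n_samples=500,
        n_features=n_features,
        n_informative=10,
        n_redundant=0,
        n_classes=3,
        random_state=0,
    )
    lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
    return lda.covariance_.nbytes


# The stored matrix grows with the square of the number of features
sizes = {n: covariance_nbytes(n) for n in (20, 100, 200)}
for n, nbytes in sizes.items():
    print(f"n_features={n:>3}: covariance_ uses {nbytes} bytes")
```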
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
import numpy as np
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train LDA models with different store_covariance values
lda_false = LinearDiscriminantAnalysis(store_covariance=False)
lda_true = LinearDiscriminantAnalysis(store_covariance=True)
lda_false.fit(X_train, y_train)
lda_true.fit(X_train, y_train)
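# Compare object sizes
# Note: __sizeof__() reports only the shallow size of the Python object itself,
# not the NumPy arrays it references, so it does not reflect the stored matrix
size_false = lda_false.__sizeof__()
size_true = lda_true.__sizeof__()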
# Evaluate models
y_pred_false = lda_false.predict(X_test)
y_pred_true = lda_true.predict(X_test)
acc_false = accuracy_score(y_test, y_pred_false)
acc_true = accuracy_score(y_test, y_pred_true)
print(f"store_covariance=False: Size={size_false} bytes, Accuracy={acc_false:.3f}")
print(f"store_covariance=True: Size={size_true} bytes, Accuracy={acc_true:.3f}")
# Access covariance matrix (only available when store_covariance=True)
try:
    cov_matrix = lda_false.covariance_
    print("Covariance matrix accessible for store_covariance=False")
except AttributeError:
    print("Covariance matrix not accessible for store_covariance=False")
try:
    cov_matrix = lda_true.covariance_
    print("Covariance matrix accessible for store_covariance=True")
    print("Covariance matrix shape:", cov_matrix.shape)
except AttributeError:
    print("Covariance matrix not accessible for store_covariance=True")
Running the example gives an output like:
store_covariance=False: Size=16 bytes, Accuracy=0.740
store_covariance=True: Size=16 bytes, Accuracy=0.740
Covariance matrix not accessible for store_covariance=False
Covariance matrix accessible for store_covariance=True
Covariance matrix shape: (20, 20)
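Both models report the same size because __sizeof__() only measures the estimator object itself, not the arrays it holds. One way to see the actual difference is to sum the nbytes of the NumPy arrays stored on each fitted model. The sketch below does that with a hypothetical helper, fitted_array_nbytes, which is not part of scikit-learn and only counts arrays stored directly as attributes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def fitted_array_nbytes(model):
    """Sum the buffer sizes of all NumPy arrays stored directly on the estimator."""
    return sum(v.nbytes for v in vars(model).values() if isinstance(v, np.ndarray))


X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                           n_informative=10, random_state=42)

lda_false = LinearDiscriminantAnalysis(store_covariance=False).fit(X, y)
lda_true = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)

# The only extra array on lda_true should be covariance_ (20 x 20 float64 values)
extra = fitted_array_nbytes(lda_true) - fitted_array_nbytes(lda_false)
print(f"Extra array memory with store_covariance=True: {extra} bytes")
print(f"Size of covariance_ alone: {lda_true.covariance_.nbytes} bytes")
```

For 20 features the difference is small, which suggests the setting only matters for memory when the feature count is large.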
The key steps in this example are:
- Generate a synthetic multi-class dataset suitable for LDA
- Split the data into train and test sets
- Create two LDA models with different store_covariance values
- Fit both models and compare their reported object size and accuracy
- Attempt to access the covariance matrix for both models
Some tips and heuristics for setting store_covariance:
- Use True when you need to access the covariance matrix for further analysis
- Use False (default) to save memory, especially when the number of features is large
- Consider computational resources and the number of features when deciding
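As one possible example of "further analysis", the stored covariance matrix can be turned into a correlation matrix to inspect how features vary together within classes. This is a minimal sketch on the Iris dataset (an assumption for illustration); the conversion is the standard covariance-to-correlation formula, not something scikit-learn provides for LDA.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# store_covariance=True makes covariance_ available after fitting
lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
cov = lda.covariance_

# Convert covariance to correlation: corr[i, j] = cov[i, j] / (std[i] * std[j])
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)

print("Within-class correlation matrix:")
print(np.round(corr, 2))
```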
Issues to consider:
- There’s a trade-off between memory usage and access to the covariance matrix
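- With the default svd solver, the parameter doesn’t affect predictions, only memory usage and whether covariance_ is available; the lsqr and eigen solvers always compute and store the covariance matrix
- The covariance matrix has shape (n_features, n_features), so storing it may cause memory issues for datasets with a very large number of features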
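These points can be checked with a short sketch. It compares predictions from two svd-based models that differ only in store_covariance, then fits a model with solver="lsqr", which according to the scikit-learn documentation always computes and stores the covariance matrix. The Iris dataset is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Default solver is "svd": store_covariance only controls whether covariance_ exists
lda_false = LinearDiscriminantAnalysis(store_covariance=False).fit(X, y)
lda_true = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)

same_predictions = np.array_equal(lda_false.predict(X), lda_true.predict(X))
print("Identical predictions:", same_predictions)

# With "lsqr", covariance_ is stored even though store_covariance is left at False
lda_lsqr = LinearDiscriminantAnalysis(solver="lsqr").fit(X, y)
print("svd,  store_covariance=False has covariance_:", hasattr(lda_false, "covariance_"))
print("lsqr, store_covariance=False has covariance_:", hasattr(lda_lsqr, "covariance_"))
```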