Scikit-Learn "OrthogonalMatchingPursuit" versus "OrthogonalMatchingPursuitCV"

Orthogonal Matching Pursuit (OMP) is a greedy algorithm for sparse linear modeling. It fits a linear model subject to a limit on the number of non-zero coefficients, which is useful in high-dimensional settings where feature selection is critical.

In scikit-learn, the OrthogonalMatchingPursuit class provides an implementation of this algorithm. Key hyperparameters include n_nonzero_coefs (the target number of non-zero coefficients) and tol (the maximum squared norm of the residual; when set, it overrides n_nonzero_coefs). If both are left as None, n_nonzero_coefs defaults to 10% of the number of features, with a minimum of 1. These hyperparameters must be set manually, based on domain knowledge and the specific use case.
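The two stopping criteria can be compared side by side. The following is a minimal sketch, assuming a small synthetic dataset; the dataset sizes and the tol value of 50.0 are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit

# Illustrative data (assumed sizes): 20 features, 5 of which carry signal
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=0.1, random_state=0)

# Criterion 1: stop after a fixed number of features has been selected
omp_k = OrthogonalMatchingPursuit(n_nonzero_coefs=5).fit(X, y)
n_selected_k = int(np.count_nonzero(omp_k.coef_))

# Criterion 2: stop once the squared norm of the residual drops to tol
# (tol=50.0 is an arbitrary illustrative threshold)
omp_tol = OrthogonalMatchingPursuit(tol=50.0).fit(X, y)
n_selected_tol = int(np.count_nonzero(omp_tol.coef_))

print(f"n_nonzero_coefs=5 -> {n_selected_k} non-zero coefficients")
print(f"tol=50.0          -> {n_selected_tol} non-zero coefficients")
```

With n_nonzero_coefs the sparsity is fixed in advance; with tol the sparsity depends on how quickly the residual shrinks on the given data.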

On the other hand, OrthogonalMatchingPursuitCV automates the hyperparameter tuning process through cross-validation. Its key hyperparameters include cv (the cross-validation splitting strategy, for example the number of folds) and max_iter (the maximum number of iterations, which caps the number of non-zero coefficients considered). It selects the number of non-zero coefficients that minimizes the cross-validated error and exposes the result as the n_nonzero_coefs_ attribute.
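As a rough illustration of how cv and max_iter interact, the sketch below fits OrthogonalMatchingPursuitCV on an assumed synthetic dataset. The dataset sizes and max_iter=15 are hypothetical values picked for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuitCV

# Illustrative data (assumed sizes): 50 features, 8 of which carry signal
X, y = make_regression(n_samples=300, n_features=50, n_informative=8,
                       noise=0.1, random_state=0)

# cv=5 requests 5-fold cross-validation; max_iter=15 caps the largest
# number of non-zero coefficients that the search will consider
ompcv = OrthogonalMatchingPursuitCV(cv=5, max_iter=15).fit(X, y)

print(f"Selected number of non-zero coefficients: {ompcv.n_nonzero_coefs_}")
print(f"Active features in the final fit: {ompcv.n_iter_}")
```

If the selected value equals max_iter, the cap may be too tight and a larger max_iter could be worth trying.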

The main difference between these two classes is that OrthogonalMatchingPursuit requires manual tuning of hyperparameters, while OrthogonalMatchingPursuitCV performs this tuning automatically using cross-validation. However, OrthogonalMatchingPursuitCV is computationally more expensive due to the cross-validation process.

OrthogonalMatchingPursuit is ideal for scenarios where the number of non-zero coefficients is known or can be estimated, allowing for faster model fitting. OrthogonalMatchingPursuitCV, on the other hand, is preferred when the sparsity level is unknown and has to be chosen from the data to achieve good model performance.
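One way to see the difference between the two scenarios is to compare the features each estimator keeps against the truly informative ones. This is a sketch under stated assumptions: the data comes from make_regression with coef=True so that the ground-truth support is available, which is rarely the case with real data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit, OrthogonalMatchingPursuitCV

# coef=True also returns the ground-truth coefficients of the generating model
X, y, true_coef = make_regression(n_samples=500, n_features=50, n_informative=5,
                                  noise=0.1, coef=True, random_state=0)
true_support = set(np.flatnonzero(true_coef).tolist())

# Sparsity assumed known: pass it directly
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5).fit(X, y)
omp_support = set(np.flatnonzero(omp.coef_).tolist())

# Sparsity assumed unknown: let cross-validation choose it
ompcv = OrthogonalMatchingPursuitCV(cv=5).fit(X, y)
ompcv_support = set(np.flatnonzero(ompcv.coef_).tolist())

print(f"True informative features: {sorted(true_support)}")
print(f"OMP selected:              {sorted(omp_support)}")
print(f"OMPCV selected:            {sorted(ompcv_support)}")
```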

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import OrthogonalMatchingPursuit, OrthogonalMatchingPursuitCV
from sklearn.metrics import mean_squared_error

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit and evaluate OrthogonalMatchingPursuit with manually set hyperparameters
# (note: when tol is not None, it overrides n_nonzero_coefs)
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10, tol=1e-4)
omp.fit(X_train, y_train)
y_pred_omp = omp.predict(X_test)
print(f"OrthogonalMatchingPursuit MSE: {mean_squared_error(y_test, y_pred_omp):.3f}")

# Fit and evaluate OrthogonalMatchingPursuitCV with cross-validation
ompcv = OrthogonalMatchingPursuitCV(cv=5)
ompcv.fit(X_train, y_train)
y_pred_ompcv = ompcv.predict(X_test)
print(f"\nOrthogonalMatchingPursuitCV MSE: {mean_squared_error(y_test, y_pred_ompcv):.3f}")
print(f"Best number of non-zero coefficients: {ompcv.n_nonzero_coefs_}")

Running the example gives an output like:

OrthogonalMatchingPursuit MSE: 0.011

OrthogonalMatchingPursuitCV MSE: 0.010
Best number of non-zero coefficients: 10

The steps are as follows:

1. Generate a synthetic regression dataset using make_regression.
2. Split the data into training and test sets using train_test_split.
3. Instantiate OrthogonalMatchingPursuit with n_nonzero_coefs set to 10 and tol set to 1e-4 (because tol is set, it takes precedence over n_nonzero_coefs), fit it on the training data, and evaluate its performance on the test set.
4. Instantiate OrthogonalMatchingPursuitCV with 5-fold cross-validation, fit it on the training data, and evaluate its performance on the test set.
5. Compare the test set performance (mean squared error) of both models and print the number of non-zero coefficients selected by OrthogonalMatchingPursuitCV.
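Since scikit-learn documents that tol, when not None, overrides n_nonzero_coefs, it can be worth checking how many coefficients a fitted model actually keeps. The sketch below regenerates the same dataset as the example and counts non-zero coefficients with and without tol; the variable names are hypothetical and no particular counts are claimed here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import OrthogonalMatchingPursuit

# Same synthetic dataset and split as in the example above
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10,
                       noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Only n_nonzero_coefs is set, so it controls the sparsity
omp_fixed = OrthogonalMatchingPursuit(n_nonzero_coefs=10).fit(X_train, y_train)
n_fixed = int(np.count_nonzero(omp_fixed.coef_))

# Both are set: tol takes precedence and n_nonzero_coefs is ignored
omp_both = OrthogonalMatchingPursuit(n_nonzero_coefs=10, tol=1e-4).fit(X_train, y_train)
n_both = int(np.count_nonzero(omp_both.coef_))

print(f"n_nonzero_coefs=10 only:       {n_fixed} non-zero coefficients")
print(f"n_nonzero_coefs=10, tol=1e-4:  {n_both} non-zero coefficients")
```

If the second count is well above 10, the model in the example is less sparse than its n_nonzero_coefs argument suggests, and dropping tol would enforce the intended sparsity.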

See Also