OrthogonalMatchingPursuit is a powerful algorithm for sparse linear modeling. It fits a linear model with constraints on the number of non-zero coefficients, which can be useful in high-dimensional spaces where feature selection is critical.
In scikit-learn, the OrthogonalMatchingPursuit class provides an implementation of this algorithm. Key hyperparameters include n_nonzero_coefs (the target number of non-zero coefficients) and tol (the maximum squared norm of the residual, used as a stopping criterion; when it is set, it overrides n_nonzero_coefs). These hyperparameters must be set manually based on domain knowledge and the specific use case.
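To make the two stopping rules concrete, here is a minimal sketch that fits the same data once with a fixed sparsity level and once with a residual threshold. The dataset sizes and the choice of tol (1% of the centered target's energy) are arbitrary assumptions for illustration, not recommended settings.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit

# Small synthetic problem: 50 features, only 5 of which are informative
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=0.1, random_state=0)

# Option 1: cap the number of non-zero coefficients directly
omp_k = OrthogonalMatchingPursuit(n_nonzero_coefs=5).fit(X, y)
n_selected_k = int(np.count_nonzero(omp_k.coef_))

# Option 2: stop once the squared residual norm drops below tol.
# The threshold (1% of the centered target's energy) is an arbitrary choice.
tol = 0.01 * float(np.sum((y - y.mean()) ** 2))
omp_tol = OrthogonalMatchingPursuit(tol=tol).fit(X, y)
n_selected_tol = int(np.count_nonzero(omp_tol.coef_))
residual_sq_norm = float(np.sum((y - omp_tol.predict(X)) ** 2))

print(f"n_nonzero_coefs=5 -> {n_selected_k} non-zero coefficients")
print(f"tol={tol:.1f} -> {n_selected_tol} non-zero coefficients, "
      f"squared residual norm {residual_sq_norm:.1f}")
```

With n_nonzero_coefs the sparsity is fixed in advance; with tol the number of selected features depends on how quickly the residual shrinks.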
On the other hand, OrthogonalMatchingPursuitCV automates the hyperparameter tuning process through cross-validation. Its key hyperparameters include cv (the cross-validation splitting strategy, for example the number of folds) and max_iter (the maximum number of iterations, and therefore the largest number of non-zero coefficients to consider). This approach selects the number of non-zero coefficients automatically, choosing the value with the lowest cross-validated mean squared error.
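As a small illustration of the cross-validated search, the sketch below bounds the search with max_iter and reads back the chosen sparsity from the fitted n_nonzero_coefs_ attribute. The dataset and the value max_iter=15 are assumptions chosen only for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuitCV

# Synthetic data with 8 informative features out of 60
X, y = make_regression(n_samples=300, n_features=60, n_informative=8,
                       noise=0.5, random_state=0)

# Search sparsity levels from 1 up to max_iter with 5-fold cross-validation
ompcv = OrthogonalMatchingPursuitCV(cv=5, max_iter=15).fit(X, y)

best_k = int(ompcv.n_nonzero_coefs_)
print(f"Selected number of non-zero coefficients: {best_k}")
```

If max_iter is left at its default, the search range is derived from the number of features, so raising it explicitly can matter when more features are expected to be relevant.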
The main difference between these two classes is that OrthogonalMatchingPursuit requires manual tuning of hyperparameters, while OrthogonalMatchingPursuitCV performs this tuning automatically using cross-validation. However, OrthogonalMatchingPursuitCV is computationally more expensive due to the cross-validation process.
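One way to see the relationship between the two classes: in current scikit-learn releases, the CV estimator appears to pick a sparsity level across the folds and then refit a plain OrthogonalMatchingPursuit on the full data. The sketch below checks this on synthetic data; treat the equivalence as an observation about the implementation rather than a documented guarantee.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import (
    OrthogonalMatchingPursuit,
    OrthogonalMatchingPursuitCV,
)

X, y = make_regression(n_samples=300, n_features=60, n_informative=8,
                       noise=0.5, random_state=1)

# Let cross-validation choose the sparsity level
ompcv = OrthogonalMatchingPursuitCV(cv=5).fit(X, y)

# Refit the plain estimator with the sparsity level the CV search selected
omp = OrthogonalMatchingPursuit(
    n_nonzero_coefs=int(ompcv.n_nonzero_coefs_)
).fit(X, y)

same_coefs = bool(np.allclose(omp.coef_, ompcv.coef_))
print(f"Coefficients match: {same_coefs}")
```

The extra cost of the CV variant comes from fitting the path on every fold before this final refit.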
OrthogonalMatchingPursuit is ideal for scenarios where the number of non-zero coefficients is known or can be estimated, allowing for faster model fitting. OrthogonalMatchingPursuitCV, on the other hand, is preferred when hyperparameter tuning is required to achieve better model performance.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import OrthogonalMatchingPursuit, OrthogonalMatchingPursuitCV
from sklearn.metrics import mean_squared_error
# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit and evaluate OrthogonalMatchingPursuit with a fixed number of non-zero coefficients
# (tol is left at its default of None, since a non-None tol overrides n_nonzero_coefs)
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10)
omp.fit(X_train, y_train)
y_pred_omp = omp.predict(X_test)
print(f"OrthogonalMatchingPursuit MSE: {mean_squared_error(y_test, y_pred_omp):.3f}")
# Fit and evaluate OrthogonalMatchingPursuitCV with cross-validation
ompcv = OrthogonalMatchingPursuitCV(cv=5)
ompcv.fit(X_train, y_train)
y_pred_ompcv = ompcv.predict(X_test)
print(f"\nOrthogonalMatchingPursuitCV MSE: {mean_squared_error(y_test, y_pred_ompcv):.3f}")
print(f"Best number of non-zero coefficients: {ompcv.n_nonzero_coefs_}")
Running the example gives an output like:
OrthogonalMatchingPursuit MSE: 0.011
OrthogonalMatchingPursuitCV MSE: 0.010
Best number of non-zero coefficients: 10
The steps are as follows:
- Generate a synthetic regression dataset using make_regression.
- Split the data into training and test sets using train_test_split.
- Instantiate OrthogonalMatchingPursuit with n_nonzero_coefs set to 10 and default tol, fit it on the training data, and evaluate its performance on the test set.
- Instantiate OrthogonalMatchingPursuitCV with 5-fold cross-validation, fit it on the training data, and evaluate its performance on the test set.
- Compare the test set performance (mean squared error) of both models and print the best number of non-zero coefficients found by OrthogonalMatchingPursuitCV.
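Because the motivation for OMP is feature selection, it can also help to look at which features the fitted model kept. The following sketch is an optional extension of the example above, not part of the original walkthrough: it asks make_regression to return the ground-truth coefficients (coef=True) and compares them with the non-zero entries of coef_.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import OrthogonalMatchingPursuit

# coef=True also returns the ground-truth coefficients used to generate y
X, y, true_coef = make_regression(n_samples=1000, n_features=100,
                                  n_informative=10, noise=0.1,
                                  coef=True, random_state=42)

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10).fit(X, y)

# Indices of features the model kept vs. features that truly drive y
selected = np.flatnonzero(omp.coef_)
informative = np.flatnonzero(true_coef)
overlap = np.intersect1d(selected, informative)

print(f"Selected features:    {selected.tolist()}")
print(f"Informative features: {informative.tolist()}")
print(f"Correctly recovered:  {len(overlap)} of {len(informative)}")
```

On real data the true support is unknown, so this kind of check is only possible on synthetic problems.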