SKLearner Home | About | Contact | Examples

Scikit-Learn SparseRandomProjection for Dimensionality Reduction

Dimensionality reduction is crucial when dealing with high-dimensional data. SparseRandomProjection is an efficient method to reduce dimensions while preserving the distances between points, making it suitable for preprocessing high-dimensional data before applying other machine learning algorithms.

The key hyperparameters of SparseRandomProjection include n_components, which specifies the number of dimensions to project to, and eps, a parameter related to the Johnson-Lindenstrauss lemma that influences the quality of the projection.

The algorithm is appropriate for preprocessing tasks in both classification and regression problems.

from sklearn.datasets import make_classification
from sklearn.random_projection import SparseRandomProjection
import matplotlib.pyplot as plt

# generate a high-dimensional dataset
X, y = make_classification(n_samples=100, n_features=50, n_classes=2, random_state=1)

# apply sparse random projection
transformer = SparseRandomProjection(n_components=2, random_state=1)
X_new = transformer.fit_transform(X)

# plot the transformed data
plt.scatter(X_new[:, 0], X_new[:, 1], c=y, cmap='viridis')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.title('Sparse Random Projection')
plt.show()

Running the example gives an output like:

Scikit-Learn SparseRandomProjection plot

  1. First, a synthetic high-dimensional dataset is generated using the make_classification() function. This creates a dataset with 100 samples and 50 features, split into two classes. The dataset is created with a fixed random seed (random_state) for reproducibility.

  2. Next, a SparseRandomProjection instance is created with n_components set to 2, indicating the number of dimensions to project the data onto. The high-dimensional data is transformed into 2 dimensions using the fit_transform() method.

  3. Finally, the transformed data is visualized using a scatter plot, with the data points colored by their class labels. This visualization helps to see the effect of the dimensionality reduction on the dataset.

This example shows how to use SparseRandomProjection for reducing the dimensionality of high-dimensional data, making it easier to visualize and further process for machine learning tasks.



See Also