
Scikit-Learn SpectralBiclustering Model

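Spectral Biclustering is a clustering algorithm that finds biclusters in a data matrix: subsets of rows and columns whose entries follow a coherent pattern. Scikit-learn's SpectralBiclustering assumes the matrix has a hidden checkerboard structure, in which every row belongs to one row cluster, every column belongs to one column cluster, and each pairing of a row cluster with a column cluster forms a bicluster. The algorithm tolerates moderate noise, but it does not accept missing values, so these must be imputed or removed before fitting.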

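Key hyperparameters include n_clusters (the number of row and column clusters in the checkerboard: a single integer is used for both, while a tuple sets them separately), method (how the matrix is normalized before its singular vectors are computed: 'bistochastic', 'scale', or 'log'), and n_init (the number of random initializations tried by the k-means step). The SVD solver itself is selected with a separate parameter, svd_method.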
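As a rough illustration of these hyperparameters, the sketch below fits the model once per normalization method on a small synthetic checkerboard. The dataset from make_checkerboard(), its shape, and the (3, 2) cluster counts are assumptions chosen for the illustration, not settings taken from the example on this page.

```python
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard

# Assumed toy data: 60 x 40 matrix with 3 row clusters and 2 column clusters
data, rows, columns = make_checkerboard(
    shape=(60, 40), n_clusters=(3, 2), noise=1.0, random_state=0
)

labels_by_method = {}
for method in ("bistochastic", "scale", "log"):
    # A tuple for n_clusters sets the row and column cluster counts separately
    model = SpectralBiclustering(
        n_clusters=(3, 2), method=method, n_init=10, random_state=0
    )
    model.fit(data)
    labels_by_method[method] = (model.row_labels_, model.column_labels_)
    print(method, "row labels:", model.row_labels_[:10])
```

Note that the 'log' method takes the logarithm of the data, so it is only appropriate for strictly positive matrices such as this one.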

This algorithm is suitable for applications such as text mining, bioinformatics, and recommendation systems where discovering hidden patterns in data matrices is crucial.

from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralBiclustering
from sklearn.metrics import consensus_score

# generate a synthetic dataset with 5 block-diagonal biclusters
data, rows, columns = make_biclusters(shape=(300, 300), n_clusters=5, noise=0.1, random_state=0)

# create model
model = SpectralBiclustering(n_clusters=5, method='bistochastic', n_init=10, random_state=0)

# fit model
model.fit(data)

# evaluate model
score = consensus_score(model.biclusters_, (rows, columns))
print('Consensus score: %.3f' % score)

# retrieve the row and column cluster labels from the fitted model
row_labels = model.row_labels_
column_labels = model.column_labels_
print('Row labels:', row_labels[:10])
print('Column labels:', column_labels[:10])

Running the example gives an output like:

Consensus score: 0.200
Row labels: [3 3 1 3 4 2 4 3 0 0]
Column labels: [3 1 1 0 0 4 1 2 2 2]
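Beyond the labels printed above, the fitted model also exposes the biclusters themselves. The sketch below repeats the fit from the example and then inspects one bicluster with the get_indices(), get_shape(), and get_submatrix() methods; the choice of bicluster 0 is arbitrary and only for illustration.

```python
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_biclusters

# Same data and model settings as the example above
data, rows, columns = make_biclusters(
    shape=(300, 300), n_clusters=5, noise=0.1, random_state=0
)
model = SpectralBiclustering(
    n_clusters=5, method="bistochastic", n_init=10, random_state=0
)
model.fit(data)

# biclusters_ is a pair of boolean indicator arrays, one row per bicluster.
# With n_clusters=5 the checkerboard has 5 x 5 = 25 biclusters.
found_rows, found_columns = model.biclusters_
print("Biclusters found:", found_rows.shape[0])

# Inspect an arbitrary bicluster (index 0)
row_idx, col_idx = model.get_indices(0)
block_shape = model.get_shape(0)
block = model.get_submatrix(0, data)
print("Shape of bicluster 0:", block_shape)
```

The indices returned by get_indices() refer to rows and columns of the original matrix, so they can be used to look up the corresponding samples and features.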

The steps are as follows:

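  1. First, a synthetic dataset is generated using the make_biclusters() function. It returns a matrix of the requested shape (shape) containing n_clusters block-diagonal biclusters, together with two boolean indicator arrays (rows and columns) that mark which rows and columns belong to each true bicluster. Gaussian noise is added to simulate real-world conditions (noise is its standard deviation), and random_state fixes the seed for reproducibility.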

  2. Next, a SpectralBiclustering model is instantiated with specified hyperparameters such as n_clusters, method, and n_init. The model is then fit on the synthetic data using the fit() method.

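  3. The performance of the model is evaluated using the consensus score, which measures the agreement between the true biclusters and those found by the model. consensus_score() pairs up the two sets of biclusters in the best possible way, sums the Jaccard similarities of the matched pairs, and divides by the size of the larger set, so 1.0 means a perfect match. Here n_clusters=5 makes the model partition the matrix into a 5 x 5 checkerboard of 25 biclusters, while make_biclusters() produced only 5, so the score cannot exceed 5/25 = 0.2. The reported 0.200 therefore means the 5 true biclusters were recovered essentially perfectly; the low value reflects the mismatch between block-diagonal data and a checkerboard model, not a poor fit.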

  4. The row and column labels are obtained from the row_labels_ and column_labels_ attributes of the fitted model, respectively. Each row is assigned to exactly one row cluster and each column to exactly one column cluster, and a bicluster is the intersection of one row cluster with one column cluster.
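The steps above can also be applied to data that actually has the checkerboard structure SpectralBiclustering assumes. The following is a minimal sketch under stated assumptions: make_checkerboard() replaces make_biclusters(), and the (4, 3) cluster counts and noise level are illustrative choices rather than values from the example above. Because the number of found and true biclusters now matches, the consensus score is no longer capped below 1.

```python
import numpy as np
from sklearn.cluster import SpectralBiclustering
from sklearn.datasets import make_checkerboard
from sklearn.metrics import consensus_score

# Assumed settings: 4 row clusters x 3 column clusters = 12 biclusters
n_clusters = (4, 3)
data, rows, columns = make_checkerboard(
    shape=(300, 300), n_clusters=n_clusters, noise=1.0, random_state=0
)

model = SpectralBiclustering(
    n_clusters=n_clusters, method="bistochastic", n_init=10, random_state=0
)
model.fit(data)

# Both sets now contain 12 biclusters, so a perfect match would score 1.0
score = consensus_score(model.biclusters_, (rows, columns))
print("Consensus score: %.3f" % score)

# Sort rows and columns by cluster label to make the checkerboard visible
row_order = np.argsort(model.row_labels_)
column_order = np.argsort(model.column_labels_)
reordered = data[row_order][:, column_order]
```

Plotting reordered as an image (for example with matplotlib's matshow) is a common way to check visually whether the recovered blocks are coherent.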


