Scikit-Learn make_biclusters() Dataset

The make_biclusters function from scikit-learn is used to create a synthetic dataset with biclusters, useful for testing biclustering algorithms.

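Each bicluster pairs a subset of rows with a subset of columns whose entries share one constant value. Once rows and columns are grouped by bicluster, the matrix is block diagonal, and entries outside the blocks are zero plus any added noise.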

Key function arguments include shape, a (n_rows, n_cols) tuple that sets the matrix dimensions; n_clusters, the number of biclusters; and noise, the standard deviation of the Gaussian noise added to every entry.
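
To make the return values concrete, here is a minimal sketch on a small matrix. The shape (20, 30), n_clusters=3 and random_state=0 are illustrative choices that do not come from the example below; shuffle=False and noise=0.0 are used only so that the raw block structure stays visible.

```python
import numpy as np
from sklearn.datasets import make_biclusters

# Illustrative settings: no noise and no shuffling keep the blocks easy to read
data, rows, columns = make_biclusters(
    shape=(20, 30), n_clusters=3, noise=0.0, shuffle=False, random_state=0
)

print(data.shape)     # the matrix itself: (20, 30)
print(rows.shape)     # one boolean row indicator per bicluster: (3, 20)
print(columns.shape)  # one boolean column indicator per bicluster: (3, 30)

# Every row and every column is assigned to exactly one bicluster
print((rows.sum(axis=0) == 1).all(), (columns.sum(axis=0) == 1).all())

# Without noise, each bicluster holds a single constant value
for k in range(rows.shape[0]):
    block = data[np.ix_(rows[k], columns[k])]
    print(k, block.shape, np.unique(block))
```

With noise greater than zero, the same blocks would vary around their constant, and the entries outside them would vary around zero.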

This example shows how to generate and explore a bicluster dataset. Its block-diagonal structure is best suited to Spectral Co-Clustering; Spectral Biclustering can also be run on it, although that algorithm assumes a checkerboard structure (see make_checkerboard).

from sklearn.datasets import make_biclusters
import matplotlib.pyplot as plt

# Generate the bicluster dataset
data, rows, columns = make_biclusters(shape=(300, 300), n_clusters=5, noise=0.1, random_state=42)

# Display dataset shape
print(f"Dataset shape: {data.shape}")

# Preview the top-left 5x5 block of values (raw entries, not aggregate statistics)
print(f"Summary statistics:\n{data[:5, :5]}")

# Plot the dataset
plt.matshow(data, cmap=plt.cm.Blues)
plt.title("Bicluster Dataset")
plt.colorbar()
plt.show()

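# Keep the matrix as X and the row indicators as y.
# rows is a boolean array of shape (n_clusters, n_rows): rows[i, j] is True when row j belongs to bicluster i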
X = data
y = rows
print(f"Input shape: {X.shape}")
print(f"Cluster labels shape: {y.shape}")

Running the example gives an output like:

Dataset shape: (300, 300)
Summary statistics:
[[-9.85836962e-02  6.39047939e+01  1.74070831e-02 -5.83040451e-02
  -3.47657187e-04]
 [ 2.40573536e+01  1.37372720e-01 -9.18707127e-02 -9.59893970e-02
   6.55866384e-02]
 [ 3.08499595e-02  1.04896532e-01  7.59162172e+01 -1.85525740e-01
   7.59096263e+01]
 [-6.59746209e-03  2.07563754e-01  7.42859480e-02 -7.47441682e-03
  -4.47431808e-03]
 [ 9.29741673e-03  6.39480030e+01 -7.64993047e-02 -2.38614490e-02
   9.62999294e-02]]

Input shape: (300, 300)
Cluster labels shape: (5, 300)
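
Because make_biclusters shuffles rows and columns by default, the printed values look scattered. The sketch below is one way to recover the block-diagonal layout from the returned indicators. The variable names (row_labels, row_order, reordered, in_block) are illustrative, not part of the scikit-learn API.

```python
import numpy as np
from sklearn.datasets import make_biclusters

data, rows, columns = make_biclusters(
    shape=(300, 300), n_clusters=5, noise=0.1, random_state=42
)

# Collapse the boolean indicators into one integer label per row / column
row_labels = np.argmax(rows, axis=0)
col_labels = np.argmax(columns, axis=0)

# A stable sort groups rows and columns that share a bicluster
row_order = np.argsort(row_labels, kind="stable")
col_order = np.argsort(col_labels, kind="stable")
reordered = data[row_order][:, col_order]

# Mark entries whose row and column belong to the same bicluster
in_block = row_labels[row_order][:, None] == col_labels[col_order][None, :]
print(f"Smallest in-block value: {reordered[in_block].min():.2f}")
print(f"Largest off-block magnitude: {np.abs(reordered[~in_block]).max():.2f}")
```

Passing reordered to plt.matshow in place of data should show the five blocks along the diagonal.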

The example works through the following steps:

Import make_biclusters from sklearn.datasets, and import matplotlib.pyplot as plt:
- These are used to create and visualize the synthetic bicluster dataset.
Generate the bicluster dataset using make_biclusters():
- Set shape=(300, 300) for a 300x300 matrix.
- Define n_clusters=5 for creating 5 biclusters.
- Use noise=0.1 to add Gaussian noise with a standard deviation of 0.1, which is small relative to the bicluster values.
- Set random_state=42 for reproducibility.
Print the dataset shape:
- Confirm the dimensions using data.shape.
Preview the data:
- Display the top-left 5x5 block of values using data[:5, :5]; these are raw entries rather than aggregate statistics.
Plot the dataset:
- Visualize the biclusters with plt.matshow(data, cmap=plt.cm.Blues).
- Add a colorbar and title for clarity.
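Split the dataset into input and cluster labels:
- Assign the bicluster matrix to X and the row indicators to y. rows is a boolean array with one row per bicluster, which is why y has shape (5, 300); the matching column indicators are in columns.
- Print the shapes of X and y to confirm the split.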
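
The 5x5 preview shows raw values only. As a complement, here is a minimal sketch of aggregate statistics computed separately for entries inside and outside the biclusters. The inside mask is a helper built for this illustration from the returned indicators; it is not something make_biclusters returns.

```python
import numpy as np
from sklearn.datasets import make_biclusters

data, rows, columns = make_biclusters(
    shape=(300, 300), n_clusters=5, noise=0.1, random_state=42
)

# inside[r, c] is True when row r and column c belong to the same bicluster
inside = (rows.T.astype(int) @ columns.astype(int)) > 0

print(f"Overall: mean={data.mean():.3f}, std={data.std():.3f}")
print(f"Fraction of entries inside a bicluster: {inside.mean():.3f}")
print(f"Inside:  min={data[inside].min():.3f}, max={data[inside].max():.3f}")
print(f"Outside: mean={data[~inside].mean():.4f}, std={data[~inside].std():.4f}")
```

The outside entries are pure noise, so their standard deviation should be close to the noise argument, while the inside entries sit near the per-bicluster constants.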

This example demonstrates how to generate a synthetic bicluster dataset with scikit-learn’s make_biclusters() function, inspect its shape, preview its values, and visualize the biclusters. This sets the stage for applying biclustering algorithms and further analysis.
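
As one possible next step, the sketch below fits SpectralCoclustering to the same dataset and compares the recovered biclusters with the generated ones using consensus_score. The choice of algorithm and of random_state=0 for the model are assumptions made for illustration, not part of the example above.

```python
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters
from sklearn.metrics import consensus_score

data, rows, columns = make_biclusters(
    shape=(300, 300), n_clusters=5, noise=0.1, random_state=42
)

# Fit a co-clustering model with the same number of biclusters as generated
model = SpectralCoclustering(n_clusters=5, random_state=0)
model.fit(data)

# consensus_score compares two sets of biclusters; 1.0 means a perfect match
score = consensus_score(model.biclusters_, (rows, columns))
print(f"Consensus score: {score:.3f}")
```

After fitting, model.row_labels_ and model.column_labels_ hold one integer label per row and per column, which can be used to reorder the matrix for plotting.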
