Scikit-Learn LabelPropagation Model

LabelPropagation is a semi-supervised learning algorithm that propagates labels from labeled instances to unlabeled ones based on their similarity. It’s useful when you have a small amount of labeled data and a large amount of unlabeled data.
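
To make the idea of similarity-based propagation concrete, here is a simplified NumPy sketch. It is not scikit-learn's implementation: the function name propagate_labels, the toy data, and the fixed iteration count are assumptions chosen for illustration. It builds an RBF similarity matrix, repeatedly averages label scores over similar samples, and resets the known labels after every step.

```python
import numpy as np

def propagate_labels(X, y, gamma=20.0, n_iter=100):
    """Toy label propagation (illustrative only); y uses -1 for unlabeled samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y[y != -1])
    labeled = y != -1

    # RBF similarity between every pair of samples
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-gamma * sq_dists)
    # row-normalize so each row holds weights that sum to 1
    T = W / W.sum(axis=1, keepdims=True)

    # one-hot label scores; unlabeled rows start at zero
    F = np.zeros((len(X), len(classes)))
    F[labeled] = (y[labeled][:, None] == classes[None, :]).astype(float)
    clamp = F[labeled].copy()

    for _ in range(n_iter):
        F = T @ F            # each sample takes a weighted average of its neighbors
        F[labeled] = clamp   # keep the known labels fixed

    return classes[F.argmax(axis=1)]

# two well-separated 1-D clusters, one labeled sample in each
X_toy = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_toy = np.array([0, -1, -1, 1, -1, -1])
labels = propagate_labels(X_toy, y_toy)
print(labels)  # [0 0 0 1 1 1]
```

Each unlabeled point ends up with the label of the labeled point in its own cluster, because similarity across the two clusters is close to zero.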

The key hyperparameters of LabelPropagation include kernel (the similarity function, either 'rbf' or 'knn'), gamma (the coefficient of the RBF kernel), n_neighbors (the number of neighbors used by the knn kernel), and max_iter (the maximum number of iterations).
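
As a rough illustration of how these hyperparameters are passed, the sketch below configures one model per kernel and fits both on the same partially labeled data. The specific values and the variable names rbf_model and knn_model are assumptions for demonstration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

# small synthetic dataset with most labels hidden (-1 marks unlabeled samples)
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_clusters_per_class=1, random_state=42)
y_partial = y.copy()
y_partial[:150] = -1

# RBF kernel: gamma controls how quickly similarity decays with distance
rbf_model = LabelPropagation(kernel='rbf', gamma=20, max_iter=1000)
rbf_model.fit(X, y_partial)

# knn kernel: n_neighbors controls how many neighbors each sample is connected to
knn_model = LabelPropagation(kernel='knn', n_neighbors=7, max_iter=1000)
knn_model.fit(X, y_partial)

rbf_pred = rbf_model.predict(X[:5])
knn_pred = knn_model.predict(X[:5])
print('rbf predictions:', rbf_pred)
print('knn predictions:', knn_pred)
```

Note that gamma only affects the 'rbf' kernel and n_neighbors only affects the 'knn' kernel, so only the setting that matches the chosen kernel needs tuning.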

The algorithm is appropriate for semi-supervised classification problems.

import numpy as np
from sklearn.semi_supervised import LabelPropagation
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# generate a synthetic dataset with labeled and unlabeled samples
X, y = make_classification(n_samples=500, n_features=5, n_classes=2, n_clusters_per_class=1, n_informative=2, random_state=42)
y[:400] = -1  # mark most samples as unlabeled (-1 is the unlabeled marker)

# split the labeled data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X[400:], y[400:], test_size=0.2, random_state=42)

# combine the unlabeled samples with the labeled training samples;
# the test samples are held out so their labels do not leak into fitting
X_fit = np.concatenate([X[:400], X_train])
y_fit = np.concatenate([y[:400], y_train])

# create model and fit on labeled and unlabeled data
model = LabelPropagation()
model.fit(X_fit, y_fit)

# evaluate model on test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {accuracy:.3f}')

# make a prediction on a new, unlabeled sample
new_sample = [[0.1, 0.5, 0.3, 0.5, 0.9]]
prediction = model.predict(new_sample)
print(f'Predicted label for new sample: {prediction[0]}')

Running the example prints output in the following form (exact values may differ):

Test Accuracy: 1.000
Predicted label for new sample: 1

The steps are as follows:

1. A synthetic dataset is generated using make_classification(). Most of the samples are then marked as unlabeled by setting their labels to -1, which simulates a semi-supervised learning scenario.
2. The labeled data is split into training and test sets using train_test_split().
3. A LabelPropagation model is instantiated with default hyperparameters and fit on labeled and unlabeled samples together.
4. The model's performance is evaluated on the test set by comparing the predicted labels to the true labels using the accuracy score metric.
5. Finally, the trained model is used to predict the label of a new, unlabeled sample.
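
Beyond predicting on held-out samples, it can be useful to check the labels the model assigned to the unlabeled samples it was fit on. The sketch below assumes a copy of the true labels (y_true, a name introduced here for illustration) is kept before masking, and reads the fitted transduction_ and label_distributions_ attributes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation

# same synthetic data as the main example, but keep the true labels for checking
X, y_true = make_classification(n_samples=500, n_features=5, n_classes=2,
                                n_clusters_per_class=1, n_informative=2,
                                random_state=42)
y = y_true.copy()
y[:400] = -1  # hide the labels of the first 400 samples

model = LabelPropagation()
model.fit(X, y)

# transduction_ holds the label assigned to every sample seen during fit
unlabeled = y == -1
transductive_acc = accuracy_score(y_true[unlabeled], model.transduction_[unlabeled])
print(f'Accuracy on originally unlabeled samples: {transductive_acc:.3f}')

# label_distributions_ holds the per-class scores behind those assignments
print(model.label_distributions_[:3])
```

The hidden labels are never passed to fit(), so this score reflects how well labels propagated to the unlabeled samples rather than how well the model memorized known ones.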

This example demonstrates how to use LabelPropagation for semi-supervised classification tasks, where a small amount of labeled data can be leveraged to classify a larger amount of unlabeled data. The model learns from the similarity between samples to propagate labels from labeled to unlabeled instances.

See Also