
Scikit-Learn fetch_lfw_people() Dataset

The LFW (Labeled Faces in the Wild) people dataset consists of images of faces collected from the web and is widely used for face recognition and image classification tasks. The images are labeled with the name of the person pictured.

Key arguments when loading the dataset include min_faces_per_person, which keeps only people who have at least that many pictures, and resize, a ratio that scales each face image down to reduce the computational load.
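To make the effect of resize concrete, the sketch below estimates the image shape and feature count for a given ratio. It assumes the loader's default crop of 125 × 94 pixels and that scaled dimensions are truncated to whole pixels; the helper name resized_shape is hypothetical and not part of scikit-learn.

```python
# Hypothetical helper: estimate the image shape produced for a given
# resize ratio. Assumes the default crop (125 x 94 pixels) and that
# scaled dimensions are truncated to integers.
DEFAULT_CROP = (125, 94)  # (height, width), assumed default crop


def resized_shape(resize, crop=DEFAULT_CROP):
    """Return the (height, width) of an image after applying the ratio."""
    height, width = crop
    return int(resize * height), int(resize * width)


h, w = resized_shape(0.4)
print(f"Image shape: {h} x {w}")
print(f"Features per image: {h * w}")
```

Under these assumptions, resize=0.4 gives 50 × 37 images, that is 1850 features per image, which matches the number of columns reported in the example output.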

This is a multi-class image classification problem. Commonly applied algorithms include Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), and Convolutional Neural Networks (CNN).
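As a rough illustration of the SVM route, the following sketch builds a scaling, PCA, and SVM pipeline. It is one possible setup rather than a recommended configuration: the helper name build_face_classifier, the component count, and the SVM settings are illustrative assumptions, and synthetic data stands in for the faces so the code runs without a download.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_face_classifier(n_components=50):
    """Scale pixels, reduce dimensionality with PCA, then fit an RBF SVM."""
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components, random_state=0),
        SVC(kernel="rbf", class_weight="balanced"),
    )


# Synthetic stand-in for flattened face images (not real LFW data)
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 40)
# Shift each class by a different offset so the classes are separable
X_demo = (rng.normal(size=(120, 200)) + 3.0 * y_demo[:, None]).astype(np.float32)

model = build_face_classifier(n_components=10).fit(X_demo, y_demo)
print(f"Training accuracy on synthetic data: {model.score(X_demo, y_demo):.2f}")
```

With the real dataset, lfw_people.data and lfw_people.target would replace the synthetic arrays, and accuracy should be measured on held-out data rather than on the training set.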

from sklearn.datasets import fetch_lfw_people
import matplotlib.pyplot as plt

# Fetch the dataset
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

# Display dataset shape and types
print(f"Dataset shape: {lfw_people.data.shape}")
print(f"Feature types: {lfw_people.data.dtype}")

# Show summary statistics
print(f"Number of classes: {len(lfw_people.target_names)}")
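# Convert NumPy scalars to plain Python types so the printed list stays readable
class_counts = [(str(name), int((lfw_people.target == idx).sum())) for idx, name in enumerate(lfw_people.target_names)]
print(f"Number of samples per class:\n{class_counts}")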

# Display the first few labels of the dataset
print(f"First few labels of the dataset:\n{lfw_people.target[:5]}")
print("First few images of the dataset:")

# Plot examples from the dataset
fig, axes = plt.subplots(1, 5, figsize=(15, 8), subplot_kw={'xticks':[], 'yticks':[]})
for i, ax in enumerate(axes):
    ax.imshow(lfw_people.images[i], cmap='gray')
    ax.set_title(lfw_people.target_names[lfw_people.target[i]])
plt.show()

Running the example gives an output like:

Dataset shape: (1288, 1850)
Feature types: float32
Number of classes: 7
Number of samples per class:
[('Ariel Sharon', 77), ('Colin Powell', 236), ('Donald Rumsfeld', 121), ('George W Bush', 530), ('Gerhard Schroeder', 109), ('Hugo Chavez', 71), ('Tony Blair', 144)]
First few labels of the dataset:
[5 6 3 1 0]
First few images of the dataset:

Figure: Scikit-Learn fetch_lfw_people plot (the first five face images, each titled with the person's name)
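The class counts in the example output are far from uniform. The short sketch below, which simply copies those counts into a dictionary, computes each person's share of the data and the accuracy of a trivial majority-class predictor.

```python
# Class counts copied from the example output
counts = {
    "Ariel Sharon": 77,
    "Colin Powell": 236,
    "Donald Rumsfeld": 121,
    "George W Bush": 530,
    "Gerhard Schroeder": 109,
    "Hugo Chavez": 71,
    "Tony Blair": 144,
}

total = sum(counts.values())
majority_name, majority_count = max(counts.items(), key=lambda item: item[1])
baseline_accuracy = majority_count / total

# Print each class with its count and share, largest class first
for name, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(f"{name:<18} {n:>4}  {n / total:6.1%}")
print(f"Majority-class baseline ({majority_name}): {baseline_accuracy:.1%}")
```

A classifier that always predicts the most frequent person would already be right about 41% of the time, so plain accuracy can be misleading on this subset.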

The steps are as follows:

  1. Import the fetch_lfw_people function from sklearn.datasets:

    • This function downloads the LFW people dataset on first use, caches it locally, and loads it into memory.
  2. Fetch the dataset using fetch_lfw_people():

    • Use min_faces_per_person=70 to include only those individuals with at least 70 pictures.
    • Use resize=0.4 to scale each image to 40% of its default width and height (50 × 37 pixels here, hence 1850 features), reducing computational load.
  3. Print the dataset shape and feature types:

    • Access the shape using lfw_people.data.shape, which is (n_samples, n_features) because each image is flattened into one row.
    • Show the data type of the features using lfw_people.data.dtype.
  4. Display summary statistics:

    • Print the number of classes using len(lfw_people.target_names).
    • Show the number of samples per class to understand the dataset distribution.
  5. Display the first few labels and plot the first few images of the dataset:

    • Print the initial labels using lfw_people.target[:5].
    • Plot the first few images with corresponding labels using matplotlib for a quick visual inspection.

This example demonstrates how to load and explore the LFW people dataset using scikit-learn’s fetch_lfw_people() function: you inspect the data’s shape, types, and class distribution, and visualize sample images. This sets the stage for further preprocessing and the application of image classification algorithms.
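One common next step is a train/test split that preserves the class proportions. The sketch below is a minimal example of that idea; the helper name stratified_split, the 25% test size, and the seed are illustrative assumptions, and a small synthetic, imbalanced label array stands in for lfw_people.target so the code runs offline.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split(X, y, test_size=0.25, seed=42):
    """Split features and labels while keeping class proportions similar."""
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )


# Synthetic, imbalanced stand-in for lfw_people.data / lfw_people.target
rng = np.random.default_rng(0)
y_demo = np.array([0] * 80 + [1] * 40 + [2] * 20)
X_demo = rng.normal(size=(len(y_demo), 16))

X_train, X_test, y_train, y_test = stratified_split(X_demo, y_demo)
print("Train class counts:", np.bincount(y_train))
print("Test class counts: ", np.bincount(y_test))
```

Passing stratify=y matters here because the real dataset is imbalanced; without it, a rare class could end up under-represented in the test set.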


