Scikit-Learn fetch_openml() Dataset

The Iris dataset is a classic classification dataset: 150 samples, each described by four numeric features (sepal length, sepal width, petal length, and petal width) and labeled with one of three iris species.

The key arguments to fetch_openml() when loading the dataset are name, which identifies the dataset on OpenML; version, which selects a specific version of that dataset; and as_frame, which returns the data as a pandas DataFrame.
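To see what happens when these arguments are left out, one option is to read their defaults from the installed scikit-learn. The snippet below is a small sketch using the standard inspect module; it only examines the function signature and does not contact OpenML. The variable names params and defaults are illustrative.

```python
import inspect

from sklearn.datasets import fetch_openml

# Read the signature of fetch_openml from the installed scikit-learn.
params = inspect.signature(fetch_openml).parameters

# Collect the defaults of the three arguments discussed above.
defaults = {arg: params[arg].default for arg in ("name", "version", "as_frame")}

for arg, default in defaults.items():
    print(f"{arg}: default={default!r}")
```

Passing version and as_frame explicitly, as the example in this article does, keeps the result independent of whatever those defaults are in a given release.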

This is a classification problem where common algorithms like Logistic Regression, k-Nearest Neighbors, and Support Vector Machines are often applied.

from sklearn.datasets import fetch_openml

# Fetch the dataset
dataset = fetch_openml(name='iris', version=1, as_frame=True)

# Display dataset shape and types
print(f"Dataset shape: {dataset.data.shape}")
print(f"Feature types:\n{dataset.data.dtypes}")

# Show summary statistics
print(f"Summary statistics:\n{dataset.data.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data.head()}")

Running the example gives an output like:

Dataset shape: (150, 4)
Feature types:
sepallength    float64
sepalwidth     float64
petallength    float64
petalwidth     float64
dtype: object
Summary statistics:
       sepallength  sepalwidth  petallength  petalwidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.054000     3.758667    1.198667
std       0.828066    0.433594     1.764420    0.763161
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000
First few rows of the dataset:
   sepallength  sepalwidth  petallength  petalwidth
0          5.1         3.5          1.4         0.2
1          4.9         3.0          1.4         0.2
2          4.7         3.2          1.3         0.2
3          4.6         3.1          1.5         0.2
4          5.0         3.6          1.4         0.2

The steps are as follows:

1. Import the fetch_openml function from sklearn.datasets:
   - This function loads datasets directly from the OpenML repository.
2. Fetch the dataset using fetch_openml():
   - Use name='iris' to specify the Iris dataset.
   - Set version=1 to pin a specific version, since OpenML can host several versions of a dataset under the same name.
   - Use as_frame=True so that dataset.data is returned as a pandas DataFrame, which is easier to manipulate.
3. Print the dataset shape and feature types:
   - Access the shape using dataset.data.shape.
   - Show the data types of the features using dataset.data.dtypes.
4. Display summary statistics:
   - Use dataset.data.describe() to get a statistical summary of the features.
5. Display the first few rows of the dataset:
   - Print the initial rows using dataset.data.head() to understand the structure and content.
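The steps above only look at the features. The object returned by the loader also carries the class labels, which can be inspected in the same way. The sketch below is an assumption-laden stand-in: it uses load_iris, the copy of Iris bundled with scikit-learn, instead of fetch_openml so that it runs without a network connection. The bundled copy uses different column names and integer-coded labels, so the printed values may not match the OpenML version exactly, but the data, target, and frame attributes are accessed the same way.

```python
from sklearn.datasets import load_iris

# Offline stand-in for fetch_openml(name='iris', version=1, as_frame=True):
# load_iris reads a copy of the dataset that ships with scikit-learn.
bunch = load_iris(as_frame=True)

X = bunch.data      # features as a pandas DataFrame
y = bunch.target    # class labels as a pandas Series
full = bunch.frame  # features and target combined in one DataFrame

print(f"Features: {type(X).__name__} with shape {X.shape}")
print(f"Target: {type(y).__name__} with shape {y.shape}")

# Count the samples per class to check how balanced the dataset is.
print(f"Class counts:\n{y.value_counts()}")
print(f"Combined frame shape: {full.shape}")
```

Checking the class counts before modeling is a quick way to confirm that the labels loaded correctly and to see whether the classes are balanced.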

This example demonstrates how to load and explore the Iris dataset using scikit-learn’s fetch_openml() function, allowing you to inspect the data’s shape, types, and summary statistics. This sets the stage for further preprocessing and application of classification algorithms.
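As an illustration of that next step, the sketch below fits a Logistic Regression classifier and reports accuracy on a held-out split. It is a minimal example under stated assumptions, not a tuned pipeline: it uses the bundled load_iris copy as an offline stand-in for the OpenML download, and the 30% test split, random_state=0, and max_iter=1000 are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Offline stand-in for the OpenML download; returns (DataFrame, Series).
X, y = load_iris(return_X_y=True, as_frame=True)

# Hold out 30% of the samples, keeping class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# max_iter is raised from the default to give the solver room to converge.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() returns mean accuracy on the given data.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

The same pattern should work with the data and target attributes of the object returned by fetch_openml(), and with other classifiers such as k-Nearest Neighbors or Support Vector Machines in place of LogisticRegression.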
