Scikit-Learn load_iris() Dataset

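The Iris dataset is a classic benchmark for classification. It contains 150 iris flowers, 50 from each of three species (setosa, versicolor, and virginica), and the task is to identify the species from four measurements: sepal length, sepal width, petal length, and petal width, all in centimetres.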

Two arguments control what load_iris() returns. Setting return_X_y=True returns a (data, target) tuple instead of a Bunch object, and setting as_frame=True returns the data as a pandas DataFrame (and the target as a pandas Series) instead of NumPy arrays.
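As a minimal sketch of how these two arguments change the return value, the snippet below calls load_iris() three ways. The variable names are chosen here for illustration only.

```python
from sklearn.datasets import load_iris

# Default call: a Bunch object whose fields are NumPy arrays
bunch = load_iris()
print(type(bunch.data).__name__, bunch.data.shape)

# return_X_y=True: a (data, target) tuple instead of a Bunch
X, y = load_iris(return_X_y=True)
print(type(X).__name__, X.shape, y.shape)

# Both flags together: a (DataFrame, Series) tuple
X_df, y_ser = load_iris(return_X_y=True, as_frame=True)
print(type(X_df).__name__, type(y_ser).__name__)
```

The tuple form is convenient when the data goes straight into an estimator, while the DataFrame form keeps the column names for exploration.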

This is a multi-class classification problem where algorithms like Decision Trees, k-Nearest Neighbors, and Support Vector Machines are commonly applied.

from sklearn.datasets import load_iris

# Load the dataset
dataset = load_iris(as_frame=True)

# Display dataset shape and types
print(f"Dataset shape: {dataset.data.shape}")
print(f"Feature types:\n{dataset.data.dtypes}")

# Show summary statistics
print(f"Summary statistics:\n{dataset.data.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data.head()}")

# Split the dataset into input and output elements
X = dataset.data
y = dataset.target
print(f"Input shape: {X.shape}")
print(f"Output shape: {y.shape}")

Running the example gives an output like:

Dataset shape: (150, 4)
Feature types:
sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
dtype: object
Summary statistics:
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.057333           3.758000          1.199333
std             0.828066          0.435866           1.765298          0.762238
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000
First few rows of the dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
Input shape: (150, 4)
Output shape: (150,)

1. Import the load_iris function from sklearn.datasets:
   - This function loads the Iris dataset directly from the scikit-learn library.
2. Load the dataset using load_iris():
   - Use as_frame=True to return the dataset as a pandas DataFrame for easier data manipulation and analysis.
3. Print the dataset shape and feature types:
   - Access the shape using dataset.data.shape.
   - Show the data types of the features using dataset.data.dtypes.
4. Display summary statistics:
   - Use dataset.data.describe() to get a statistical summary of the dataset.
5. Display the first few rows of the dataset:
   - Print the initial rows using dataset.data.head() to get a sense of the dataset structure and content.
6. Split the dataset into input and output elements:
   - Separate the features (X) from the target variable (y).
   - Print the shapes of X and y to confirm the split.
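The target returned by load_iris() is integer-coded (0, 1, 2). As an optional extra check that is not part of the example above, the sketch below maps those codes to species names through the dataset's target_names attribute and counts the rows per species. The names species and counts are illustrative.

```python
from sklearn.datasets import load_iris

dataset = load_iris(as_frame=True)

# target_names holds the species label for each integer class (0, 1, 2)
print(list(dataset.target_names))

# Map each integer code to its species name, then count rows per species
species = dataset.target.map(lambda code: str(dataset.target_names[code]))
counts = species.value_counts()
print(counts)
```

Each species should appear 50 times, which confirms that the classes are balanced.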

This example demonstrates how to quickly load and explore the Iris dataset using scikit-learn’s load_iris() function, allowing you to inspect the data’s shape, types, summary statistics, and first few rows. This sets the stage for further preprocessing and application of classification algorithms.
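As one possible next step, the sketch below fits a k-Nearest Neighbors classifier on a stratified train/test split. The 30% test size, n_neighbors=5, and random_state=42 are arbitrary choices made for this illustration, not recommendations from the example above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load features and target as a (DataFrame, Series) pair
X, y = load_iris(return_X_y=True, as_frame=True)

# Hold out 30% of the rows; stratify=y keeps the species proportions
# the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the classifier on the training rows only
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# score() reports mean accuracy on the held-out rows
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Any of the other classifiers mentioned earlier, such as a Decision Tree or a Support Vector Machine, could be swapped in for KNeighborsClassifier without changing the rest of the code.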
