Scikit-Learn load_diabetes() Dataset


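The Diabetes dataset is a standard benchmark for regression tasks. It contains 442 patient records with ten baseline features (age, sex, body mass index, average blood pressure, and six blood serum measurements), and the target is a quantitative measure of disease progression one year after baseline.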

Key arguments to load_diabetes() include return_X_y, which returns a (data, target) tuple instead of the default Bunch object, and as_frame, which returns the features as a pandas DataFrame and the target as a pandas Series.
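As a small sketch of how these two arguments change the return value (the variable names are only illustrative), the following loads the dataset three ways and prints what comes back:

```python
from sklearn.datasets import load_diabetes

# Default call: a Bunch object holding NumPy arrays in .data and .target
bunch = load_diabetes()
print(type(bunch.data).__name__, bunch.data.shape)

# return_X_y=True: a (data, target) tuple instead of a Bunch
X, y = load_diabetes(return_X_y=True)
print(X.shape, y.shape)

# Both flags together: the features come back as a DataFrame, the target as a Series
X_df, y_s = load_diabetes(return_X_y=True, as_frame=True)
print(type(X_df).__name__, type(y_s).__name__)
```

The tuple form is convenient when the data goes straight into a model, while the DataFrame form keeps the column names for exploration.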

This is a regression problem where algorithms like Linear Regression, Decision Trees, and Support Vector Machines are commonly applied.

from sklearn.datasets import load_diabetes

# Load the dataset
dataset = load_diabetes(as_frame=True)

# Display dataset shape and types
print(f"Dataset shape: {dataset.data.shape}")
print(f"Feature types:\n{dataset.data.dtypes}")

# Show summary statistics
print(f"Summary statistics:\n{dataset.data.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data.head()}")

# Split the dataset into input and output elements
X = dataset.data
y = dataset.target
print(f"Input shape: {X.shape}")
print(f"Output shape: {y.shape}")

Running the example gives an output like:

Dataset shape: (442, 10)
Feature types:
age    float64
sex    float64
bmi    float64
bp     float64
s1     float64
s2     float64
s3     float64
s4     float64
s5     float64
s6     float64
dtype: object
Summary statistics:
                age           sex  ...            s5            s6
count  4.420000e+02  4.420000e+02  ...  4.420000e+02  4.420000e+02
mean  -2.511817e-19  1.230790e-17  ...  9.293722e-17  1.130318e-17
std    4.761905e-02  4.761905e-02  ...  4.761905e-02  4.761905e-02
min   -1.072256e-01 -4.464164e-02  ... -1.260971e-01 -1.377672e-01
25%   -3.729927e-02 -4.464164e-02  ... -3.324559e-02 -3.317903e-02
50%    5.383060e-03 -4.464164e-02  ... -1.947171e-03 -1.077698e-03
75%    3.807591e-02  5.068012e-02  ...  3.243232e-02  2.791705e-02
max    1.107267e-01  5.068012e-02  ...  1.335973e-01  1.356118e-01

[8 rows x 10 columns]
First few rows of the dataset:
        age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

[5 rows x 10 columns]
Input shape: (442, 10)
Output shape: (442,)

The steps are as follows:

1. Import the load_diabetes function from sklearn.datasets:
   - This function loads the Diabetes dataset directly from the scikit-learn library.
2. Load the dataset using load_diabetes():
   - Use as_frame=True to return the dataset as a pandas DataFrame for easier data manipulation and analysis.
3. Print the dataset shape and feature types:
   - Access the shape using dataset.data.shape.
   - Show the data types of the features using dataset.data.dtypes.
4. Display summary statistics:
   - Use dataset.data.describe() to get a statistical summary of the dataset.
5. Display the first few rows of the dataset:
   - Print the initial rows using dataset.data.head() to get a sense of the dataset structure and content.
6. Split the dataset into input and output elements:
   - Separate the features (X) from the target variable (y).
   - Print the shapes of X and y to confirm the split.
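The summary statistics in the output show means near zero and an identical standard deviation of about 0.0476 for every column. This is because the features are returned mean-centered and scaled so that each column's sum of squares is 1. The check below is a minimal sketch that confirms this on the default (scaled) data; the tolerance passed to np.allclose is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, _ = load_diabetes(return_X_y=True)

# Each column should be centered on zero (up to floating-point error)
col_means = X.mean(axis=0)
print(np.allclose(col_means, 0.0, atol=1e-8))

# Each column's sum of squares should be 1
col_sumsq = (X ** 2).sum(axis=0)
print(np.allclose(col_sumsq, 1.0))

# With unit sum of squares and zero mean, the sample standard deviation
# is 1 / sqrt(n_samples - 1), which is what describe() reports
n_samples = X.shape[0]
print(X.std(axis=0, ddof=1)[0], 1.0 / np.sqrt(n_samples - 1))
```

In scikit-learn 1.1 and later, passing scaled=False to load_diabetes() returns the raw, unscaled feature values instead.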

This example demonstrates how to quickly load and explore the Diabetes dataset using scikit-learn’s load_diabetes() function, allowing you to inspect the data’s shape, types, summary statistics, and first few rows. This sets the stage for further preprocessing and application of regression algorithms.
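As one possible next step, the sketch below fits a baseline linear regression on the loaded data. The 80/20 split, the random_state value, and the choice of R^2 as the metric are assumptions made for illustration, not part of the original example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out 20% of the samples for evaluation (arbitrary split and seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit an ordinary least squares model on the training portion
model = LinearRegression()
model.fit(X_train, y_train)

# Score the held-out samples with the coefficient of determination
r2 = r2_score(y_test, model.predict(X_test))
print(f"Test R^2: {r2:.3f}")
```

The same split can be reused to compare other regressors, such as decision trees or support vector machines, against this baseline.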
