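The Diabetes dataset is frequently used for regression tasks: the goal is to predict a quantitative measure of disease progression one year after baseline from ten baseline features (age, sex, body mass index, average blood pressure, and six blood serum measurements).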
Key arguments of load_diabetes() include return_X_y, which returns the data as a (data, target) tuple instead of a Bunch object, and as_frame, which returns the data as a pandas DataFrame.
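As a minimal sketch of how these two arguments change what comes back, the snippet below calls load_diabetes() three ways. The variable names (bunch, X_arr, y_arr, X_df, y_ser) are illustrative only.

```python
from sklearn.datasets import load_diabetes

# Default call: a Bunch object with NumPy arrays in .data and .target
bunch = load_diabetes()
print(type(bunch.data).__name__, bunch.data.shape)

# return_X_y=True: a plain (data, target) tuple instead of a Bunch
X_arr, y_arr = load_diabetes(return_X_y=True)
print(X_arr.shape, y_arr.shape)

# Both flags together: the tuple holds a pandas DataFrame and a pandas Series
X_df, y_ser = load_diabetes(return_X_y=True, as_frame=True)
print(type(X_df).__name__, type(y_ser).__name__)
```

The tuple form is convenient when the data goes straight into an estimator, while the Bunch form keeps extras such as feature names and the dataset description alongside the data.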
This is a regression problem where algorithms like Linear Regression, Decision Trees, and Support Vector Machines are commonly applied.
from sklearn.datasets import load_diabetes
# Load the dataset
dataset = load_diabetes(as_frame=True)
# Display dataset shape and types
print(f"Dataset shape: {dataset.data.shape}")
print(f"Feature types:\n{dataset.data.dtypes}")
# Show summary statistics
print(f"Summary statistics:\n{dataset.data.describe()}")
# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data.head()}")
# Split the dataset into input and output elements
X = dataset.data
y = dataset.target
print(f"Input shape: {X.shape}")
print(f"Output shape: {y.shape}")
Running the example gives an output like:
Dataset shape: (442, 10)
Feature types:
age float64
sex float64
bmi float64
bp float64
s1 float64
s2 float64
s3 float64
s4 float64
s5 float64
s6 float64
dtype: object
Summary statistics:
age sex ... s5 s6
count 4.420000e+02 4.420000e+02 ... 4.420000e+02 4.420000e+02
mean -2.511817e-19 1.230790e-17 ... 9.293722e-17 1.130318e-17
std 4.761905e-02 4.761905e-02 ... 4.761905e-02 4.761905e-02
min -1.072256e-01 -4.464164e-02 ... -1.260971e-01 -1.377672e-01
25% -3.729927e-02 -4.464164e-02 ... -3.324559e-02 -3.317903e-02
50% 5.383060e-03 -4.464164e-02 ... -1.947171e-03 -1.077698e-03
75% 3.807591e-02 5.068012e-02 ... 3.243232e-02 2.791705e-02
max 1.107267e-01 5.068012e-02 ... 1.335973e-01 1.356118e-01
[8 rows x 10 columns]
First few rows of the dataset:
age sex bmi ... s4 s5 s6
0 0.038076 0.050680 0.061696 ... -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 ... -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 ... -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 ... 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 ... -0.002592 -0.031988 -0.046641
[5 rows x 10 columns]
Input shape: (442, 10)
Output shape: (442,)
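The near-zero means and the identical std of about 0.0476 in every column are not a coincidence. According to the scikit-learn dataset description, each feature has been mean-centered and scaled so that the sum of squares of each column is 1. With 442 samples, the sample standard deviation is therefore 1/sqrt(441) = 1/21, or about 0.047619. The following sketch checks this numerically; the tolerances are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_diabetes

X, _ = load_diabetes(return_X_y=True)
n_samples = X.shape[0]

col_means = X.mean(axis=0)         # expected to be ~0 for every column
col_sum_sq = (X ** 2).sum(axis=0)  # expected to be ~1 for every column
col_std = X.std(axis=0, ddof=1)    # ddof=1 matches pandas describe()

print(np.allclose(col_means, 0.0, atol=1e-8))
print(np.allclose(col_sum_sq, 1.0))
print(np.allclose(col_std, 1.0 / np.sqrt(n_samples - 1)))
```

Because of this scaling, the values do not read as raw measurements. For example, sex takes only two distinct scaled values rather than 0 and 1.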
The steps are as follows:
1. Import the load_diabetes function from sklearn.datasets:
   - This function allows us to load the Diabetes dataset directly from the scikit-learn library.
2. Load the dataset using load_diabetes():
   - Use as_frame=True so that dataset.data is a pandas DataFrame (and dataset.target a pandas Series) for easier data manipulation and analysis.
3. Print the dataset shape and feature types:
   - Access the shape using dataset.data.shape.
   - Show the data types of the features using dataset.data.dtypes.
4. Display summary statistics:
   - Use dataset.data.describe() to get a statistical summary of the dataset.
5. Display the first few rows of the dataset:
   - Print the initial rows using dataset.data.head() to get a sense of the dataset structure and content.
6. Split the dataset into input and output elements:
   - Separate the features (X) from the target variable (y).
   - Print the shapes of X and y to confirm the split.
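As an optional variation on the steps above, when as_frame=True is set the returned object also exposes a frame attribute that holds the features and the target in a single DataFrame. The sketch below shows one way to inspect it and to recover X and y from it. The variable names are illustrative.

```python
from sklearn.datasets import load_diabetes

dataset = load_diabetes(as_frame=True)

# dataset.frame holds the ten feature columns plus a "target" column
frame = dataset.frame
print(frame.shape)
print(list(frame.columns))

# Recover the same input/output split from the combined frame
X = frame.drop(columns="target")
y = frame["target"]
print(f"Input shape: {X.shape}")
print(f"Output shape: {y.shape}")
```

Working from the combined frame can be handy for exploration, since the target sits next to the features in calls such as frame.describe().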
This example demonstrates how to quickly load and explore the Diabetes dataset using scikit-learn’s load_diabetes() function, allowing you to inspect the data’s shape, types, summary statistics, and first few rows. This sets the stage for further preprocessing and application of regression algorithms.
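As a possible next step, the sketch below compares the three regression algorithms mentioned earlier using cross-validated R^2. It is a rough baseline under assumed settings, not a tuned benchmark: the 5-fold split, max_depth=3, random_state=0, and the default SVR hyperparameters are all illustrative choices.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

# Illustrative, untuned model settings
models = {
    "LinearRegression": LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(max_depth=3, random_state=0),
    "SVR": SVR(),
}

# Mean R^2 over 5 cross-validation folds for each model
mean_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    mean_r2[name] = scores.mean()
    print(f"{name}: mean R^2 = {mean_r2[name]:.3f}")
```

The scores depend on the chosen settings, so models that rely on hyperparameters (the tree depth, or C and epsilon for SVR) may look quite different after tuning.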