Scikit-Learn load_wine() Dataset

The Wine dataset is commonly used for classification tasks: the goal is to predict which of three wine classes a sample belongs to from 13 chemical properties, using 178 samples in total.

Key function arguments when loading the dataset include return_X_y, which returns the data and target as a (data, target) tuple instead of a Bunch object, and as_frame, which returns the data as a pandas DataFrame.
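To make the effect of these two arguments concrete, the following minimal sketch calls load_wine() three ways. The variable names (bunch, X_arr, y_arr, X_df, y_ser) are chosen here for illustration only, and as_frame assumes a reasonably recent scikit-learn release with pandas installed.

```python
from sklearn.datasets import load_wine

# Default call: a Bunch object whose data and target are NumPy arrays
bunch = load_wine()
print(type(bunch.data).__name__, bunch.data.shape)

# return_X_y=True: a (data, target) tuple instead of a Bunch
X_arr, y_arr = load_wine(return_X_y=True)
print(X_arr.shape, y_arr.shape)

# Both flags together: X is a pandas DataFrame, y is a pandas Series
X_df, y_ser = load_wine(return_X_y=True, as_frame=True)
print(type(X_df).__name__, type(y_ser).__name__)
```

The tuple form is convenient when the data goes straight into a model, while the Bunch form keeps extras such as feature names and the dataset description available.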

This is a multiclass classification problem where algorithms like Logistic Regression, Support Vector Machines, and Random Forests are often applied.

from sklearn.datasets import load_wine

# Load the dataset
dataset = load_wine(as_frame=True)

# Display dataset shape and types
print(f"Dataset shape: {dataset.data.shape}")
print(f"Feature types:\n{dataset.data.dtypes}")

# Show summary statistics
print(f"Summary statistics:\n{dataset.data.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data.head()}")

# Split the dataset into input and output elements
X = dataset.data
y = dataset.target
print(f"Input shape: {X.shape}")
print(f"Output shape: {y.shape}")

Running the example gives an output like:

Dataset shape: (178, 13)
Feature types:
alcohol                         float64
malic_acid                      float64
ash                             float64
alcalinity_of_ash               float64
magnesium                       float64
total_phenols                   float64
flavanoids                      float64
nonflavanoid_phenols            float64
proanthocyanins                 float64
color_intensity                 float64
hue                             float64
od280/od315_of_diluted_wines    float64
proline                         float64
dtype: object
Summary statistics:
          alcohol  malic_acid  ...  od280/od315_of_diluted_wines      proline
count  178.000000  178.000000  ...                    178.000000   178.000000
mean    13.000618    2.336348  ...                      2.611685   746.893258
std      0.811827    1.117146  ...                      0.709990   314.907474
min     11.030000    0.740000  ...                      1.270000   278.000000
25%     12.362500    1.602500  ...                      1.937500   500.500000
50%     13.050000    1.865000  ...                      2.780000   673.500000
75%     13.677500    3.082500  ...                      3.170000   985.000000
max     14.830000    5.800000  ...                      4.000000  1680.000000

[8 rows x 13 columns]
First few rows of the dataset:
   alcohol  malic_acid   ash  ...   hue  od280/od315_of_diluted_wines  proline
0    14.23        1.71  2.43  ...  1.04                          3.92   1065.0
1    13.20        1.78  2.14  ...  1.05                          3.40   1050.0
2    13.16        2.36  2.67  ...  1.03                          3.17   1185.0
3    14.37        1.95  2.50  ...  0.86                          3.45   1480.0
4    13.24        2.59  2.87  ...  1.04                          2.93    735.0

[5 rows x 13 columns]
Input shape: (178, 13)
Output shape: (178,)

The steps are as follows:

Import the load_wine function from sklearn.datasets:
- This function allows us to load the Wine dataset directly from the scikit-learn library.
Load the dataset using load_wine():
- Use as_frame=True so that dataset.data is a pandas DataFrame and dataset.target is a pandas Series, which makes the data easier to manipulate and analyze.
Print the dataset shape and feature types:
- Access the shape using dataset.data.shape.
- Show the data types of the features using dataset.data.dtypes.
Display summary statistics:
- Use dataset.data.describe() to get a statistical summary of the dataset.
Display the first few rows of the dataset:
- Print the initial rows using dataset.data.head() to get a sense of the dataset structure and content.
Split the dataset into input and output elements:
- Separate the features (X) from the target variable (y).
- Print the shapes of X and y to confirm the split.
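Once the features and target are separated, a natural follow-up is to check how the samples are distributed across the classes. The sketch below is one way to do that, assuming the dataset was loaded with as_frame=True so that the target is a pandas Series; class_counts is a name introduced here for illustration.

```python
from sklearn.datasets import load_wine

dataset = load_wine(as_frame=True)

# target_names maps the integer labels in dataset.target to class names
print(list(dataset.target_names))

# Count the samples in each class, ordered by class label
class_counts = dataset.target.value_counts().sort_index()
print(class_counts)
```

Knowing the class balance up front helps when choosing a splitting strategy, for example whether to stratify a train/test split.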

This example demonstrates how to quickly load and explore the Wine dataset using scikit-learn’s load_wine() function, allowing you to inspect the data’s shape, types, summary statistics, and first few rows. This sets the stage for further preprocessing and application of classification algorithms.
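As a rough illustration of that next stage, the sketch below fits one of the classifiers mentioned earlier, Logistic Regression, on a held-out split. The choices made here (feature scaling with StandardScaler, a 25% stratified test split, random_state=0, max_iter=1000) are assumptions for the sake of the example, not settings prescribed by this tutorial.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load features and target directly as a DataFrame and a Series
X, y = load_wine(return_X_y=True, as_frame=True)

# Hold out 25% of the samples, keeping class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit a multiclass logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Mean accuracy on the held-out samples
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Scaling is included because the features sit on very different ranges (compare proline with hue in the summary statistics), which can slow or distort the fit of a linear model.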

See Also