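The Breast Cancer dataset (the Breast Cancer Wisconsin Diagnostic dataset bundled with scikit-learn) is a classic dataset for classification tasks: the goal is to predict whether a tumor is malignant or benign from 30 numeric features recorded for each of its 569 samples.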
Key function arguments when loading the dataset include return_X_y, which specifies whether the data should be returned as a (data, target) tuple, and as_frame, which returns the data as a pandas DataFrame.
This is a binary classification problem where common algorithms like Logistic Regression, Support Vector Machines, and Random Forests are often applied.
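The return_X_y and as_frame arguments described above can also be combined. The following is a minimal sketch, assuming a scikit-learn version recent enough to support as_frame; the type checks are only there for illustration.

```python
from sklearn.datasets import load_breast_cancer

# return_X_y=True returns a (data, target) tuple instead of a Bunch object;
# as_frame=True makes X a pandas DataFrame and y a pandas Series
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Confirm the container types and shapes of the two returned objects
print(f"Type of X: {type(X).__name__}, shape: {X.shape}")
print(f"Type of y: {type(y).__name__}, shape: {y.shape}")
```

This form is convenient when only the features and target are needed and the extra metadata on the returned object (such as feature names and the description) is not.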
from sklearn.datasets import load_breast_cancer
# Load the dataset
dataset = load_breast_cancer(as_frame=True)
# Display dataset shape and types
print(f"Dataset shape: {dataset.data.shape}")
print(f"Feature types:\n{dataset.data.dtypes}")
# Show summary statistics
print(f"Summary statistics:\n{dataset.data.describe()}")
# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data.head()}")
# Split the dataset into input and output elements
X = dataset.data
y = dataset.target
print(f"Input shape: {X.shape}")
print(f"Output shape: {y.shape}")
Running the example gives an output like:
Dataset shape: (569, 30)
Feature types:
mean radius float64
mean texture float64
mean perimeter float64
mean area float64
mean smoothness float64
mean compactness float64
mean concavity float64
mean concave points float64
mean symmetry float64
mean fractal dimension float64
radius error float64
texture error float64
perimeter error float64
area error float64
smoothness error float64
compactness error float64
concavity error float64
concave points error float64
symmetry error float64
fractal dimension error float64
worst radius float64
worst texture float64
worst perimeter float64
worst area float64
worst smoothness float64
worst compactness float64
worst concavity float64
worst concave points float64
worst symmetry float64
worst fractal dimension float64
dtype: object
Summary statistics:
mean radius mean texture ... worst symmetry worst fractal dimension
count 569.000000 569.000000 ... 569.000000 569.000000
mean 14.127292 19.289649 ... 0.290076 0.083946
std 3.524049 4.301036 ... 0.061867 0.018061
min 6.981000 9.710000 ... 0.156500 0.055040
25% 11.700000 16.170000 ... 0.250400 0.071460
50% 13.370000 18.840000 ... 0.282200 0.080040
75% 15.780000 21.800000 ... 0.317900 0.092080
max 28.110000 39.280000 ... 0.663800 0.207500
[8 rows x 30 columns]
First few rows of the dataset:
mean radius mean texture ... worst symmetry worst fractal dimension
0 17.99 10.38 ... 0.4601 0.11890
1 20.57 17.77 ... 0.2750 0.08902
2 19.69 21.25 ... 0.3613 0.08758
3 11.42 20.38 ... 0.6638 0.17300
4 20.29 14.34 ... 0.2364 0.07678
[5 rows x 30 columns]
Input shape: (569, 30)
Output shape: (569,)
The steps are as follows:
1. Import the load_breast_cancer function from sklearn.datasets:
   - This function allows us to load the Breast Cancer dataset directly from the scikit-learn library.
2. Load the dataset using load_breast_cancer():
   - Use as_frame=True to return the dataset as a pandas DataFrame for easier data manipulation and analysis.
3. Print the dataset shape and feature types:
   - Access the shape using dataset.data.shape.
   - Show the data types of the features using dataset.data.dtypes.
4. Display summary statistics:
   - Use dataset.data.describe() to get a statistical summary of the dataset.
5. Display the first few rows of the dataset:
   - Print the initial rows using dataset.data.head() to get a sense of the dataset structure and content.
6. Split the dataset into input and output elements:
   - Separate the features (X) from the target variable (y).
   - Print the shapes of X and y to confirm the split.
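Because the task is to separate malignant from benign tumors, it can also help to check how the two classes are encoded and how balanced they are. The sketch below is an optional extra step that is not part of the example above; it assumes the same as_frame=True load and uses the target_names attribute of the returned object together with the pandas value_counts() method.

```python
from sklearn.datasets import load_breast_cancer

# Load the dataset with pandas containers, as in the main example
dataset = load_breast_cancer(as_frame=True)

# target_names maps the integer labels to class names:
# label 0 is 'malignant' and label 1 is 'benign'
class_names = list(dataset.target_names)
print(f"Class names: {class_names}")

# Count how many samples fall into each class, ordered by label
class_counts = dataset.target.value_counts().sort_index()
print(f"Samples per class:\n{class_counts}")
```

The dataset contains 212 malignant and 357 benign samples, so the classes are moderately imbalanced, which is worth keeping in mind when splitting the data or choosing an evaluation metric.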
This example demonstrates how to quickly load and explore the Breast Cancer dataset using scikit-learn’s load_breast_cancer() function, allowing you to inspect the data’s shape, types, and summary statistics, and to preview its first few rows. This sets the stage for further preprocessing and application of classification algorithms.
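As one possible next step, the sketch below fits a simple baseline classifier. It is an illustration under stated assumptions rather than a recommended configuration: the 80/20 stratified split, the random_state value, the StandardScaler step, and the max_iter setting are all choices made for this sketch, and Logistic Regression is just one of the algorithms mentioned earlier.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load features and target directly as a (DataFrame, Series) pair
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Hold out 20% of the samples for testing; stratify to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale the features, then fit a logistic regression classifier
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Report accuracy on the held-out test set
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Placing the scaler inside the pipeline means it is fitted on the training split only, which avoids leaking information from the test set into preprocessing.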