The KDD Cup 1999 dataset consists of network connection records used in the 1999 KDD Cup competition for the task of network intrusion detection.
Key function arguments when loading the dataset include subset, which selects the portion of the data to fetch ('SA', 'SF', 'http', 'smtp', or None for the full dataset), and return_X_y, which returns the features and labels as separate objects instead of a single Bunch.
This is a classification problem where common algorithms like Support Vector Machines (SVM), Decision Trees, and Random Forests are often applied.
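To make the subset and return_X_y arguments concrete, here is a minimal sketch. The helper name load_features_and_labels and its fetcher parameter are hypothetical; they exist only so the call pattern can be tried without triggering a download.

```python
from sklearn.datasets import fetch_kddcup99

# Subset names accepted by fetch_kddcup99; None means the full dataset.
VALID_SUBSETS = (None, "SA", "SF", "http", "smtp")


def load_features_and_labels(subset=None, fetcher=fetch_kddcup99):
    """Hypothetical helper: return (X, y) for one subset.

    `fetcher` is injectable only so the helper can be tried without a download.
    """
    if subset not in VALID_SUBSETS:
        raise ValueError(
            f"Unknown subset {subset!r}; expected one of {VALID_SUBSETS}"
        )
    # With return_X_y=True the function returns a (data, target) tuple
    # rather than a Bunch object.
    X, y = fetcher(subset=subset, return_X_y=True)
    return X, y


# Example call (downloads the data on first use, so it is left commented out):
# X, y = load_features_and_labels("SA")
# print(X.shape, y.shape)
```

Validating the subset name up front is a design choice of this sketch: it reports a typo before any download is attempted.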
from sklearn.datasets import fetch_kddcup99
# Fetch the dataset
dataset = fetch_kddcup99(subset='SA', as_frame=True)
# Display dataset shape and types
print(f"Dataset shape: {dataset.data.shape}")
print(f"Feature types:\n{dataset.data.dtypes}")
# Show summary statistics
print(f"Summary statistics:\n{dataset.data.describe()}")
# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data.head()}")
Running the example gives an output like:
Dataset shape: (100655, 41)
Feature types:
duration object
protocol_type object
service object
flag object
src_bytes object
dst_bytes object
land object
wrong_fragment object
urgent object
hot object
num_failed_logins object
logged_in object
num_compromised object
root_shell object
su_attempted object
num_root object
num_file_creations object
num_shells object
num_access_files object
num_outbound_cmds object
is_host_login object
is_guest_login object
count object
srv_count object
serror_rate object
srv_serror_rate object
rerror_rate object
srv_rerror_rate object
same_srv_rate object
diff_srv_rate object
srv_diff_host_rate object
dst_host_count object
dst_host_srv_count object
dst_host_same_srv_rate object
dst_host_diff_srv_rate object
dst_host_same_src_port_rate object
dst_host_srv_diff_host_rate object
dst_host_serror_rate object
dst_host_srv_serror_rate object
dst_host_rerror_rate object
dst_host_srv_rerror_rate object
dtype: object
Summary statistics:
duration protocol_type ... dst_host_rerror_rate dst_host_srv_rerror_rate
count 100655 100655 ... 100655.0 100655.0
unique 2354 3 ... 101.0 101.0
top 0 b'tcp' ... 0.0 0.0
freq 88959 77771 ... 91144.0 91227.0
[4 rows x 41 columns]
First few rows of the dataset:
duration protocol_type ... dst_host_rerror_rate dst_host_srv_rerror_rate
0 0 b'tcp' ... 0.0 0.0
1 0 b'tcp' ... 0.0 0.0
2 0 b'tcp' ... 0.0 0.0
3 0 b'tcp' ... 0.0 0.0
4 0 b'tcp' ... 0.0 0.0
[5 rows x 41 columns]
The steps are as follows:
Import the fetch_kddcup99 function from sklearn.datasets:
- This function allows us to load the KDD Cup 1999 dataset directly from the scikit-learn library.
Fetch the dataset using fetch_kddcup99(subset='SA', as_frame=True):
- Use the subset argument to specify the subset of the dataset to fetch, with 'SA' being one of the available subsets.
- Use as_frame=True to return the dataset as a pandas DataFrame for easier data manipulation and analysis.
Print the dataset shape and feature types:
- Access the shape using dataset.data.shape.
- Show the data types of the features using dataset.data.dtypes.
Display summary statistics:
- Use dataset.data.describe() to get a statistical summary of the dataset.
Display the first few rows of the dataset:
- Print the initial rows using dataset.data.head() to get a sense of the dataset structure and content.
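The output shown earlier reports every column as object, which is why describe() returns count, unique, top, and freq rather than means and quantiles. The sketch below shows one possible way to decode the byte strings and recover numeric dtypes. It runs on a small hand-made DataFrame that stands in for dataset.data (an assumption, so nothing is downloaded); the values and the helper name to_numeric_where_possible are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for dataset.data: every column has dtype object,
# and categorical values are byte strings, as in the output above.
frame = pd.DataFrame(
    {
        "duration": [0, 0, 12],
        "protocol_type": [b"tcp", b"udp", b"tcp"],
        "src_bytes": [181, 239, 235],
        "dst_host_rerror_rate": [0.0, 0.0, 0.5],
    },
    dtype=object,
)


def to_numeric_where_possible(df):
    """Return a copy with bytes decoded and numeric-looking columns converted.

    Illustrative helper, not part of scikit-learn or pandas.
    """
    out = df.copy()
    for col in out.columns:
        # Decode bytes values such as b'tcp' into plain str.
        out[col] = out[col].map(
            lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
        )
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # Leave genuinely categorical columns (e.g. protocol_type) as-is.
            pass
    return out


converted = to_numeric_where_possible(frame)
print(converted.dtypes)
print(converted.describe())
```

On the real data, the same helper would be called as to_numeric_where_possible(dataset.data); after conversion, describe() summarizes the numeric columns with mean, standard deviation, and quantiles.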
This example demonstrates how to quickly load and explore the KDD Cup 1999 dataset using scikit-learn’s fetch_kddcup99() function, allowing you to inspect the data’s shape, types, summary statistics, and first few rows. This sets the stage for further preprocessing and application of classification algorithms.
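As a rough sketch of that next step, the snippet below one-hot encodes a categorical column and fits a Random Forest inside a single pipeline. Everything about the data is an assumption: it is a synthetic stand-in with three columns named after real KDD Cup features, and the "attack"/"normal" labels come from a toy rule, so the reported accuracy says nothing about performance on the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the real features and labels (assumption).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(
    {
        "protocol_type": rng.choice(["tcp", "udp", "icmp"], size=n),
        "src_bytes": rng.integers(0, 1000, size=n),
        "dst_bytes": rng.integers(0, 1000, size=n),
    }
)
# Toy rule so the labels are learnable: large src_bytes means "attack".
y = np.where(X["src_bytes"] > 500, "attack", "normal")

categorical = ["protocol_type"]
numeric = ["src_bytes", "dst_bytes"]

# One-hot encode the categorical column; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", "passthrough", numeric),
    ]
)
model = Pipeline(
    [
        ("preprocess", preprocess),
        ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ]
)

# Hold out a quarter of the rows, keeping the class balance in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy on the toy data: {accuracy:.2f}")
```

Keeping the encoder inside the pipeline means the categories are learned from the training split only, which avoids leaking information from the held-out rows.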