Scikit-Learn fetch_kddcup99() Dataset


The KDD Cup 1999 dataset consists of network connection records used in the 1999 KDD Cup competition for the task of network intrusion detection.

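Key arguments when loading the dataset include subset, which selects the subset to fetch ('SA', 'SF', 'http', 'smtp', or None for all records); percent10, which defaults to True and loads the 10 percent sample rather than the full data; as_frame, which returns the data as pandas objects; and return_X_y, which returns the features and labels as a (data, target) tuple instead of a single Bunch object.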
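As a minimal sketch of how these arguments combine, the wrapper below requests the features and labels as separate pandas objects. The name load_kdd_xy is hypothetical and not part of scikit-learn. The fetch sits inside a function so that nothing is downloaded until it is called.

```python
from sklearn.datasets import fetch_kddcup99


def load_kdd_xy(subset="SA", percent10=True):
    """Return (X, y) for the chosen subset (hypothetical helper).

    The first call downloads the data and caches it locally.
    """
    # return_X_y=True yields a (data, target) tuple instead of a Bunch.
    # as_frame=True makes X a pandas DataFrame and y a pandas Series.
    X, y = fetch_kddcup99(
        subset=subset,
        percent10=percent10,
        return_X_y=True,
        as_frame=True,
    )
    return X, y
```

Calling X, y = load_kdd_xy("SA") should then give a feature frame and a label series with one row per connection record.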

Each record is labeled either as normal traffic or as a specific attack type, so this is a classification problem. Common algorithms such as Support Vector Machines (SVM), Decision Trees, and Random Forests are often applied.

from sklearn.datasets import fetch_kddcup99

# Fetch the dataset
dataset = fetch_kddcup99(subset='SA', as_frame=True)

# Display dataset shape and types
print(f"Dataset shape: {dataset.data.shape}")
print(f"Feature types:\n{dataset.data.dtypes}")

# Show summary statistics
print(f"Summary statistics:\n{dataset.data.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data.head()}")

Running the example gives an output like:

Dataset shape: (100655, 41)
Feature types:
duration                       object
protocol_type                  object
service                        object
flag                           object
src_bytes                      object
dst_bytes                      object
land                           object
wrong_fragment                 object
urgent                         object
hot                            object
num_failed_logins              object
logged_in                      object
num_compromised                object
root_shell                     object
su_attempted                   object
num_root                       object
num_file_creations             object
num_shells                     object
num_access_files               object
num_outbound_cmds              object
is_host_login                  object
is_guest_login                 object
count                          object
srv_count                      object
serror_rate                    object
srv_serror_rate                object
rerror_rate                    object
srv_rerror_rate                object
same_srv_rate                  object
diff_srv_rate                  object
srv_diff_host_rate             object
dst_host_count                 object
dst_host_srv_count             object
dst_host_same_srv_rate         object
dst_host_diff_srv_rate         object
dst_host_same_src_port_rate    object
dst_host_srv_diff_host_rate    object
dst_host_serror_rate           object
dst_host_srv_serror_rate       object
dst_host_rerror_rate           object
dst_host_srv_rerror_rate       object
dtype: object
Summary statistics:
        duration protocol_type  ... dst_host_rerror_rate dst_host_srv_rerror_rate
count     100655        100655  ...             100655.0                 100655.0
unique      2354             3  ...                101.0                    101.0
top            0        b'tcp'  ...                  0.0                      0.0
freq       88959         77771  ...              91144.0                  91227.0

[4 rows x 41 columns]
First few rows of the dataset:
  duration protocol_type  ... dst_host_rerror_rate dst_host_srv_rerror_rate
0        0        b'tcp'  ...                  0.0                      0.0
1        0        b'tcp'  ...                  0.0                      0.0
2        0        b'tcp'  ...                  0.0                      0.0
3        0        b'tcp'  ...                  0.0                      0.0
4        0        b'tcp'  ...                  0.0                      0.0

[5 rows x 41 columns]

The steps are as follows:

1. Import the fetch_kddcup99 function from sklearn.datasets:
   - This function downloads the KDD Cup 1999 dataset on first use and caches it locally.
2. Fetch the dataset using fetch_kddcup99(subset='SA', as_frame=True):
   - Use the subset argument to specify the subset of the dataset to fetch, with 'SA' being one of the available subsets.
   - Use as_frame=True to return the dataset as a pandas DataFrame for easier data manipulation and analysis.
3. Print the dataset shape and feature types:
   - Access the shape using dataset.data.shape.
   - Show the data types of the features using dataset.data.dtypes. In the output above every column has the object dtype, and categorical values such as protocol_type are byte strings (b'tcp').
4. Display summary statistics:
   - Use dataset.data.describe() to get a summary of the dataset. Because the columns are object-typed, the summary reports count, unique, top, and freq rather than means and standard deviations.
5. Display the first few rows of the dataset:
   - Print the initial rows using dataset.data.head() to get a sense of the dataset structure and content.
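Since the columns in the output above all have the object dtype, a common first preprocessing step is to decode the byte strings and convert the numeric columns. The following is a minimal sketch under that assumption. The helper name clean_kdd_frame and the three-row sample frame are made up for illustration and are not part of scikit-learn or of the dataset.

```python
import pandas as pd


def clean_kdd_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Decode byte strings and convert numeric-looking columns (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        # Byte values such as b'tcp' become ordinary strings; other values pass through.
        out[col] = out[col].map(
            lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
        )
        try:
            # Columns that hold only numbers get an integer or float dtype.
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # Text columns (for example protocol_type) stay as strings.
            pass
    return out


# Tiny hand-made frame that mimics the object-typed columns; not real KDD records.
sample = pd.DataFrame(
    {
        "duration": [0, 0, 12],
        "protocol_type": [b"tcp", b"udp", b"tcp"],
        "serror_rate": [0.0, 0.5, 1.0],
    },
    dtype=object,
)
cleaned = clean_kdd_frame(sample)
print(cleaned.dtypes)
```

The same function could be applied to dataset.data from the example. After conversion, describe() reports numeric statistics for the numeric columns.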

This example demonstrates how to quickly load and explore the KDD Cup 1999 dataset using scikit-learn’s fetch_kddcup99() function, allowing you to inspect the data’s shape, types, summary statistics, and first rows. This sets the stage for further preprocessing and application of classification algorithms.
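As one possible next step, the sketch below one-hot encodes the three categorical columns and fits a Random Forest. It assumes the byte-valued columns have already been decoded to str and that all remaining columns are numeric. The build_classifier helper and the toy records are made up for illustration. Only the column names and the label strings follow the dataset's conventions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Categorical feature columns in the KDD Cup 1999 data.
CATEGORICAL = ["protocol_type", "service", "flag"]


def build_classifier(categorical=CATEGORICAL, random_state=0):
    """One-hot encode categorical columns, pass the rest through, fit a forest."""
    preprocess = ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), list(categorical)),
        ],
        remainder="passthrough",  # numeric columns are used unchanged
    )
    model = RandomForestClassifier(n_estimators=50, random_state=random_state)
    return Pipeline(steps=[("preprocess", preprocess), ("model", model)])


# Made-up records for illustration only; not taken from the dataset.
toy_X = pd.DataFrame(
    {
        "protocol_type": ["tcp", "tcp", "udp", "icmp", "tcp", "icmp"],
        "service": ["http", "http", "domain_u", "ecr_i", "smtp", "ecr_i"],
        "flag": ["SF", "SF", "SF", "SF", "SF", "SF"],
        "src_bytes": [181, 239, 45, 1032, 1200, 1032],
        "dst_bytes": [5450, 486, 44, 0, 330, 0],
    }
)
toy_y = ["normal.", "normal.", "normal.", "smurf.", "normal.", "smurf."]

clf = build_classifier().fit(toy_X, toy_y)
print(clf.predict(toy_X))
```

On the real data you would fit on a training split and evaluate on held-out records. The toy frame here only shows that the pipeline accepts mixed categorical and numeric columns.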
