Scikit-Learn fetch_covtype() Dataset

The Covertype dataset describes forest cover types using cartographic variables. It contains 581,012 samples with 54 features each and is used for classification: predicting which of seven forest cover types a sample belongs to from features such as elevation, aspect, slope, and soil type.

Key arguments of fetch_covtype() include return_X_y, which returns the data and target as a (data, target) tuple instead of a Bunch object, and as_frame, which returns the data as a pandas DataFrame and the target as a pandas Series.
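As a minimal sketch of how these two arguments combine, the helper below wraps fetch_covtype(). The name load_covertype is hypothetical and not part of scikit-learn. Because the first call to fetch_covtype() downloads the data and caches it locally, the helper is only defined here and the example call is left commented out.

```python
from sklearn.datasets import fetch_covtype


def load_covertype(as_frame=False):
    """Hypothetical helper: return (X, y) for the Covertype dataset.

    as_frame=False -> X and y are NumPy arrays.
    as_frame=True  -> X is a pandas DataFrame and y is a pandas Series.
    """
    # return_X_y=True makes fetch_covtype return a tuple instead of a Bunch.
    X, y = fetch_covtype(return_X_y=True, as_frame=as_frame)
    return X, y


# Example call (triggers a download the first time it runs):
# X, y = load_covertype(as_frame=True)
# print(X.shape, y.shape)
```

Both arguments default to False, so a plain fetch_covtype() call returns a Bunch holding NumPy arrays.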

This is a multi-class classification problem to which common algorithms such as Logistic Regression, Decision Trees, and Random Forests are often applied.
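To illustrate what applying one of these algorithms might look like, here is a rough sketch of a baseline that fits a decision tree on a stratified train/test split. The function name baseline_accuracy, the 80/20 split, and the choice of a decision tree are assumptions made for illustration only. So that the sketch runs quickly without a download, it is demonstrated on a small synthetic stand-in rather than on Covertype itself.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def baseline_accuracy(X, y, test_size=0.2, random_state=0):
    """Fit a decision tree on a stratified split and return test accuracy."""
    # stratify=y keeps the class proportions similar in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    model = DecisionTreeClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in (NOT Covertype data) so the sketch runs without a download.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)
print(f"Accuracy on synthetic stand-in: {baseline_accuracy(X_demo, y_demo):.3f}")

# With the real data:
# from sklearn.datasets import fetch_covtype
# X, y = fetch_covtype(return_X_y=True)
# print(baseline_accuracy(X, y))
```

Swapping DecisionTreeClassifier for LogisticRegression or RandomForestClassifier only requires changing the model line; on the full dataset, expect training to take noticeably longer than on the stand-in.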

from sklearn.datasets import fetch_covtype

# Fetch the dataset
dataset = fetch_covtype(as_frame=True)

# Display dataset shape and types
print(f"Dataset shape: {dataset.data.shape}")
print(f"Feature types:\n{dataset.data.dtypes}")

# Show summary statistics
print(f"Summary statistics:\n{dataset.data.describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data.head()}")

Running the example gives an output like:

Dataset shape: (581012, 54)
Feature types:
Elevation                             float64
Aspect                                float64
Slope                                 float64
Horizontal_Distance_To_Hydrology      float64
Vertical_Distance_To_Hydrology        float64
Horizontal_Distance_To_Roadways       float64
Hillshade_9am                         float64
Hillshade_Noon                        float64
Hillshade_3pm                         float64
Horizontal_Distance_To_Fire_Points    float64
Wilderness_Area_0                     float64
Wilderness_Area_1                     float64
Wilderness_Area_2                     float64
Wilderness_Area_3                     float64
Soil_Type_0                           float64
Soil_Type_1                           float64
Soil_Type_2                           float64
Soil_Type_3                           float64
Soil_Type_4                           float64
Soil_Type_5                           float64
Soil_Type_6                           float64
Soil_Type_7                           float64
Soil_Type_8                           float64
Soil_Type_9                           float64
Soil_Type_10                          float64
Soil_Type_11                          float64
Soil_Type_12                          float64
Soil_Type_13                          float64
Soil_Type_14                          float64
Soil_Type_15                          float64
Soil_Type_16                          float64
Soil_Type_17                          float64
Soil_Type_18                          float64
Soil_Type_19                          float64
Soil_Type_20                          float64
Soil_Type_21                          float64
Soil_Type_22                          float64
Soil_Type_23                          float64
Soil_Type_24                          float64
Soil_Type_25                          float64
Soil_Type_26                          float64
Soil_Type_27                          float64
Soil_Type_28                          float64
Soil_Type_29                          float64
Soil_Type_30                          float64
Soil_Type_31                          float64
Soil_Type_32                          float64
Soil_Type_33                          float64
Soil_Type_34                          float64
Soil_Type_35                          float64
Soil_Type_36                          float64
Soil_Type_37                          float64
Soil_Type_38                          float64
Soil_Type_39                          float64
dtype: object
Summary statistics:
           Elevation         Aspect  ...   Soil_Type_38   Soil_Type_39
count  581012.000000  581012.000000  ...  581012.000000  581012.000000
mean     2959.365301     155.656807  ...       0.023762       0.015060
std       279.984734     111.913721  ...       0.152307       0.121791
min      1859.000000       0.000000  ...       0.000000       0.000000
25%      2809.000000      58.000000  ...       0.000000       0.000000
50%      2996.000000     127.000000  ...       0.000000       0.000000
75%      3163.000000     260.000000  ...       0.000000       0.000000
max      3858.000000     360.000000  ...       1.000000       1.000000

[8 rows x 54 columns]
First few rows of the dataset:
   Elevation  Aspect  Slope  ...  Soil_Type_37  Soil_Type_38  Soil_Type_39
0     2596.0    51.0    3.0  ...           0.0           0.0           0.0
1     2590.0    56.0    2.0  ...           0.0           0.0           0.0
2     2804.0   139.0    9.0  ...           0.0           0.0           0.0
3     2785.0   155.0   18.0  ...           0.0           0.0           0.0
4     2595.0    45.0    2.0  ...           0.0           0.0           0.0

[5 rows x 54 columns]

The steps are as follows:

1. Import the fetch_covtype function from sklearn.datasets:
   - This function downloads the Covertype dataset on first use, caches it locally, and loads it from the cache on later calls.
2. Fetch the dataset using fetch_covtype():
   - Use as_frame=True so that dataset.data is a pandas DataFrame, which makes data manipulation and analysis easier.
3. Print the dataset shape and feature types:
   - Access the shape using dataset.data.shape.
   - Show the data types of the features using dataset.data.dtypes.
4. Display summary statistics:
   - Use dataset.data.describe() to get a statistical summary of each feature.
5. Display the first few rows of the dataset:
   - Print the initial rows using dataset.data.head() to get a sense of the dataset structure and content.
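
The steps above inspect only the features in dataset.data; the class labels live in dataset.target. As a possible next step, the sketch below tabulates how many samples fall into each class. The helper name class_distribution is hypothetical, and the demo labels are hand-made placeholders rather than real Covertype values.

```python
import pandas as pd


def class_distribution(target):
    """Return per-class counts and proportions, sorted by class label."""
    counts = pd.Series(target).value_counts().sort_index()
    return pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})


# Tiny hand-made label vector (NOT real Covertype labels) to show the output layout.
demo_labels = pd.Series([1, 2, 2, 2, 3, 7, 2, 1], name="Cover_Type")
print(class_distribution(demo_labels))

# With the real data:
# from sklearn.datasets import fetch_covtype
# dataset = fetch_covtype(as_frame=True)
# print(class_distribution(dataset.target))
```

Checking the class balance before modeling helps decide whether a stratified split or class weighting is worth considering.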

This example demonstrates how to quickly load and explore the Covertype dataset using scikit-learn’s fetch_covtype() function, inspecting the data’s shape, feature types, summary statistics, and first few rows. This sets the stage for further preprocessing and for applying classification algorithms.

See Also