The Covertype dataset records the forest cover type of land cells together with the cartographic variables that describe them. It is used for classification tasks that predict the cover type from features such as elevation, aspect, and soil type.
Key function arguments when loading the dataset include return_X_y, which returns the data and target as a (data, target) tuple instead of a Bunch object, and as_frame, which returns the data as a pandas DataFrame.
This is a classification problem where common algorithms like Logistic Regression, Decision Trees, and Random Forests are often applied.
from sklearn.datasets import fetch_covtype
# Fetch the dataset
dataset = fetch_covtype(as_frame=True)
# Display dataset shape and types
print(f"Dataset shape: {dataset.data.shape}")
print(f"Feature types:\n{dataset.data.dtypes}")
# Show summary statistics
print(f"Summary statistics:\n{dataset.data.describe()}")
# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data.head()}")
Running the example gives an output like:
Dataset shape: (581012, 54)
Feature types:
Elevation float64
Aspect float64
Slope float64
Horizontal_Distance_To_Hydrology float64
Vertical_Distance_To_Hydrology float64
Horizontal_Distance_To_Roadways float64
Hillshade_9am float64
Hillshade_Noon float64
Hillshade_3pm float64
Horizontal_Distance_To_Fire_Points float64
Wilderness_Area_0 float64
Wilderness_Area_1 float64
Wilderness_Area_2 float64
Wilderness_Area_3 float64
Soil_Type_0 float64
Soil_Type_1 float64
Soil_Type_2 float64
Soil_Type_3 float64
Soil_Type_4 float64
Soil_Type_5 float64
Soil_Type_6 float64
Soil_Type_7 float64
Soil_Type_8 float64
Soil_Type_9 float64
Soil_Type_10 float64
Soil_Type_11 float64
Soil_Type_12 float64
Soil_Type_13 float64
Soil_Type_14 float64
Soil_Type_15 float64
Soil_Type_16 float64
Soil_Type_17 float64
Soil_Type_18 float64
Soil_Type_19 float64
Soil_Type_20 float64
Soil_Type_21 float64
Soil_Type_22 float64
Soil_Type_23 float64
Soil_Type_24 float64
Soil_Type_25 float64
Soil_Type_26 float64
Soil_Type_27 float64
Soil_Type_28 float64
Soil_Type_29 float64
Soil_Type_30 float64
Soil_Type_31 float64
Soil_Type_32 float64
Soil_Type_33 float64
Soil_Type_34 float64
Soil_Type_35 float64
Soil_Type_36 float64
Soil_Type_37 float64
Soil_Type_38 float64
Soil_Type_39 float64
dtype: object
Summary statistics:
Elevation Aspect ... Soil_Type_38 Soil_Type_39
count 581012.000000 581012.000000 ... 581012.000000 581012.000000
mean 2959.365301 155.656807 ... 0.023762 0.015060
std 279.984734 111.913721 ... 0.152307 0.121791
min 1859.000000 0.000000 ... 0.000000 0.000000
25% 2809.000000 58.000000 ... 0.000000 0.000000
50% 2996.000000 127.000000 ... 0.000000 0.000000
75% 3163.000000 260.000000 ... 0.000000 0.000000
max 3858.000000 360.000000 ... 1.000000 1.000000
[8 rows x 54 columns]
First few rows of the dataset:
Elevation Aspect Slope ... Soil_Type_37 Soil_Type_38 Soil_Type_39
0 2596.0 51.0 3.0 ... 0.0 0.0 0.0
1 2590.0 56.0 2.0 ... 0.0 0.0 0.0
2 2804.0 139.0 9.0 ... 0.0 0.0 0.0
3 2785.0 155.0 18.0 ... 0.0 0.0 0.0
4 2595.0 45.0 2.0 ... 0.0 0.0 0.0
[5 rows x 54 columns]
The steps are as follows:
1. Import the fetch_covtype function from sklearn.datasets:
   - This function allows us to load the Covertype dataset directly from the scikit-learn library.
2. Fetch the dataset using fetch_covtype():
   - Use as_frame=True to return the dataset as a pandas DataFrame for easier data manipulation and analysis.
3. Print the dataset shape and feature types:
   - Access the shape using dataset.data.shape.
   - Show the data types of the features using dataset.data.dtypes.
4. Display summary statistics:
   - Use dataset.data.describe() to get a statistical summary of the dataset.
5. Display the first few rows of the dataset:
   - Print the initial rows using dataset.data.head() to get a sense of the dataset structure and content.
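The steps above look only at the feature columns. The label column can be inspected in a similar way, which is worth doing because the seven cover types are far from evenly represented. The sketch below is one possible way to do this. The helper name summarize_target and the tiny demo Series are hypothetical and exist only so the snippet runs without downloading anything. The commented lines show how it would be applied to dataset.target, assuming the dataset was fetched with as_frame=True.

```python
import pandas as pd


def summarize_target(target: pd.Series) -> pd.DataFrame:
    """Return per-class counts and fractions, sorted by class label."""
    counts = target.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "fraction": counts / counts.sum()})


# Tiny hypothetical label column so the sketch runs without the download.
demo_target = pd.Series([1, 2, 2, 3, 2, 1, 7], name="Cover_Type")
summary = summarize_target(demo_target)
print(summary)

# With the real data (downloads on first use):
# from sklearn.datasets import fetch_covtype
# dataset = fetch_covtype(as_frame=True)
# print(summarize_target(dataset.target))
```

A heavily skewed fraction column is a hint that a stratified train/test split and a metric other than plain accuracy may be appropriate later on.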
This example demonstrates how to quickly load and explore the Covertype dataset using scikit-learn’s fetch_covtype() function, allowing you to inspect the data’s shape, types, summary statistics, and first few rows. This sets the stage for further preprocessing and application of classification algorithms.
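As one possible next step, the sketch below fits a baseline classifier. It is an illustration under stated assumptions rather than a recommended pipeline: the helper fit_baseline is hypothetical, the random forest and its settings (50 trees, a 25% held-out split) are arbitrary choices, and the data is a synthetic stand-in from make_classification with the same layout as Covertype (54 features, 7 classes) so that the snippet runs without the download. The commented lines show how the real data could be passed in using return_X_y=True.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def fit_baseline(X, y, seed=0):
    """Fit a small random forest and return (model, held-out accuracy)."""
    # Stratify so every class appears in both splits in similar proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Synthetic stand-in (an assumption for illustration, not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=54, n_informative=10,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
model, accuracy = fit_baseline(X_demo, y_demo)
print(f"Held-out accuracy on synthetic data: {accuracy:.3f}")

# With the real data (downloads on first use):
# from sklearn.datasets import fetch_covtype
# X, y = fetch_covtype(return_X_y=True)
# model, accuracy = fit_baseline(X, y)
```

The accuracy printed here says nothing about Covertype itself. On the real data, training on all 581,012 rows can take a while, so subsampling first may be convenient for quick experiments.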