Scikit-Learn make_gaussian_quantiles() Dataset


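The make_gaussian_quantiles function generates a synthetic dataset suitable for classification tasks. It draws samples from a single isotropic multivariate normal distribution and assigns labels by distance from the mean: the classes are separated by nested concentric spheres (circles in two dimensions), positioned so that each class holds roughly the same number of samples.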
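To make the labelling rule concrete, the following minimal sketch checks that the classes occupy non-overlapping distance ranges around the mean. It assumes the default zero mean; the variable names (sq_dist, class_counts) are illustrative.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3, random_state=42)

# Squared distance of each sample from the (default) zero mean
sq_dist = np.sum(X ** 2, axis=1)

# Class sizes: the samples are split into near-equal groups
class_counts = np.bincount(y)
print("Samples per class:", class_counts)

# Distance range occupied by each class; class 0 is the innermost group
for label in np.unique(y):
    d = sq_dist[y == label]
    print(f"class {label}: squared distance in [{d.min():.3f}, {d.max():.3f}]")
```

Each class should cover its own band of distances, with class 0 closest to the mean and the highest label farthest away.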

Key function arguments include n_samples for the number of samples (default 100), n_features for the number of features (default 2), and n_classes for the number of classes (default 3).
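As a small illustration of how these arguments shape the output, the sketch below also sets mean and cov, two further parameters of the function that are not discussed in this article. The specific values and the names X_demo and y_demo are arbitrary choices for demonstration.

```python
from sklearn.datasets import make_gaussian_quantiles

# Four features, two classes, centered at (1, 1, 1, 1) with variance 2 per feature
X_demo, y_demo = make_gaussian_quantiles(
    mean=[1.0, 1.0, 1.0, 1.0],
    cov=2.0,
    n_samples=500,
    n_features=4,
    n_classes=2,
    random_state=0,
)

# X has one row per sample and one column per feature
print(f"X shape: {X_demo.shape}")

# y holds integer labels from 0 to n_classes - 1
print(f"Distinct labels: {sorted(set(y_demo.tolist()))}")
```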

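This is a multiclass classification problem with nonlinear, ring-shaped class boundaries. Algorithms such as K-Nearest Neighbors and Decision Trees handle it well, while a linear model like Logistic Regression typically needs nonlinear features (for example, squared terms) to separate the classes.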
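A quick way to compare these classifiers is to fit each one on a train/test split of the generated data. This is a rough sketch, not a benchmark: the split ratio, hyperparameters, and the dictionary names (models, scores) are assumptions made for illustration. Because the classes form concentric rings, the linear model is expected to trail the other two on the raw features.

```python
from sklearn.datasets import make_gaussian_quantiles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3, random_state=42)

# Hold out 25% of the samples for evaluation, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(random_state=42),
}

# Fit each model and record its accuracy on the held-out samples
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```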

from sklearn.datasets import make_gaussian_quantiles
import matplotlib.pyplot as plt
import pandas as pd

# Generate the dataset
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3, random_state=42)

# Display dataset shape and types
print(f"Dataset shape: {X.shape}")
print(f"Input feature types: {type(X)}, Output feature types: {type(y)}")

# Show summary statistics
print(f"Summary statistics:\n{pd.DataFrame(X).describe()}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{pd.DataFrame(X).head()}")

# Plot the dataset
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap=plt.cm.Paired)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Gaussian Quantiles Dataset')
plt.show()

Running the example gives an output like:

Dataset shape: (1000, 2)
Input feature types: <class 'numpy.ndarray'>, Output feature types: <class 'numpy.ndarray'>
Summary statistics:
                 0            1
count  1000.000000  1000.000000
mean      0.033186     0.056982
std       0.961603     1.014959
min      -3.241267    -2.940389
25%      -0.611581    -0.651418
50%       0.036043     0.047742
75%       0.648317     0.714886
max       3.078881     3.852731
First few rows of the dataset:
          0         1
0  1.644968 -0.249036
1  1.189470 -1.227608
2  0.069802 -0.385314
3  1.846637 -1.070085
4  0.361636 -0.645120


The steps are as follows:

1. Import make_gaussian_quantiles from sklearn.datasets, matplotlib.pyplot for plotting, and pandas for tabular summaries:
   - make_gaussian_quantiles generates the synthetic dataset with Gaussian quantile labels.
2. Generate the dataset using make_gaussian_quantiles():
   - Use n_samples to specify the number of samples (e.g., 1000).
   - Use n_features to set the number of features (e.g., 2 for easy visualization).
   - Use n_classes to specify the number of classes (e.g., 3).
   - Set random_state for reproducibility.
3. Print the dataset shape and types:
   - Access the shape using X.shape.
   - Show the container types using type(X) for the inputs and type(y) for the labels.
4. Display summary statistics:
   - Use pd.DataFrame(X).describe() to get a statistical summary of the features.
5. Display the first few rows of the dataset:
   - Print the initial rows using pd.DataFrame(X).head() to get a sense of the dataset structure and content.
6. Plot the dataset:
   - Use plt.scatter() to visualize the data points in 2D, colored by their class labels.
   - Set the axis labels and title for the plot.
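The inspection steps above look only at the features in X. As a possible extension, the sketch below puts the features and labels into one DataFrame so that class balance and per-class statistics can be checked as well. The column names feature_1, feature_2, and label are illustrative, not part of the original example.

```python
import pandas as pd
from sklearn.datasets import make_gaussian_quantiles

X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3, random_state=42)

# Combine features and labels into one table (column names are illustrative)
df = pd.DataFrame(X, columns=["feature_1", "feature_2"])
df["label"] = y

# Number of samples in each class
print(df["label"].value_counts().sort_index())

# Per-class mean and standard deviation of each feature
print(df.groupby("label").agg(["mean", "std"]))
```

The per-class means should all sit near zero while the standard deviations grow with the label, since higher labels lie farther from the center.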

This example demonstrates how to quickly generate and explore a synthetic dataset using scikit-learn’s make_gaussian_quantiles() function: you can inspect the data’s shape, types, and summary statistics, and visualize it in a scatter plot. This sets the stage for further preprocessing and application of classification algorithms.
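As one example of such preprocessing, the sketch below adds degree-2 polynomial features before a logistic regression, which should let a linear model represent the circular class boundaries. The pipeline layout, split ratio, and the names quadratic_model and quadratic_accuracy are assumptions chosen for illustration, not a recommended configuration.

```python
from sklearn.datasets import make_gaussian_quantiles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Expand (x1, x2) to include x1^2, x1*x2, and x2^2, then scale and classify
quadratic_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
quadratic_model.fit(X_train, y_train)

# Accuracy on the held-out samples
quadratic_accuracy = quadratic_model.score(X_test, y_test)
print(f"Logistic regression with quadratic features: {quadratic_accuracy:.3f}")
```

The squared terms matter because the class boundaries depend on the distance from the center, which is a linear function of x1^2 and x2^2.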

See Also