The make_gaussian_quantiles function generates a synthetic dataset suitable for classification tasks. It draws samples from a multivariate normal distribution and assigns labels based on quantiles of that distribution, so the classes form nested regions around the mean with roughly equal numbers of samples in each.
Key function arguments include n_samples to specify the number of samples, n_features for the number of features, and n_classes to determine the number of classes.
This is a multiclass classification problem where algorithms like Logistic Regression, K-Nearest Neighbors, and Decision Trees can be applied.
from sklearn.datasets import make_gaussian_quantiles
import matplotlib.pyplot as plt
import pandas as pd
# Generate the dataset
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3, random_state=42)
# Display dataset shape and types
print(f"Dataset shape: {X.shape}")
print(f"Input feature types: {type(X)}, Output feature types: {type(y)}")
# Show summary statistics
print(f"Summary statistics:\n{pd.DataFrame(X).describe()}")
# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{pd.DataFrame(X).head()}")
# Plot the dataset
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap=plt.cm.Paired)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Gaussian Quantiles Dataset')
plt.show()
Running the example gives an output like:
Dataset shape: (1000, 2)
Input feature types: <class 'numpy.ndarray'>, Output feature types: <class 'numpy.ndarray'>
Summary statistics:
0 1
count 1000.000000 1000.000000
mean 0.033186 0.056982
std 0.961603 1.014959
min -3.241267 -2.940389
25% -0.611581 -0.651418
50% 0.036043 0.047742
75% 0.648317 0.714886
max 3.078881 3.852731
First few rows of the dataset:
0 1
0 1.644968 -0.249036
1 1.189470 -1.227608
2 0.069802 -0.385314
3 1.846637 -1.070085
4 0.361636 -0.645120
The steps are as follows:
1. Import the make_gaussian_quantiles function from sklearn.datasets, matplotlib.pyplot for plotting, and pandas for the summary tables:
   - This function allows us to generate a synthetic dataset with Gaussian quantiles.
2. Generate the dataset using make_gaussian_quantiles():
   - Use n_samples to specify the number of samples (e.g., 1000).
   - Use n_features to determine the number of features (e.g., 2 for easy visualization).
   - Use n_classes to specify the number of classes (e.g., 3).
   - Set random_state for reproducibility.
3. Print the dataset shape and types:
   - Access the shape using X.shape.
   - Show the data types using type(X) for input and type(y) for output.
4. Display summary statistics:
   - Use pd.DataFrame(X).describe() to get a statistical summary of the dataset.
5. Display the first few rows of the dataset:
   - Print the initial rows using pd.DataFrame(X).head() to get a sense of the dataset structure and content.
6. Plot the dataset:
   - Use plt.scatter() to visualize the data points in a 2D space, colored by their class labels.
   - Set labels and title for the plot.
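To see how the labels relate to the quantiles of the distribution, it can help to count the samples per class and compare each class's distance from the origin. The following is a minimal sketch, assuming the default mean (the zero vector) so that the Euclidean norm of each row is its distance from the centre; the variable names class_counts and radius are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

# Same settings as the main example
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3, random_state=42)

# Count how many samples fall into each class
labels, class_counts = np.unique(y, return_counts=True)
print(f"Samples per class: {dict(zip(labels.tolist(), class_counts.tolist()))}")

# Distance of each sample from the origin (assumes the default zero mean)
radius = np.linalg.norm(X, axis=1)

# Report the range of distances covered by each class
for label in labels:
    r = radius[y == label]
    print(f"Class {label}: distance from {r.min():.3f} to {r.max():.3f}")
```

The class counts should be nearly equal, and the distance ranges should not overlap: class 0 sits closest to the centre, with each later class forming a ring further out.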
This example demonstrates how to quickly generate and explore a synthetic dataset using scikit-learn’s make_gaussian_quantiles() function: you can inspect the data’s shape, types, and summary statistics, and visualize it. This sets the stage for further preprocessing and application of classification algorithms.
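As a possible next step, the classifiers mentioned earlier can be fitted to the generated data. This is a rough sketch rather than a tuned benchmark: the 80/20 split, the default hyperparameters, and the models and scores dictionaries are all assumptions made for illustration.

```python
from sklearn.datasets import make_gaussian_quantiles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Generate the same dataset as in the main example
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3, random_state=42)

# Hold out 20% of the samples for evaluation, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Candidate models, all with default settings (illustrative choice)
models = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Fit each model and record its accuracy on the held-out data
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Because the classes are nested rather than linearly separable, a linear model such as Logistic Regression is likely to score noticeably lower on the raw features than K-Nearest Neighbors or a Decision Tree.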