Scikit-Learn ShuffleSplit Data Splitting

ShuffleSplit is a cross-validation technique for randomly splitting a dataset into train and test sets. It allows specifying the test set size and the number of splitting iterations.

ShuffleSplit is appropriate for evaluating the performance of machine learning models when the samples are independent and their order carries no meaning. It is a poor fit for time series or other ordered data, because shuffling discards the ordering.

from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit

# generate binary classification dataset
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=1)

# create shuffle split
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=1)

# enumerate splits
for train_index, test_index in ss.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print('train: %s, test: %s' % (X_train.shape, X_test.shape))

Running the example gives an output like:

train: (80, 5), test: (20, 5)
train: (80, 5), test: (20, 5)
train: (80, 5), test: (20, 5)
train: (80, 5), test: (20, 5)
train: (80, 5), test: (20, 5)

The steps in this example are:

First, a synthetic binary classification dataset is generated using the make_classification() function.
Next, a ShuffleSplit object is created, specifying the desired number of splits (n_splits) and the size of the test set (test_size). Setting a random_state ensures reproducibility.
The split() method of the ShuffleSplit object is then used to iterate over the splits. Each iteration yields a pair of index arrays, one for the training set and one for the test set, which are used to extract the corresponding rows of X and y. The shapes of the resulting X_train and X_test are printed, confirming the sizes of the train and test sets for each split (80 and 20 rows, since test_size=0.2 is applied to 100 samples).

This example demonstrates how to use ShuffleSplit for randomly splitting a dataset into train and test sets, which is useful for evaluating the performance of machine learning models. The test_size and n_splits parameters provide control over the size of the test set and the number of times the dataset is split, respectively.
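
To make the link to model evaluation concrete, a ShuffleSplit object can be passed as the cv argument of cross_val_score(). The sketch below reuses the dataset and splitter settings from the example above. The choice of LogisticRegression and of accuracy as the metric are assumptions made for illustration, not something the original example prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# same synthetic dataset and splitter settings as the example above
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=1)
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=1)

# illustrative model choice; any scikit-learn estimator could be used here
model = LogisticRegression()

# fit and score the model once per split, giving one accuracy value per split
scores = cross_val_score(model, X, y, cv=ss, scoring='accuracy')
print('scores: %s' % scores)
print('mean accuracy: %.3f' % scores.mean())
```

Increasing n_splits gives more scores to average at the cost of more model fits, while test_size trades training data for a larger held-out set in each split.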

See Also