
Scikit-Learn train_test_split() Data Splitting

Splitting a dataset into train and test sets is a crucial step in evaluating machine learning models. It allows you to train the model on a portion of the data and test its performance on unseen data.

The train_test_split function in scikit-learn provides an easy way to perform this split for both classification and regression datasets.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# generate a synthetic binary classification dataset (n_features defaults to 20)
X, y = make_classification(n_samples=100, n_classes=2, random_state=1)

# split into train and test sets (30% held out for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# summarize the split dataset
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example prints the shapes of the train and test sets, confirming the number of rows and columns in each.

(70, 20) (30, 20) (70,) (30,)

The steps are as follows:

  1. First, a synthetic binary classification dataset is generated using the make_classification() function.

  2. The dataset is split into train and test sets using the train_test_split() function. The test_size parameter is set to 0.3, indicating that 30% of the data should be reserved for the test set. The random_state parameter is fixed so that the same split is obtained each time the code is run (demonstrated in the sketch after this list).

  3. Finally, the shape of the resulting train and test sets is printed, confirming that the split was performed correctly.
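
Because random_state is fixed, repeating the split produces exactly the same train and test sets. The short sketch below is one way to verify this on the same synthetic dataset; the suffixed variable names (X_train_a, X_train_b, and so on) are purely illustrative.

from numpy import array_equal
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# generate the same synthetic dataset
X, y = make_classification(n_samples=100, n_classes=2, random_state=1)

# perform the split twice with the same random_state
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, test_size=0.3, random_state=1)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, test_size=0.3, random_state=1)

# identical random_state values produce identical splits
print(array_equal(X_train_a, X_train_b), array_equal(y_test_a, y_test_b))

Running this sketch prints True True, confirming that the split is reproducible.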

The 70-30 split used here is a common choice, but the best split ratio will depend on the size of the dataset and the specific requirements of the project. It’s important to have enough data in the train set to build a good model, while also having enough data in the test set to get a reliable estimate of model performance.
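
As noted above, the same function applies to regression datasets without any changes. The sketch below assumes a synthetic dataset from make_regression() and uses test_size=0.2 to illustrate a different split ratio; everything else follows the same pattern.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# generate a synthetic regression dataset with 20 input features
X, y = make_regression(n_samples=100, n_features=20, random_state=1)

# split into train and test sets, holding out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# summarize the split dataset
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

With 100 samples and test_size=0.2, this prints (80, 20) (20, 20) (80,) (20,), an 80-20 split.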


