TimeSeriesSplit is a cross-validation technique designed for time series data. It splits the data into train and test sets while preserving the temporal order, which is crucial for evaluating machine learning models on time series problems.
The key hyperparameter of TimeSeriesSplit
is n_splits
, which determines the number of splitting iterations. The max_train_size
parameter can be used to limit the size of the train set.
This cross-validation method is appropriate for regression and classification problems involving time series data.
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# generate synthetic time series dataset
X = np.array([[i] for i in range(10)])
y = np.array([i + np.random.randn() for i in range(10)])
# create TimeSeriesSplit object
tscv = TimeSeriesSplit(n_splits=3)
# loop through each split
for train_index, test_index in tscv.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# create and fit model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate model
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
# print results
print(f"Train: {train_index}, Test: {test_index}")
print(f"Test MSE: {mse:.3f}")
Running the example gives an output like:
Train: [0 1 2 3], Test: [4 5]
Test MSE: 3.795
Train: [0 1 2 3 4 5], Test: [6 7]
Test MSE: 1.506
Train: [0 1 2 3 4 5 6 7], Test: [8 9]
Test MSE: 1.719
The steps are as follows:
First, a synthetic time series dataset is generated using NumPy. The
X
variable represents the time steps, andy
represents the corresponding values with added random noise.A
TimeSeriesSplit
object is created withn_splits=3
, which means the data will be split into three train and test sets.The code then loops through each split using the
split()
method of theTimeSeriesSplit
object. For each iteration, thetrain_index
andtest_index
are used to splitX
andy
into train and test sets.A linear regression model is created and fit on the train set using
fit()
. The model’s performance is then evaluated on the test set usingpredict()
andmean_squared_error()
.Finally, the train and test indices and the test mean squared error are printed for each split.
This example demonstrates how to use TimeSeriesSplit
for cross-validation on time series data. By preserving the temporal order of the data during splitting, it provides a more reliable evaluation of the model’s performance compared to traditional cross-validation methods.