Configuring the cache_size parameter in scikit-learn's SVR
The cache_size parameter in scikit-learn's SVR (Support Vector Regression) controls the size of the kernel cache, which is used to speed up kernel computations during training.
Support Vector Regression is a regression algorithm that fits a function in a (possibly high-dimensional) kernel feature space, tolerating errors smaller than a margin epsilon while keeping the function as flat as possible.
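As a minimal illustration of this idea (on made-up 1-D data), the epsilon parameter defines a tube around the targets; only samples outside the tube become support vectors:

```python
import numpy as np
from sklearn.svm import SVR

# Simple 1-D regression problem: y = sin(x) plus a little noise
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.05 * rng.randn(80)

# epsilon sets the tube width: errors smaller than epsilon are ignored
svr = SVR(kernel='rbf', C=10.0, epsilon=0.1)
svr.fit(X, y)

# Samples fitted inside the tube do not become support vectors
print(f"support vectors: {len(svr.support_)} of {len(X)} samples")
```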
The cache_size parameter specifies the size of the kernel cache in MB. A larger cache allows for faster training, especially when dealing with large datasets, but it also consumes more memory.
The default value for cache_size is 200 MB.
In practice, values between 100 and 1000 are commonly used depending on the size of the dataset and available system memory.
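You can confirm the default directly from the estimator, since the cache size is exposed like any other hyperparameter:

```python
from sklearn.svm import SVR

# cache_size appears alongside the other SVR hyperparameters
print(SVR().get_params()['cache_size'])  # 200 (MB) by default
```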
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from time import time

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=100, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different cache_size values
cache_size_values = [100, 200, 500, 1000]
mse_scores = []
train_times = []

for cache_size in cache_size_values:
    start_time = time()
    svr = SVR(kernel='rbf', cache_size=cache_size)
    svr.fit(X_train, y_train)
    train_time = time() - start_time
    train_times.append(train_time)

    y_pred = svr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_scores.append(mse)

    print(f"cache_size={cache_size} MB, Train time: {train_time:.2f}s, MSE: {mse:.3f}")
```
Running the example gives an output like:

```
cache_size=100 MB, Train time: 4.51s, MSE: 30732.442
cache_size=200 MB, Train time: 4.43s, MSE: 30732.442
cache_size=500 MB, Train time: 4.26s, MSE: 30732.442
cache_size=1000 MB, Train time: 5.16s, MSE: 30732.442
```

Note that the MSE is identical for every cache size: cache_size affects only training speed and memory use, never the fitted model or its predictions.
The key steps in this example are:

- Generate a synthetic regression dataset with a large number of samples and features
- Split the data into train and test sets
- Train SVR models with different cache_size values
- Evaluate the mean squared error and training time for each model
Some tips and heuristics for setting cache_size:

- Increase cache_size if you have a large dataset and sufficient memory, to speed up training
- Decrease cache_size if you are running into memory issues during training
- The optimal value depends on the size of your dataset and available system memory
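One rough rule of thumb (an illustrative assumption, not an official scikit-learn guideline): the full kernel matrix for n samples takes n² × 8 bytes as float64, so a cache at least that large avoids recomputing kernel entries entirely, while a cap keeps memory use bounded:

```python
def suggest_cache_size(n_samples, max_mb=1000):
    """Illustrative heuristic: a cache_size (MB) large enough to hold
    the full n x n kernel matrix (float64), capped at max_mb and
    floored at 100 MB."""
    full_matrix_mb = n_samples ** 2 * 8 / 1e6  # 8 bytes per float64 entry
    return min(max_mb, max(100, round(full_matrix_mb)))

# 10,000 samples -> full kernel matrix is ~800 MB
print(suggest_cache_size(10_000))  # 800
```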
Issues to consider:

- Setting cache_size too low can lead to longer training times
- Setting cache_size too high can cause out-of-memory errors if your system doesn’t have enough RAM
- The impact of cache_size on training time is more significant for larger datasets