The warm_start parameter in scikit-learn's Lasso class allows reusing the solution from the previous call to fit() as the initialization for the next fit() call. This can speed up convergence when the model is refitted repeatedly on similar problems, such as after appending new data or while stepping through a sequence of regularization strengths.
Lasso (Least Absolute Shrinkage and Selection Operator) is a linear regression algorithm that performs L1 regularization, which adds a penalty term to the loss function to encourage sparse solutions (i.e., many coefficients set to zero). This makes Lasso useful for feature selection and creating interpretable models.
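To make the sparsity concrete, here is a minimal sketch (dataset size and alpha chosen arbitrarily for illustration) in which most features have no true effect, so Lasso should zero out many coefficients:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 5 of 20 features actually influence y, so the L1 penalty should
# drive most of the remaining coefficients exactly to zero
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=0.1, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))
print(f"{n_zero} of {lasso.coef_.size} coefficients are exactly zero")
```

Inspecting which entries of coef_ are nonzero is the usual way to read off the selected features.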
The warm_start parameter is a boolean that defaults to False. When set to True, each subsequent call to fit() continues the optimization from the coefficients found by the previous fit() call rather than starting from scratch. Note that this is not out-of-core learning: every fit() call still needs its full training set in memory; the benefit is faster convergence when the new problem is close to the previous one.
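A canonical use case is refitting the same model as a hyperparameter changes. A minimal sketch (alpha schedule chosen arbitrarily) that steps through decreasing alpha values, with each fit starting from the previous solution:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=50, noise=0.1, random_state=0)

lasso = Lasso(warm_start=True, max_iter=10000)
for alpha in [10.0, 1.0, 0.1]:
    lasso.set_params(alpha=alpha)
    lasso.fit(X, y)  # starts from the coefficients of the previous fit
    print(f"alpha={alpha}: converged in {lasso.n_iter_} iterations")
```

This is the same warm-start trick that regularization-path routines use internally: neighboring alpha values have similar solutions, so each fit begins close to its optimum.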
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
import numpy as np
# Generate large synthetic dataset
X, y = make_regression(n_samples=100000, n_features=1000, noise=0.1, random_state=42)
# Split into initial train set and additional batch
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with warm_start=False
lr = Lasso(warm_start=False, random_state=42)
lr.fit(X_train, y_train)
y_pred_false = lr.predict(X_new)
r2_false = r2_score(y_new, y_pred_false)
print(f"R2 with warm_start=False: {r2_false:.3f}")
# Refit on the combined data with warm_start=True, starting from the previous solution
X_combined = np.concatenate((X_train, X_new))
y_combined = np.concatenate((y_train, y_new))
lr.set_params(warm_start=True)
lr.fit(X_combined, y_combined)
y_pred_true = lr.predict(X_new)  # note: X_new is now part of the training data
r2_true = r2_score(y_new, y_pred_true)
print(f"R2 with warm_start=True: {r2_true:.3f}")
Running the example gives an output like:
R2 with warm_start=False: 1.000
R2 with warm_start=True: 1.000
The code above:
- Generates a large synthetic regression dataset with 100,000 samples and 1,000 features
- Splits the data into an initial training set and an additional batch
- Fits a Lasso model with warm_start=False on the initial training set and evaluates it
- Combines the original and new data and refits with warm_start=True, using the previous solution as the starting point
Some tips and heuristics for using warm_start:
- warm_start=True is most beneficial when the same model is refitted many times on similar problems, such as after appending new data or while sweeping over alpha values
- Each call to fit() can include newly collected data, and the solver continues from the previous coefficients
- Convergence can be significantly faster than fitting from scratch each time, with the speedup depending on how much the problem changed between fits
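To see the convergence benefit when data is appended, the following sketch (sizes chosen arbitrarily) compares solver iterations for a cold fit on the combined data against a warm refit; the warm refit usually needs fewer iterations, though this is not guaranteed:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=5000, n_features=200, noise=0.1, random_state=0)
X_old, y_old = X[:4000], y[:4000]  # initial batch
X_all, y_all = X, y                # initial batch plus the new data

# Cold start: solve the combined problem from scratch
cold = Lasso(alpha=1.0, max_iter=10000).fit(X_all, y_all)

# Warm start: fit the initial batch, then refit on the combined data
warm = Lasso(alpha=1.0, warm_start=True, max_iter=10000)
warm.fit(X_old, y_old)
warm.fit(X_all, y_all)  # resumes from the previous coefficients

print(f"cold refit: {cold.n_iter_} iterations, warm refit: {warm.n_iter_} iterations")
```

The n_iter_ attribute reports how many coordinate-descent passes the solver needed, which makes the comparison easy to check on your own data.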
Issues to consider:
- Data scaling and regularization strength (alpha) should be kept consistent across fit() calls
- Because warm_start only changes the solver's starting point and the Lasso objective is convex, a fully converged warm refit matches a cold fit up to the solver's tolerance; loose tolerances or hitting max_iter can leave small differences
- Repeatedly calling fit() with warm_start=True over many refits can accumulate small numerical differences, so occasionally refitting from scratch is a reasonable safeguard