The warm_start parameter in scikit-learn's Lasso class allows reusing the solution from the previous call to fit() as the initialization for the next fit() call. This can speed up convergence when the model is refitted repeatedly on similar problems, such as after appending new data or while stepping through a sequence of regularization strengths.
Lasso (Least Absolute Shrinkage and Selection Operator) is a linear regression algorithm that performs L1 regularization, which adds a penalty term to the loss function to encourage sparse solutions (i.e., many coefficients set to zero). This makes Lasso useful for feature selection and creating interpretable models.
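To make the sparsity concrete, here is a minimal sketch (dataset size and alpha chosen arbitrarily for illustration) in which most features have no true effect, so Lasso should zero out many coefficients:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Only 5 of 20 features actually influence y, so the L1 penalty should
# drive most of the remaining coefficients exactly to zero
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=0.1, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
n_zero = int(np.sum(lasso.coef_ == 0))
print(f"{n_zero} of {lasso.coef_.size} coefficients are exactly zero")
```

Inspecting which entries of coef_ are nonzero is the usual way to read off the selected features.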
The warm_start parameter is a boolean that defaults to False. When set to True, each subsequent call to fit() continues the optimization from the coefficients found by the previous fit() call rather than starting from scratch. Note that this is not out-of-core learning: every fit() call still needs its full training set in memory; the benefit is faster convergence when the new problem is close to the previous one.
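A canonical use case is refitting the same model as a hyperparameter changes. A minimal sketch (alpha schedule chosen arbitrarily) that steps through decreasing alpha values, with each fit starting from the previous solution:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=500, n_features=50, noise=0.1, random_state=0)

lasso = Lasso(warm_start=True, max_iter=10000)
for alpha in [10.0, 1.0, 0.1]:
    lasso.set_params(alpha=alpha)
    lasso.fit(X, y)  # starts from the coefficients of the previous fit
    print(f"alpha={alpha}: converged in {lasso.n_iter_} iterations")
```

This is the same warm-start trick that regularization-path routines use internally: neighboring alpha values have similar solutions, so each fit begins close to its optimum.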
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
import numpy as np
# Generate large synthetic dataset
X, y = make_regression(n_samples=100000, n_features=1000, noise=0.1, random_state=42)
# Split into initial train set and additional batch
X_train, X_new, y_train, y_new = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with warm_start=False
lr = Lasso(warm_start=False, random_state=42)
lr.fit(X_train, y_train)
y_pred_false = lr.predict(X_new)
r2_false = r2_score(y_new, y_pred_false)
print(f"R2 with warm_start=False: {r2_false:.3f}")
# Refit on the combined data with warm_start=True, starting from the previous solution
X_combined = np.concatenate((X_train, X_new))
y_combined = np.concatenate((y_train, y_new))
lr.set_params(warm_start=True)
lr.fit(X_combined, y_combined)
y_pred_true = lr.predict(X_new)  # note: X_new is now part of the training data
r2_true = r2_score(y_new, y_pred_true)
print(f"R2 with warm_start=True: {r2_true:.3f}")
Running the example gives an output like:
R2 with warm_start=False: 1.000
R2 with warm_start=True: 1.000
The code above:
- Generates a large synthetic regression dataset with 100,000 samples and 1,000 features
- Splits the data into an initial training set and an additional batch
- Fits a Lasso model with warm_start=False on the initial training set and evaluates it
- Combines the original and new data and refits with warm_start=True, using the previous solution as the starting point
Some tips and heuristics for using warm_start:
- warm_start=True is most beneficial when the same model is refitted many times on similar problems, such as after appending new data or while sweeping over alpha values
- Each call to fit() can include newly collected data, and the solver continues from the previous coefficients
- Convergence can be significantly faster than fitting from scratch each time, with the speedup depending on how much the problem changed between fits
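To see the convergence benefit when data is appended, the following sketch (sizes chosen arbitrarily) compares solver iterations for a cold fit on the combined data against a warm refit; the warm refit usually needs fewer iterations, though this is not guaranteed:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=5000, n_features=200, noise=0.1, random_state=0)
X_old, y_old = X[:4000], y[:4000]  # initial batch
X_all, y_all = X, y                # initial batch plus the new data

# Cold start: solve the combined problem from scratch
cold = Lasso(alpha=1.0, max_iter=10000).fit(X_all, y_all)

# Warm start: fit the initial batch, then refit on the combined data
warm = Lasso(alpha=1.0, warm_start=True, max_iter=10000)
warm.fit(X_old, y_old)
warm.fit(X_all, y_all)  # resumes from the previous coefficients

print(f"cold refit: {cold.n_iter_} iterations, warm refit: {warm.n_iter_} iterations")
```

The n_iter_ attribute reports how many coordinate-descent passes the solver needed, which makes the comparison easy to check on your own data.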
Issues to consider:
- Data scaling and regularization strength (alpha) should be kept consistent across fit() calls
- Because warm_start only changes the solver's starting point and the Lasso objective is convex, a fully converged warm refit matches a cold fit up to the solver's tolerance; loose tolerances or hitting max_iter can leave small differences
- Repeatedly calling fit() with warm_start=True over many refits can accumulate small numerical differences, so occasionally refitting from scratch is a reasonable safeguard