The fit_intercept parameter in scikit-learn’s LinearRegression determines whether to calculate the intercept for the linear model.
When fit_intercept is set to True (default), the model estimates an intercept alongside the coefficients, so the best-fitting line is free to cross the y-axis at any value. If set to False, no intercept is estimated and the line is forced to pass through the origin, which can be useful in certain scenarios.
The default value for fit_intercept is True, as most linear regression models benefit from having an intercept term. However, there are cases where setting it to False might be appropriate, such as when both the features and the target have already been centered to zero mean, or when the relationship is known to pass through the origin.
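To make the difference concrete, here is a minimal sketch on toy data that lies exactly on the line y = 2x + 5. The data values and the variable names `with_intercept` and `no_intercept` are assumptions chosen for illustration, not part of the example that follows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on y = 2x + 5 (illustrative values, not real data)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 5.0

with_intercept = LinearRegression(fit_intercept=True).fit(X, y)
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

# With an intercept, both the slope (2) and the offset (5) are recovered
print("fit_intercept=True :", with_intercept.coef_, with_intercept.intercept_)

# Without an intercept, intercept_ is fixed at 0.0 and the slope is
# inflated to compensate for the missing offset
print("fit_intercept=False:", no_intercept.coef_, no_intercept.intercept_)
```

In this sketch the no-intercept slope comes out at about 3.67 rather than 2, because the only way a line through the origin can approximate these points is by tilting upward.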
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with fit_intercept=True
lr_true = LinearRegression(fit_intercept=True)
lr_true.fit(X_train, y_train)
y_pred_true = lr_true.predict(X_test)
mse_true = mean_squared_error(y_test, y_pred_true)
print(f"fit_intercept=True, Coefficients: {lr_true.coef_}, Intercept: {lr_true.intercept_}, MSE: {mse_true:.2f}")
# Train with fit_intercept=False
lr_false = LinearRegression(fit_intercept=False)
lr_false.fit(X_train, y_train)
y_pred_false = lr_false.predict(X_test)
mse_false = mean_squared_error(y_test, y_pred_false)
print(f"fit_intercept=False, Coefficients: {lr_false.coef_}, MSE: {mse_false:.2f}")
The output of running this example would look like:
fit_intercept=True, Coefficients: [46.747264], Intercept: 0.19844442845175525, MSE: 416.81
fit_intercept=False, Coefficients: [46.71666433], MSE: 421.03
The key steps in this example are:
- Generate a synthetic regression dataset with a single feature
- Split the data into train and test sets
- Train LinearRegression models with fit_intercept set to True and False
- Evaluate the mean squared error of each model on the test set
Tips and heuristics for setting fit_intercept:
- Include the intercept term unless there is a specific reason not to
- Excluding the intercept can be appropriate when both the features and the target are already centered (zero mean)
- Models without an intercept term may have worse performance if the true relationship has a non-zero intercept
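The centering tip can be checked with a short sketch. It assumes synthetic data with a true offset of 50 (an arbitrary choice) and centers the features and the target by hand. The names `ref`, `centered`, `X_c`, and `y_c` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a non-zero offset (slope 3, offset 50 are assumptions)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + 50.0 + rng.normal(scale=2.0, size=200)

# Reference fit: let the model estimate the intercept on the raw data
ref = LinearRegression(fit_intercept=True).fit(X, y)

# Center features and target manually, then fit without an intercept
X_c = X - X.mean(axis=0)
y_c = y - y.mean()
centered = LinearRegression(fit_intercept=False).fit(X_c, y_c)

# Both fits should yield the same slope
print("slope with intercept:       ", ref.coef_)
print("slope on centered data only:", centered.coef_)
```

Note that "centered" here means the target as well as the features. A model trained this way also expects new inputs to be shifted by the training mean of X, and its predictions need the training mean of y added back.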
Issues to consider:
- Setting fit_intercept to False forces the model to pass through the origin, which can impact model performance and interpretation
- Excluding the intercept may introduce bias if the true relationship has a non-zero intercept
- The intercept term can help capture the overall level of the response variable
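One way to see the bias mentioned above is to look at the average training residual, as in the following sketch. It assumes synthetic data with a true offset of 50. The dictionary `mean_residual` is a hypothetical helper for collecting results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data whose true relationship has a non-zero intercept (assumed values)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel() + 50.0 + rng.normal(scale=2.0, size=200)

mean_residual = {}
for flag in (True, False):
    model = LinearRegression(fit_intercept=flag).fit(X, y)
    residuals = y - model.predict(X)
    # Average signed error on the training data
    mean_residual[flag] = residuals.mean()
    print(f"fit_intercept={flag}: mean residual = {mean_residual[flag]:.4f}")
```

With an intercept, the training residuals average out to essentially zero, because the intercept absorbs the overall level of the response. Without one, the residuals in this sketch keep a clearly non-zero mean, which signals systematically biased predictions.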