The fit_intercept parameter in scikit-learn’s LinearRegression determines whether to calculate the intercept for the linear model.
When fit_intercept is set to True (the default), the model estimates an intercept alongside the coefficients, so the best-fitting line is free to cross the y-axis at any value. If set to False, no intercept is estimated and the line is forced to pass through the origin, which can be useful in certain scenarios.
The default value for fit_intercept is True, as most linear regression models benefit from having an intercept term. However, there are cases where setting it to False might be appropriate, such as when both the features and the target are already centered around zero.
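As a rough illustration of the centered-data case, the sketch below subtracts the sample means from a synthetic feature and target, then fits both settings. The data-generating formula, the seed, and the variable names are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a clearly non-zero offset (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 1))
y = 3.0 * X[:, 0] + 10.0 + rng.normal(scale=1.0, size=200)

# Center both the feature and the target on their sample means
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()

with_intercept = LinearRegression(fit_intercept=True).fit(X_centered, y_centered)
without_intercept = LinearRegression(fit_intercept=False).fit(X_centered, y_centered)

# The slopes should agree, and the estimated intercept should be ~0
print("fit_intercept=True: ", with_intercept.coef_, with_intercept.intercept_)
print("fit_intercept=False:", without_intercept.coef_, without_intercept.intercept_)
```

On data centered this way, the two models should produce the same slope up to floating-point error, which is why dropping the intercept costs nothing here.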
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Generate synthetic dataset
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with fit_intercept=True
lr_true = LinearRegression(fit_intercept=True)
lr_true.fit(X_train, y_train)
y_pred_true = lr_true.predict(X_test)
mse_true = mean_squared_error(y_test, y_pred_true)
print(f"fit_intercept=True, Coefficients: {lr_true.coef_}, Intercept: {lr_true.intercept_}, MSE: {mse_true:.2f}")
# Train with fit_intercept=False
lr_false = LinearRegression(fit_intercept=False)
lr_false.fit(X_train, y_train)
y_pred_false = lr_false.predict(X_test)
mse_false = mean_squared_error(y_test, y_pred_false)
print(f"fit_intercept=False, Coefficients: {lr_false.coef_}, MSE: {mse_false:.2f}")
The output of running this example would look like:
fit_intercept=True, Coefficients: [46.747264], Intercept: 0.19844442845175525, MSE: 416.81
fit_intercept=False, Coefficients: [46.71666433], MSE: 421.03
The key steps in this example are:
- Generate a synthetic regression dataset with a single feature
- Split the data into train and test sets
- Train LinearRegression models with fit_intercept set to True and False
- Evaluate the mean squared error of each model on the test set
Tips and heuristics for setting fit_intercept:
- Include the intercept term unless there is a specific reason not to
- Excluding the intercept can be appropriate when both the features and the target are already centered around zero
- Models without an intercept term may have worse performance if the true relationship has a non-zero intercept
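To make the last tip concrete, here is a minimal sketch on synthetic data whose true relationship has a large offset. The formula y = 2x + 50 plus noise, the seed, and the simple index-based split are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: true slope 2, true intercept 50 (illustrative assumption)
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(300, 1))
y = 2.0 * X[:, 0] + 50.0 + rng.normal(scale=1.0, size=300)

# Simple hold-out split: first 240 rows for training, the rest for testing
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

lr_with = LinearRegression(fit_intercept=True).fit(X_train, y_train)
lr_without = LinearRegression(fit_intercept=False).fit(X_train, y_train)

mse_with = mean_squared_error(y_test, lr_with.predict(X_test))
mse_without = mean_squared_error(y_test, lr_without.predict(X_test))

print(f"with intercept:    slope={lr_with.coef_[0]:.2f}, MSE={mse_with:.2f}")
print(f"without intercept: slope={lr_without.coef_[0]:.2f}, MSE={mse_without:.2f}")
```

Without an intercept, the model can only compensate for the offset by inflating the slope, so its test error should be far higher than that of the model with an intercept.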
Issues to consider:
- Setting fit_intercept to False forces the model to pass through the origin, which can impact model performance and interpretation
- Excluding the intercept may introduce bias if the true relationship has a non-zero intercept
- The intercept term can help capture the overall level of the response variable
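One way to see how the intercept captures the overall level of the response: in an ordinary least squares fit with an intercept, the fitted line passes through the point of means, so the intercept equals the mean of y minus the coefficients applied to the feature means. The sketch below checks this on synthetic data. The data, the seed, and the names implied_intercept and prediction_at_mean are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic two-feature data with an overall level of about 20 (illustrative assumption)
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 20.0 + rng.normal(scale=0.5, size=150)

model = LinearRegression(fit_intercept=True).fit(X, y)

# Intercept implied by the means of the data and the fitted coefficients
implied_intercept = y.mean() - X.mean(axis=0) @ model.coef_

# Prediction at the mean feature vector should match the mean response
prediction_at_mean = model.predict(X.mean(axis=0, keepdims=True))[0]

print("fitted intercept:  ", model.intercept_)
print("implied intercept: ", implied_intercept)
print("prediction at mean:", prediction_at_mean, "mean of y:", y.mean())
```

With fit_intercept=False this anchoring to the mean response is lost, which is one reason the coefficients become harder to interpret.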