
Scikit-Learn mean_squared_error() Metric

Mean Squared Error (MSE) is a common metric for evaluating regression models.

It measures the average of the squares of the errors between predicted and actual values.

It is calculated as the mean of the squared differences between the predicted and actual values, so larger errors are penalized much more heavily than smaller ones.
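
To make that concrete, here is a minimal sketch (with made-up toy values) that computes the mean of the squared differences by hand with NumPy and checks that it matches mean_squared_error():

import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up toy values for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE by hand: mean of squared differences
manual_mse = np.mean((y_true - y_pred) ** 2)

# Same value from scikit-learn
sklearn_mse = mean_squared_error(y_true, y_pred)

print(manual_mse)   # 0.375
print(sklearn_mse)  # 0.375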

Lower MSE values indicate better model performance, while higher values suggest poorer performance.

MSE is primarily used for regression problems, not classification. However, it has limitations: because errors are squared, it is sensitive to outliers, and its value is expressed in squared units of the target variable, which makes it harder to interpret than the original data units.
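
The short sketch below (again with made-up values) illustrates both points: a single large miss dominates the MSE because errors are squared, and taking the square root (RMSE) brings the metric back to the target's original units:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred_close = np.array([10.5, 11.5, 11.0, 13.5, 12.0])
y_pred_outlier = np.array([10.5, 11.5, 11.0, 13.5, 22.0])  # one large miss

print(mean_squared_error(y_true, y_pred_close))    # small MSE
print(mean_squared_error(y_true, y_pred_outlier))  # dominated by the single outlier

# RMSE is in the same units as y, which is often easier to interpret
rmse = np.sqrt(mean_squared_error(y_true, y_pred_outlier))
print(rmse)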

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=1, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

Running the example gives an output like:

Mean Squared Error: 0.01

The steps are as follows:

  1. Generate a synthetic regression dataset using make_regression().
  2. Split the dataset into training and test sets using train_test_split().
  3. Train a LinearRegression model on the training set.
  4. Use the trained model to make predictions on the test set with predict().
  5. Calculate the mean squared error of the predictions using mean_squared_error() by comparing the predicted values to the actual values.

First, we generate a synthetic regression dataset using the make_regression() function from scikit-learn. This function creates a dataset with 1000 samples and 1 feature, allowing us to simulate a regression problem without using real-world data.
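
If it helps to see exactly what that call returns, the quick check below prints the shapes of the generated arrays (X is a 2-D feature matrix, y is a 1-D target vector, and noise adds Gaussian noise to the target):

from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=1, noise=0.1, random_state=42)
print(X.shape)  # (1000, 1)
print(y.shape)  # (1000,)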

Next, we split the dataset into training and test sets using the train_test_split() function. This step is crucial for evaluating the performance of our model on unseen data. We use 80% of the data for training and reserve 20% for testing.
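
Continuing from the dataset generated above, a quick check confirms the 80/20 split; random_state=42 makes the split reproducible across runs:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 800 200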

With our data prepared, we train a linear regression model using the LinearRegression class from scikit-learn. The fit() method is called on the model object, passing in the training features (X_train) and target values (y_train) to learn the underlying patterns in the data.
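
Continuing the example above, the fitted model exposes the parameters it learned; with a single feature the model is simply coef_[0] * x + intercept_:

# Inspect the parameters learned by fit()
print(model.coef_)       # learned slope, one value per feature
print(model.intercept_)  # learned bias term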

After training, we use the trained model to make predictions on the test set by calling the predict() method with X_test. This generates predicted values for each sample in the test set.
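
Continuing the example above, it can be useful to eyeball a few predictions next to the actual test values before computing any metric:

# Compare the first few predictions with the actual values
for actual, predicted in zip(y_test[:3], y_pred[:3]):
    print(f"actual={actual:.2f}  predicted={predicted:.2f}")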

Finally, we evaluate the mean squared error of our model using the mean_squared_error() function. This function takes the true values (y_test) and the predicted values (y_pred) as input and calculates the average of the squared differences between them. The resulting mean squared error is printed, giving us a quantitative measure of our model’s performance.
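
Continuing the example above, mean_squared_error() also accepts a sample_weight array to weight individual errors and a multioutput argument for multi-target regression; with a single 1-D target, both calls below reduce to the same plain MSE:

import numpy as np
from sklearn.metrics import mean_squared_error

# Uniform weights: equivalent to the unweighted MSE
weights = np.ones_like(y_test)
print(mean_squared_error(y_test, y_pred, sample_weight=weights))

# raw_values returns one MSE per target column (here just one)
print(mean_squared_error(y_test, y_pred, multioutput='raw_values'))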

This example demonstrates how to use the mean_squared_error() function from scikit-learn to evaluate the performance of a regression model.



See Also