Scikit-Learn mean_squared_log_error() Metric

Mean squared log error (MSLE) is a metric used to evaluate the performance of regression models. It measures the average of the squared logarithmic differences between actual and predicted values. This metric is useful for datasets where the target variable spans several orders of magnitude.
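
To make the point about orders of magnitude concrete, the sketch below compares MSE and MSLE on two made-up predictions that are each double the true value, one at a small scale and one at a large scale. The numbers are illustrative assumptions, not data from the example that follows.

```python
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Two predictions that are each double the true value, at different scales
small_true, small_pred = [10.0], [20.0]
large_true, large_pred = [1000.0], [2000.0]

mse_small = mean_squared_error(small_true, small_pred)
mse_large = mean_squared_error(large_true, large_pred)
msle_small = mean_squared_log_error(small_true, small_pred)
msle_large = mean_squared_log_error(large_true, large_pred)

# MSE grows with the square of the scale; MSLE stays roughly constant
print(f"MSE:  small={mse_small:.1f}, large={mse_large:.1f}")
print(f"MSLE: small={msle_small:.3f}, large={msle_large:.3f}")
```

The two MSE values differ by a factor of 10,000, while the two MSLE values remain close to each other because both predictions are off by the same ratio.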

The mean_squared_log_error() function in scikit-learn calculates MSLE by first applying log(1 + y) to the true and predicted values, then squaring the differences, and finally averaging them. It takes the true and predicted target values as input and by default returns a non-negative float, with lower values indicating better performance and 0.0 being a perfect score.
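
As a quick check of that calculation, the following minimal sketch reproduces the metric by hand with NumPy and compares it to the scikit-learn result. The sample arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up true and predicted values, all non-negative
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# Manual MSLE: mean of the squared differences of log(1 + y)
manual_msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
sklearn_msle = mean_squared_log_error(y_true, y_pred)

print(f"Manual:       {manual_msle:.6f}")
print(f"scikit-learn: {sklearn_msle:.6f}")
```

The two printed values should agree up to floating-point rounding.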

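MSLE is suitable for regression problems where relative differences matter more than absolute differences, especially when dealing with large value ranges. Zero targets are fine, because scikit-learn computes log(1 + y) rather than log(y). Negative values are the problem: log(1 + y) is undefined for y <= -1, and scikit-learn raises a ValueError for invalid negative inputs in either the true or the predicted values (depending on the version, any negative value may be rejected).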
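
The input restriction can be seen directly in a small sketch with made-up arrays. The exact error message may differ between scikit-learn versions, so the code only relies on a ValueError being raised.

```python
from sklearn.metrics import mean_squared_log_error

# Zeros are accepted because the metric works on log(1 + y)
msle_with_zeros = mean_squared_log_error([0.0, 1.0, 2.0], [0.0, 1.5, 2.0])
print(f"MSLE with zeros: {msle_with_zeros:.4f}")

# A prediction of -2.0 makes log(1 + y) undefined, so scikit-learn raises
try:
    mean_squared_log_error([1.0, 2.0, 3.0], [1.0, -2.0, 3.0])
    raised = False
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
```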

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
import numpy as np

# Generate synthetic regression dataset
X, y = make_regression(n_samples=1000, n_features=1, noise=0.1, random_state=42)

# Take absolute values so every target is non-negative, as MSLE requires
y = np.abs(y)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate mean squared log error
msle = mean_squared_log_error(y_test, y_pred)
print(f"Mean Squared Log Error: {msle:.2f}")

Running the example gives an output like:

Mean Squared Log Error: 0.59

The steps are as follows:

1. Generate a synthetic regression dataset using make_regression() with 1000 samples and 1 feature, then take the absolute value of the targets so they are non-negative and the log transformation is valid.
2. Split the dataset into training and test sets using train_test_split().
3. Train a LinearRegression model on the training data.
4. Use the trained model to make predictions on the test data.
5. Calculate the mean_squared_log_error using the true and predicted values, and print the result.

First, we generate a synthetic regression dataset using the make_regression() function from scikit-learn. This function creates a dataset with 1000 samples and 1 feature, allowing us to simulate a regression problem. We then take the absolute value of the targets with np.abs() so that every target is non-negative, which the metric requires.

Next, we split the dataset into training and test sets using the train_test_split() function. This step is crucial for evaluating the performance of our model on unseen data. We use 80% of the data for training and reserve 20% for testing.

With our data prepared, we train a linear regression model using the LinearRegression class from scikit-learn. The fit() method is called on the model object, passing in the training features (X_train) and target values (y_train) so the model can estimate its coefficients.

After training, we use the trained model to make predictions on the test set by calling the predict() method with X_test. This generates predicted values for each sample in the test set.
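
Nothing constrains a linear model's predictions to be non-negative, so on other data predict() may return values that mean_squared_log_error() rejects. One possible workaround, which is not part of the original example, is to clip predictions at zero before scoring. The arrays below are hypothetical stand-ins for y_test and y_pred.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Hypothetical values: one prediction has dipped below zero
y_test_demo = np.array([0.2, 1.0, 3.0, 10.0])
y_pred_demo = np.array([-1.5, 1.2, 2.5, 9.0])

# Raise negative predictions to zero; other values are unchanged
y_pred_clipped = np.clip(y_pred_demo, 0, None)

msle_clipped = mean_squared_log_error(y_test_demo, y_pred_clipped)
print(f"MSLE after clipping: {msle_clipped:.4f}")
```

Clipping changes the predictions being scored, so it is worth reporting when it is used. Training on a log-transformed target is another common option.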

Finally, we compute the mean squared log error of our model using the mean_squared_log_error() function. This function takes the true values (y_test) and the predicted values (y_pred) as input and returns the average squared difference between their log(1 + y) transforms. The resulting MSLE score is printed, giving us a quantitative measure of our model’s performance.
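
Because MSLE is a squared quantity, it is sometimes reported as its square root (RMSLE), which is on the same scale as the log-transformed targets. The sketch below derives it with np.sqrt() on made-up values; the variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

# Made-up true and predicted values spanning several orders of magnitude
y_true_rmsle = np.array([10.0, 100.0, 1000.0])
y_pred_rmsle = np.array([12.0, 90.0, 1100.0])

msle_value = mean_squared_log_error(y_true_rmsle, y_pred_rmsle)

# RMSLE is the square root of MSLE
rmsle_value = np.sqrt(msle_value)

print(f"MSLE:  {msle_value:.4f}")
print(f"RMSLE: {rmsle_value:.4f}")
```

Recent scikit-learn releases also provide a dedicated root_mean_squared_log_error() function, if your installed version includes it.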
