
Scikit-Learn root_mean_squared_error() Metric

Root Mean Squared Error (RMSE) is a commonly used metric for evaluating the performance of regression models.

It represents the square root of the average squared differences between predicted and actual values. In other words, RMSE provides a measure of how accurately the model predicts the target variable.

The root_mean_squared_error() function in scikit-learn calculates the Root Mean Squared Error (RMSE) by averaging the squared differences between the predicted and actual values and then taking the square root of that average. The function takes the true labels and predicted labels as input and returns a float, with lower values indicating better performance.
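To make the definition concrete, here is a minimal sketch of the same calculation done by hand with NumPy, assuming small illustrative arrays of true and predicted values:

import numpy as np

# Hypothetical true and predicted values for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE: square root of the mean of the squared differences
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse_manual)  # approximately 0.61

The result matches what root_mean_squared_error(y_true, y_pred) would return for the same arrays.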

RMSE is used for regression problems where the goal is to minimize the difference between predicted and actual values. However, it has some limitations: RMSE is sensitive to outliers because larger errors have a disproportionately large effect on the metric, which can skew evaluations if the dataset contains significant outliers.
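A small, hypothetical illustration of this sensitivity compares predictions with and without a single large error:

import numpy as np
from sklearn.metrics import root_mean_squared_error

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred_clean = np.array([10.5, 11.5, 11.0, 12.5, 12.5])    # small errors only
y_pred_outlier = np.array([10.5, 11.5, 11.0, 12.5, 22.0])  # one large error

print(root_mean_squared_error(y_true, y_pred_clean))    # around 0.45
print(root_mean_squared_error(y_true, y_pred_outlier))  # around 4.5, dominated by the single large error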

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=1, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")

Running the example gives an output like:

Root Mean Squared Error (RMSE): 0.10

The steps are as follows:

  1. Generate a synthetic regression dataset using make_regression().
  2. Split the dataset into training and test sets using train_test_split().
  3. Train a LinearRegression model on the training set.
  4. Use the trained model to make predictions on the test set with predict().
  5. Calculate the RMSE of the predictions using root_mean_squared_error() by comparing the predicted labels to the true labels.

First, we generate a synthetic regression dataset using the make_regression() function from scikit-learn. This function creates a dataset with 1000 samples and a single feature, simulating a regression problem without using real-world data.
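If you want to confirm what was generated, a quick optional check of the array shapes looks like this, assuming X and y from the example above:

print(X.shape)  # (1000, 1): 1000 samples, 1 feature
print(y.shape)  # (1000,): one continuous target value per sample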

Next, we split the dataset into training and test sets using the train_test_split() function. This step is crucial for evaluating the performance of our model on unseen data. We use 80% of the data for training and reserve 20% for testing.
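An optional check of the resulting split sizes, assuming the variables from the example above:

print(X_train.shape, X_test.shape)  # (800, 1) (200, 1)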

With our data prepared, we train a linear regression model using the LinearRegression class from scikit-learn. The fit() method is called on the model object, passing in the training features (X_train) and labels (y_train) to learn the underlying patterns in the data.
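After fitting, the model exposes the learned parameters. An optional way to inspect them, assuming the fitted model from the example above:

print(model.coef_)       # learned slope for the single feature
print(model.intercept_)  # learned bias term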

After training, we use the trained model to make predictions on the test set by calling the predict() method with X_test. This generates predicted labels for each sample in the test set.
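To get a feel for the predictions, you can optionally compare a few of them to the true values, assuming y_test and y_pred from the example above:

for actual, predicted in zip(y_test[:5], y_pred[:5]):
    print(f"actual={actual:.2f}  predicted={predicted:.2f}")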

Finally, we evaluate the RMSE of our model using the root_mean_squared_error() function. This function takes the true labels (y_test) and the predicted labels (y_pred) as input and calculates the RMSE. The resulting score is printed, giving us a quantitative measure of the model’s performance.
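As a sanity check, the value should equal the square root of the ordinary mean squared error. The optional snippet below, assuming y_test and y_pred from the example above, verifies this:

import numpy as np
from sklearn.metrics import mean_squared_error

# RMSE is the square root of MSE, so the two values should agree
mse = mean_squared_error(y_test, y_pred)
print(np.sqrt(mse))  # matches the RMSE printed above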

This example demonstrates how to use the root_mean_squared_error() function from scikit-learn to evaluate the performance of a regression model and obtain the RMSE metric.



See Also