Scikit-Learn r2_score() Metric

r2_score() is a regression metric that evaluates the proportion of the variance in the dependent variable that is predictable from the independent variables. It indicates how well the regression model fits the data.

The score is calculated as 1 minus the ratio of the sum of squared errors (the residual sum of squares) to the total sum of squares. In effect, it compares the fit of the model with a baseline model that always predicts the mean value of the target variable.
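The calculation can be reproduced by hand as a quick check. The following is a minimal sketch, assuming a small set of made-up values (not taken from the example later in this article), that computes both sums with NumPy and compares the result with r2_score():

```python
import numpy as np
from sklearn.metrics import r2_score

# Small made-up example values, chosen only for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Sum of squared errors between the true and predicted values
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares around the mean of the true values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(f"Manual:  {r2_manual:.4f}")   # Manual:  0.9486
print(f"sklearn: {r2_sklearn:.4f}")  # sklearn: 0.9486
```

Both values agree, which suggests that the formula above matches what r2_score() computes for a single target with default settings.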

A score of 1 indicates a perfect fit, 0 indicates that the model does no better than always predicting the mean, and negative values indicate that the model performs worse than the mean prediction. r2_score() is commonly used for evaluating regression models. However, it is not suitable for comparing models fitted to different target variables. Because it is based on squared errors, it is also sensitive to outliers, and it can give a misleading picture of fit when the model is non-linear.
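The three cases can be illustrated directly. This is a small sketch, assuming a made-up target vector and three hand-built sets of predictions (perfect, mean-only, and deliberately reversed):

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up target values, chosen only for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Predictions identical to the targets: a perfect fit
perfect = r2_score(y_true, y_true)

# Always predicting the mean of the targets: the baseline
mean_only = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions in reverse order: worse than the baseline
reversed_pred = r2_score(y_true, y_true[::-1])

print(f"Perfect:   {perfect:.2f}")        # Perfect:   1.00
print(f"Mean only: {mean_only:.2f}")      # Mean only: 0.00
print(f"Reversed:  {reversed_pred:.2f}")  # Reversed:  -3.00
```

The reversed predictions have four times the squared error of the mean baseline, so the score is 1 - 4 = -3.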

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Generate synthetic dataset
X, y = make_regression(n_samples=1000, n_features=1, noise=0.1, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate R-squared score
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2:.2f}")

Running the example gives an output like:

R-squared: 1.00

The steps are as follows:

1. Generate a synthetic regression dataset using make_regression().
2. Split the dataset into training and test sets using train_test_split().
3. Train a LinearRegression model on the training set.
4. Use the trained model to make predictions on the test set with predict().
5. Calculate the R-squared score using r2_score() by comparing the predicted values to the true target values.

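First, we generate a synthetic regression dataset using the make_regression() function from scikit-learn. Here it creates a dataset with 1000 samples and 1 feature, allowing us to simulate a simple linear regression problem without using real-world data. Because the noise level is low (noise=0.1), the relationship between the feature and the target is almost perfectly linear, which is why the score in the output rounds to 1.00.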

Next, we split the dataset into training and test sets using the train_test_split() function. This step is crucial for evaluating the performance of our model on unseen data. We use 80% of the data for training and reserve 20% for testing.
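To confirm the 80/20 split, the shapes of the resulting arrays can be inspected. A minimal sketch, reusing the same parameters as the example above:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Same dataset and split parameters as in the main example
X, y = make_regression(n_samples=1000, n_features=1, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 80% of the 1000 samples go to training, 20% to testing
print(X_train.shape, X_test.shape)  # (800, 1) (200, 1)
print(y_train.shape, y_test.shape)  # (800,) (200,)
```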

With our data prepared, we train a linear regression model using the LinearRegression class from scikit-learn. The fit() method is called on the model object, passing in the training features (X_train) and target values (y_train) to learn the relationship between them.

After training, we use the trained model to make predictions on the test set by calling the predict() method with X_test. This generates a predicted value for each sample in the test set.

Finally, we evaluate the performance of our model using the r2_score() function. This function takes the true target values (y_test) first and the predicted values (y_pred) second, and returns the R-squared score described above. The resulting score is printed, giving us a quantitative measure of our model’s performance.
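As a cross-check, scikit-learn regressors such as LinearRegression also expose a score() method that returns R-squared. The sketch below, which repeats the setup from the main example so that it runs on its own, compares the two:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same dataset, split, and model as in the main example
X, y = make_regression(n_samples=1000, n_features=1, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R-squared from the metric function (true values first, predictions second)
r2_from_metric = r2_score(y_test, y_pred)

# R-squared from the estimator's own score() method (features first, true values second)
r2_from_model = model.score(X_test, y_test)

print(f"r2_score():    {r2_from_metric:.6f}")
print(f"model.score(): {r2_from_model:.6f}")
```

The two values should match. Note that the argument order differs: r2_score() takes the true and predicted values, while score() takes the features and the true values.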

This example demonstrates how to use the r2_score() function from scikit-learn to evaluate the performance of a regression model.
