RANSAC (RANdom SAmple Consensus) is a robust regression algorithm. It repeatedly fits a model to a random subset of the data, classifies every sample as an inlier or an outlier based on its residual, and keeps the fit supported by the largest set of inliers.
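To make the loop concrete, here is a minimal sketch of the idea for a one-dimensional line, written with NumPy. It is an illustration only, not scikit-learn's implementation: the function name ransac_line, its default values, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def ransac_line(x, y, n_trials=100, min_samples=2, residual_threshold=1.0, seed=0):
    """Toy RANSAC for a line y = a*x + b (illustrative, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    best_mask = None
    for _ in range(n_trials):
        # 1. Fit a candidate line to a small random subset.
        idx = rng.choice(len(x), size=min_samples, replace=False)
        a, b = np.polyfit(x[idx], y[idx], deg=1)
        # 2. Samples whose absolute residual is within the threshold are inliers.
        mask = np.abs(y - (a * x + b)) <= residual_threshold
        # 3. Keep the candidate with the largest consensus set.
        if best_mask is None or mask.sum() > best_mask.sum():
            best_mask = mask
    # 4. Refit on all inliers of the best candidate.
    a, b = np.polyfit(x[best_mask], y[best_mask], deg=1)
    return a, b, best_mask

# Hypothetical data: points near y = 2x + 1, with five gross outliers.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.1, size=50)
y[:5] += 30
a, b, inlier_mask = ransac_line(x, y)
print('slope=%.2f intercept=%.2f inliers=%d' % (a, b, inlier_mask.sum()))
```

The recovered slope and intercept should land close to 2 and 1, because the five shifted points never gather enough support to win the consensus step.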
The key hyperparameters of RANSACRegressor include the estimator (the model used for fitting), min_samples (minimum number of samples required to fit the model), residual_threshold (maximum residual for a sample to be classified as an inlier), and max_trials (maximum number of iterations for random sampling).
This algorithm is suitable for regression problems, especially when dealing with datasets containing outliers.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RANSACRegressor, LinearRegression
from sklearn.metrics import mean_absolute_error
# generate regression dataset
X, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=1)
# introduce outliers by overwriting the first 10 samples
np.random.seed(0)
n_outliers = 10
X[:n_outliers] = 3 + 0.5 * np.random.normal(size=(n_outliers, 1))
y[:n_outliers] = -3 + 10 * np.random.normal(size=n_outliers)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# create model
ransac = RANSACRegressor(estimator=LinearRegression())
# fit model
ransac.fit(X_train, y_train)
# evaluate model
yhat = ransac.predict(X_test)
mae = mean_absolute_error(y_test, yhat)
print('Mean Absolute Error: %.3f' % mae)
# make a prediction
row = [[1.5]]
yhat = ransac.predict(row)
print('Predicted: %.3f' % yhat[0])
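Running the example gives an output like the following. Exact values can vary between runs because RANSACRegressor samples randomly and no random_state is set on it: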
Mean Absolute Error: 15.119
Predicted: 125.165
The steps are as follows:
1. First, a synthetic regression dataset is generated using the make_regression() function with added noise to simulate real-world data. Outliers are introduced manually to the dataset.
2. The dataset is split into training and test sets using train_test_split().
3. A RANSACRegressor model is instantiated with LinearRegression as the base estimator. The model is then fit on the training data using the fit() method.
4. The performance of the model is evaluated by predicting on the test set and calculating the Mean Absolute Error (MAE) using mean_absolute_error().
5. A single prediction can be made by passing a new data sample to the predict() method.
This example shows how to use RANSACRegressor for a regression task whose data contains outliers. The final model is fit only on the samples identified as inliers, so the outliers have little influence on the fitted line.