RandomForestRegressor is an ensemble learning algorithm that averages the predictions of many decision trees, each trained on a bootstrapped sample of the data, which typically improves regression accuracy over a single tree.
Key hyperparameters include n_estimators (the number of trees in the forest), max_depth (the maximum depth of each tree), and min_samples_split (the minimum number of samples required to split an internal node).
This algorithm is suitable for regression problems where the goal is to predict a continuous output.
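Since these hyperparameters interact, they are often tuned together rather than set one at a time. A minimal sketch of one way to do this is shown below, assuming a grid search with scikit-learn's GridSearchCV; the candidate values and the 3-fold cross-validation are illustrative choices, not recommendations.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
# small synthetic dataset, matching the main example below
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=1)
# illustrative candidate values for the key hyperparameters
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
}
# exhaustive search with 3-fold cross-validation, scored by negative MSE
search = GridSearchCV(RandomForestRegressor(random_state=1), param_grid,
                      scoring='neg_mean_squared_error', cv=3)
search.fit(X, y)
print(search.best_params_)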
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# generate a synthetic regression dataset
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# define the model with 100 trees and a maximum depth of 10
model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=1)
# fit model
model.fit(X_train, y_train)
# evaluate model
yhat = model.predict(X_test)
mse = mean_squared_error(y_test, yhat)
print('Mean Squared Error: %.3f' % mse)
# make a prediction for a single new sample
row = [[-0.67888615, -0.09470897, 1.49138963, -0.638902, -0.44398196]]
yhat = model.predict(row)
print('Predicted: %.3f' % yhat[0])
Running the example gives an output like:
Mean Squared Error: 1078.389
Predicted: -17.048
The steps are as follows:
First, a synthetic regression dataset is generated using the make_regression() function. This creates a dataset with a specified number of samples (n_samples), features (n_features), and a fixed random seed (random_state) for reproducibility. The dataset is split into training and test sets using train_test_split().
Next, a RandomForestRegressor model is instantiated with 100 trees (n_estimators) and a maximum depth of 10 (max_depth). The model is then fit on the training data using the fit() method.
The performance of the model is evaluated by comparing the predictions (yhat) to the actual values (y_test) using the mean squared error metric; a cross-validated variant of this evaluation is sketched after this walkthrough.
A single prediction can be made by passing a new data sample to the predict() method.
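The mean squared error reported above comes from a single train/test split. For a more stable estimate, the same model could instead be scored with k-fold cross-validation; the sketch below assumes 5 folds and negative MSE scoring via cross_val_score, both illustrative choices.
from numpy import mean
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# same synthetic dataset as the main example
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=1)
model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=1)
# 5-fold cross-validation; scikit-learn maximizes scores, so MSE is reported as negative
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5)
print('Cross-validated MSE: %.3f' % -mean(scores))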
This example demonstrates how to set up and use a RandomForestRegressor model for a regression task in scikit-learn, from data preparation through fitting, evaluation, and prediction.
Because it is tree-based, the model can be fit on the training data without scaling or normalization. Once trained, it can be used to make predictions on new data, making it a practical choice for real-world regression problems.
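To make the scaling point concrete, the sketch below compares the model with and without a StandardScaler step; the scaling pipeline is added purely for illustration and is not part of the example above. Because tree splits depend only on the ordering of feature values, standardizing the inputs should leave the test error essentially unchanged.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# fit the same model on raw features and on standardized features
raw = RandomForestRegressor(n_estimators=100, random_state=1).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(),
                       RandomForestRegressor(n_estimators=100, random_state=1)).fit(X_train, y_train)
# standardization preserves feature ordering, so the errors should match closely
print('MSE without scaling: %.3f' % mean_squared_error(y_test, raw.predict(X_test)))
print('MSE with scaling:    %.3f' % mean_squared_error(y_test, scaled.predict(X_test)))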