The `n_jobs` parameter in scikit-learn's `LinearRegression` controls the number of jobs to run in parallel when fitting the model. By leveraging multiple cores, it can speed up training, although, as the example below shows, the benefit depends on the shape of the problem.
`LinearRegression` is an ordinary least squares linear regression model. It fits a linear model that minimizes the residual sum of squares between the observed targets and the predictions.
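As a quick illustration of the estimator's basic API, a minimal sketch on a tiny toy dataset:

```python
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1
X = [[0], [1], [2], [3]]
y = [1, 3, 5, 7]

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # approximately [2.0] and 1.0
print(model.predict([[4]]))           # approximately [9.0]
```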
The `n_jobs` parameter determines the number of jobs run in parallel, with each job run on a separate processor core for efficient computation.
The default value for `n_jobs` is `None`, which in practice means one job, i.e. no parallelism. Setting `n_jobs` to `-1` will use all available cores on the machine.
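For example (the core count printed is simply whatever `os.cpu_count()` reports on the machine running the snippet):

```python
import os
from sklearn.linear_model import LinearRegression

print(f"Available cores: {os.cpu_count()}")

lr_serial = LinearRegression(n_jobs=1)     # explicit single job
lr_parallel = LinearRegression(n_jobs=-1)  # request all available cores
```

The example below times `LinearRegression` fits across several `n_jobs` settings on a synthetic dataset.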
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=1000, noise=0.5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values and time each fit
n_jobs_values = [1, 2, 3, 4, -1]
fit_times = []

for n in n_jobs_values:
    start = time.time()
    lr = LinearRegression(n_jobs=n)
    lr.fit(X_train, y_train)
    end = time.time()
    fit_time = end - start
    fit_times.append(fit_time)
    print(f"n_jobs={n}, Fit Time: {fit_time:.3f} seconds")
```
Running the example gives an output like:
```
n_jobs=1, Fit Time: 0.938 seconds
n_jobs=2, Fit Time: 0.837 seconds
n_jobs=3, Fit Time: 0.830 seconds
n_jobs=4, Fit Time: 1.307 seconds
n_jobs=-1, Fit Time: 1.978 seconds
```
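Note that the larger `n_jobs` values are actually slower here. For a single-target dense problem, `LinearRegression` performs one least-squares solve, and per the scikit-learn documentation `n_jobs` only yields a speedup when there are multiple targets and `X` is sparse (or `positive=True`), so in this case the parallel setup is pure overhead. A hedged sketch of a problem shape where `n_jobs` can actually help (timings will vary by machine):

```python
import time
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)

# Sparse design matrix with 20 regression targets: the case where
# LinearRegression parallelizes (one least-squares solve per target)
X = sp.random(20000, 500, density=0.01, format="csr", random_state=rng)
W = rng.randn(500, 20)
y = X @ W + 0.5 * rng.randn(20000, 20)

for n in [1, -1]:
    start = time.time()
    LinearRegression(n_jobs=n).fit(X, y)
    print(f"n_jobs={n}: {time.time() - start:.3f} seconds")
```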
The key steps in this example are:
- Generate a large synthetic regression dataset with noise
- Split the data into train and test sets
- Train `LinearRegression` models with different `n_jobs` values
- Compare the model fit times for each `n_jobs` setting
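Whichever `n_jobs` value is used, the fitted coefficients are the same; parallelism only affects fit time. A quick sanity check on the held-out test split, assuming the variables from the script above are still in scope:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# The test R^2 should be identical regardless of n_jobs
for n in [1, -1]:
    model = LinearRegression(n_jobs=n).fit(X_train, y_train)
    print(f"n_jobs={n}, test R^2: {r2_score(y_test, model.predict(X_test)):.4f}")
```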
Some tips and heuristics for setting `n_jobs`:

- Use `n_jobs=-1` to leverage all available cores and maximize parallelism
- For small datasets, the overhead of setting up parallelism may outweigh the speedup benefit
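One further convention worth knowing (documented in the scikit-learn glossary): for `n_jobs` below `-1`, `(n_cpus + 1 + n_jobs)` cores are used, so `n_jobs=-2` means "all cores but one," which keeps the machine responsive during long fits:

```python
from sklearn.linear_model import LinearRegression

# n_jobs follows the joblib convention: for n_jobs < -1,
# (n_cpus + 1 + n_jobs) cores are used, so -2 = all but one
lr = LinearRegression(n_jobs=-2)
```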
Issues to consider:
- Running jobs in parallel requires more memory, as each job may be allocated its own copy of the data
- For a very large number of jobs, the overhead and resource contention may limit scalability
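One way to manage this is to leave the estimator's `n_jobs` at its default and cap the worker count externally with joblib's `parallel_backend` context, which supplies the default `n_jobs` for any joblib-backed parallel work inside the block (a sketch assuming `X_train` and `y_train` from the example above):

```python
from joblib import parallel_backend
from sklearn.linear_model import LinearRegression

lr = LinearRegression()  # n_jobs left at the default (None)

# joblib-backed parallelism inside this block defaults to 2 workers
with parallel_backend("loky", n_jobs=2):
    lr.fit(X_train, y_train)  # X_train, y_train from the example above
```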