The `n_jobs` parameter in scikit-learn's `LinearRegression` controls the number of jobs to run in parallel when fitting the model. By leveraging multiple cores, it can speed up training, although, as the example below shows, the benefit depends on the shape of the problem.
`LinearRegression` is an ordinary least squares linear regression model. It fits a linear model that minimizes the residual sum of squares between the observed targets and the predictions.
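As a quick illustration of the estimator's basic API, a minimal sketch on a tiny toy dataset:

```python
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1
X = [[0], [1], [2], [3]]
y = [1, 3, 5, 7]

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # approximately [2.0] and 1.0
print(model.predict([[4]]))           # approximately [9.0]
```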
The `n_jobs` parameter determines the number of jobs run in parallel, with each job run on a separate processor core for efficient computation.
The default value for `n_jobs` is `None`, which in practice means one job, i.e. no parallelism. Setting `n_jobs` to `-1` will use all available cores on the machine.
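For example (the core count printed is simply whatever `os.cpu_count()` reports on the machine running the snippet):

```python
import os
from sklearn.linear_model import LinearRegression

print(f"Available cores: {os.cpu_count()}")

lr_serial = LinearRegression(n_jobs=1)     # explicit single job
lr_parallel = LinearRegression(n_jobs=-1)  # request all available cores
```

The example below times `LinearRegression` fits across several `n_jobs` settings on a synthetic dataset.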
```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import time

# Generate synthetic dataset
X, y = make_regression(n_samples=10000, n_features=1000, noise=0.5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with different n_jobs values and time each fit
n_jobs_values = [1, 2, 3, 4, -1]
fit_times = []

for n in n_jobs_values:
    start = time.time()
    lr = LinearRegression(n_jobs=n)
    lr.fit(X_train, y_train)
    end = time.time()
    fit_time = end - start
    fit_times.append(fit_time)
    print(f"n_jobs={n}, Fit Time: {fit_time:.3f} seconds")
```
Running the example gives an output like:
```
n_jobs=1, Fit Time: 0.938 seconds
n_jobs=2, Fit Time: 0.837 seconds
n_jobs=3, Fit Time: 0.830 seconds
n_jobs=4, Fit Time: 1.307 seconds
n_jobs=-1, Fit Time: 1.978 seconds
```
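Note that the larger `n_jobs` values are actually slower here. For a single-target dense problem, `LinearRegression` performs one least-squares solve, and per the scikit-learn documentation `n_jobs` only yields a speedup when there are multiple targets and `X` is sparse (or `positive=True`), so in this case the parallel setup is pure overhead. A hedged sketch of a problem shape where `n_jobs` can actually help (timings will vary by machine):

```python
import time
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)

# Sparse design matrix with 20 regression targets: the case where
# LinearRegression parallelizes (one least-squares solve per target)
X = sp.random(20000, 500, density=0.01, format="csr", random_state=rng)
W = rng.randn(500, 20)
y = X @ W + 0.5 * rng.randn(20000, 20)

for n in [1, -1]:
    start = time.time()
    LinearRegression(n_jobs=n).fit(X, y)
    print(f"n_jobs={n}: {time.time() - start:.3f} seconds")
```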
The key steps in this example are:
- Generate a large synthetic regression dataset with noise
- Split the data into train and test sets
- Train `LinearRegression` models with different `n_jobs` values
- Compare the model fit times for each `n_jobs` setting
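Whichever `n_jobs` value is used, the fitted coefficients are the same; parallelism only affects fit time. A quick sanity check on the held-out test split, assuming the variables from the script above are still in scope:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# The test R^2 should be identical regardless of n_jobs
for n in [1, -1]:
    model = LinearRegression(n_jobs=n).fit(X_train, y_train)
    print(f"n_jobs={n}, test R^2: {r2_score(y_test, model.predict(X_test)):.4f}")
```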
Some tips and heuristics for setting `n_jobs`:

- Use `n_jobs=-1` to leverage all available cores and maximize parallelism
- For small datasets, the overhead of setting up parallelism may outweigh the speedup benefit
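One further convention worth knowing (documented in the scikit-learn glossary): for `n_jobs` below `-1`, `(n_cpus + 1 + n_jobs)` cores are used, so `n_jobs=-2` means "all cores but one," which keeps the machine responsive during long fits:

```python
from sklearn.linear_model import LinearRegression

# n_jobs follows the joblib convention: for n_jobs < -1,
# (n_cpus + 1 + n_jobs) cores are used, so -2 = all but one
lr = LinearRegression(n_jobs=-2)
```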
Issues to consider:
- Running jobs in parallel requires more memory, as each job may be allocated its own copy of the data
- For a very large number of jobs, the overhead and resource contention may limit scalability
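One way to manage this is to leave the estimator's `n_jobs` at its default and cap the worker count externally with joblib's `parallel_backend` context, which supplies the default `n_jobs` for any joblib-backed parallel work inside the block (a sketch assuming `X_train` and `y_train` from the example above):

```python
from joblib import parallel_backend
from sklearn.linear_model import LinearRegression

lr = LinearRegression()  # n_jobs left at the default (None)

# joblib-backed parallelism inside this block defaults to 2 workers
with parallel_backend("loky", n_jobs=2):
    lr.fit(X_train, y_train)  # X_train, y_train from the example above
```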