The dual parameter in scikit-learn’s LogisticRegression determines whether the primal or the dual formulation of the optimization problem is solved; it does not select the solver itself. Setting dual to True solves the dual optimization problem, which can be faster and more stable for problems with a small number of samples and a large number of features. By default, dual is set to False, which solves the primal optimization problem. The dual formulation is only implemented for the 'liblinear' solver with the L2 penalty, and the scikit-learn documentation recommends dual=False when the number of samples exceeds the number of features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset with a small number of samples and large number of features
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with dual=False (primal formulation); perf_counter is a monotonic, high-resolution timer
start_time = time.perf_counter()
lr_primal = LogisticRegression(dual=False, solver='liblinear', random_state=42)
lr_primal.fit(X_train, y_train)
primal_time = time.perf_counter() - start_time
y_pred_primal = lr_primal.predict(X_test)
primal_accuracy = accuracy_score(y_test, y_pred_primal)
# Train with dual=True (dual formulation)
start_time = time.perf_counter()
lr_dual = LogisticRegression(dual=True, solver='liblinear', random_state=42)
lr_dual.fit(X_train, y_train)
dual_time = time.perf_counter() - start_time
y_pred_dual = lr_dual.predict(X_test)
dual_accuracy = accuracy_score(y_test, y_pred_dual)
print(f"Primal form training time: {primal_time:.3f} seconds, Accuracy: {primal_accuracy:.3f}")
print(f"Dual form training time: {dual_time:.3f} seconds, Accuracy: {dual_accuracy:.3f}")
The output will look similar to the following (exact timings vary by machine and run):
Primal form training time: 0.006 seconds, Accuracy: 0.750
Dual form training time: 0.003 seconds, Accuracy: 0.750
The key steps in this example are:
- Generate a synthetic binary classification dataset with a small number of samples and a large number of features
- Split the data into train and test sets
- Train LogisticRegression models with dual set to False and True
- Compare the training time and accuracy of the models
Tips and heuristics for setting dual:
- Consider setting dual to True when the number of features is greater than the number of samples
- The dual parameter is only relevant when using the 'liblinear' solver with the L2 penalty
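
The two tips above can be sketched as a small shape check plus a demonstration of the solver restriction. This is a minimal illustration, not a definitive recipe: choose_dual is a hypothetical helper (not part of scikit-learn), and the dataset sizes are assumptions chosen only to show a wide and a tall case.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def choose_dual(X):
    """Hypothetical helper: return True when features outnumber samples."""
    n_samples, n_features = X.shape
    return n_features > n_samples


# Wide data (assumed sizes): 100 samples, 1000 features
X_wide, y_wide = make_classification(n_samples=100, n_features=1000,
                                     n_informative=50, n_redundant=0,
                                     random_state=42)
# Tall data (assumed sizes): 1000 samples, 20 features
X_tall, y_tall = make_classification(n_samples=1000, n_features=20,
                                     random_state=42)

use_dual_wide = choose_dual(X_wide)
use_dual_tall = choose_dual(X_tall)
print(f"wide data -> dual={use_dual_wide}, tall data -> dual={use_dual_tall}")

# The heuristic result is passed straight to the liblinear-based model
model = LogisticRegression(dual=use_dual_wide, solver='liblinear', random_state=42)
model.fit(X_wide, y_wide)

# Requesting the dual formulation with a non-liblinear solver is rejected at fit time
try:
    LogisticRegression(dual=True, solver='lbfgs').fit(X_wide, y_wide)
    rejected = False
except ValueError as exc:
    rejected = True
    print(f"lbfgs with dual=True was rejected: {exc}")
```

Treat the shape check as a starting point only; timing both settings on your own data remains the more reliable guide.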
Issues to consider:
- Setting dual to True when samples outnumber features can lead to longer training times without improvement in accuracy
- The effect of dual may be less significant on larger datasets
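
One way to explore these issues is to repeat the comparison on data with many more samples than features. The sketch below assumes a 5000-sample, 20-feature synthetic dataset (sizes chosen only for illustration) and reports time and accuracy for both settings; results depend on the machine and the scikit-learn version, so no expected output is shown.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed sizes for illustration: many more samples than features
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for use_dual in (False, True):
    model = LogisticRegression(dual=use_dual, solver='liblinear', random_state=42)
    start = time.perf_counter()
    model.fit(X_train, y_train)  # dual=True may emit a ConvergenceWarning here
    elapsed = time.perf_counter() - start
    accuracy = accuracy_score(y_test, model.predict(X_test))
    results[use_dual] = (elapsed, accuracy)
    print(f"dual={use_dual}: {elapsed:.3f} seconds, accuracy {accuracy:.3f}")
```

Single-run timings this small are noisy, so repeating the loop several times and averaging gives a fairer comparison.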