The dual parameter in scikit-learn's LogisticRegression selects whether the primal or the dual formulation of the optimization problem is solved. It does not change which solver is used.
Setting dual to True solves the dual formulation, which can be faster when the number of features exceeds the number of samples. The scikit-learn documentation recommends dual=False when n_samples > n_features.
By default, dual is set to False, which solves the primal formulation. The dual formulation is only implemented for the 'liblinear' solver with the L2 penalty.
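As a minimal sketch of the solver restriction, the snippet below fits one model with 'liblinear' and then tries the same setting with 'lbfgs'. The dataset sizes and the variable names dual_ok and rejected are illustrative assumptions, and the exact error message may differ between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small illustrative dataset (sizes chosen arbitrarily for this sketch)
X, y = make_classification(n_samples=50, n_features=20, random_state=0)

# liblinear supports the dual formulation with the default L2 penalty
dual_ok = LogisticRegression(dual=True, solver="liblinear").fit(X, y)

# Other solvers are expected to reject dual=True when fit is called
try:
    LogisticRegression(dual=True, solver="lbfgs").fit(X, y)
    rejected = False
except ValueError as exc:
    rejected = True
    print(f"lbfgs with dual=True was rejected: {exc}")
```

The check happens at fit time, so constructing the estimator alone does not surface the problem.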
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time
# Generate synthetic dataset with a small number of samples and large number of features
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with dual=False
start_time = time.time()
lr_primal = LogisticRegression(dual=False, solver='liblinear', random_state=42)
lr_primal.fit(X_train, y_train)
primal_time = time.time() - start_time
y_pred_primal = lr_primal.predict(X_test)
primal_accuracy = accuracy_score(y_test, y_pred_primal)
# Train with dual=True
start_time = time.time()
lr_dual = LogisticRegression(dual=True, solver='liblinear', random_state=42)
lr_dual.fit(X_train, y_train)
dual_time = time.time() - start_time
y_pred_dual = lr_dual.predict(X_test)
dual_accuracy = accuracy_score(y_test, y_pred_dual)
print(f"Primal form training time: {primal_time:.3f} seconds, Accuracy: {primal_accuracy:.3f}")
print(f"Dual form training time: {dual_time:.3f} seconds, Accuracy: {dual_accuracy:.3f}")
The output will look similar to the following (exact timings vary by machine and run):
Primal form training time: 0.006 seconds, Accuracy: 0.750
Dual form training time: 0.003 seconds, Accuracy: 0.750
The key steps in this example are:
- Generate a synthetic binary classification dataset with a small number of samples and a large number of features
- Split the data into train and test sets
- Train LogisticRegression models with dual set to False and True
- Compare the training time and accuracy of the models
Tips and heuristics for setting dual:
- Consider setting dual to True when the number of features is greater than the number of samples
- The dual parameter is only relevant when using the 'liblinear' solver with the L2 penalty
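The shape-based heuristic above can be sketched as a small helper. The function make_liblinear_logreg is hypothetical (it is not part of scikit-learn), and the dataset sizes are assumptions chosen only to show both branches.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def make_liblinear_logreg(X, **kwargs):
    """Hypothetical helper: choose dual from the shape of X."""
    n_samples, n_features = X.shape
    # Heuristic: use the dual formulation only when features outnumber samples
    use_dual = n_features > n_samples
    return LogisticRegression(solver="liblinear", dual=use_dual, **kwargs)


# Wide data: more features than samples
X_wide, y_wide = make_classification(n_samples=100, n_features=1000,
                                     n_informative=50, n_redundant=0,
                                     random_state=42)
# Tall data: more samples than features
X_tall, y_tall = make_classification(n_samples=1000, n_features=20,
                                     random_state=42)

wide_model = make_liblinear_logreg(X_wide).fit(X_wide, y_wide)
tall_model = make_liblinear_logreg(X_tall).fit(X_tall, y_tall)

print(wide_model.dual, tall_model.dual)  # True False
```

This is a rule of thumb rather than a guarantee, so it is worth timing both settings on the actual data when training cost matters.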
Issues to consider:
- Setting dual inappropriately can lead to longer training times without improvement in accuracy
- The effect of dual may be less significant on larger datasets
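One way to check these points on a given dataset is to time several fits per setting and compare the best run, since a single measurement in the millisecond range is noisy. The sketch below assumes a dataset with many more samples than features. The helper best_fit_time and the sizes used are illustrative, and no particular timing result is implied.

```python
import timeit

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Tall dataset: many more samples than features (sizes are assumptions)
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)


def best_fit_time(dual, repeats=5):
    """Illustrative helper: return the fastest of several fit timings."""
    model = LogisticRegression(dual=dual, solver="liblinear", random_state=42)
    return min(timeit.repeat(lambda: model.fit(X, y), number=1, repeat=repeats))


timings = {dual: best_fit_time(dual) for dual in (False, True)}
for dual, seconds in timings.items():
    print(f"dual={dual}: best of 5 fits took {seconds:.4f} seconds")
```

Taking the minimum over repeats reduces the influence of background load on the comparison.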