Configure LogisticRegression "dual" Parameter

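The dual parameter in scikit-learn’s LogisticRegression controls whether the model is fit by solving the primal or the dual formulation of the optimization problem. It does not choose the solver; that is the job of the solver parameter.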

Setting dual to True solves the dual formulation, which can be faster for problems with a small number of samples and a large number of features.

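By default, dual is set to False, so the primal formulation is solved. The dual formulation is implemented only for the 'liblinear' solver with an L2 penalty, and the scikit-learn documentation recommends dual=False when the number of samples exceeds the number of features.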
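The solver restriction can be checked directly. The following is a minimal sketch, assuming a small synthetic dataset; the fit_outcome helper and the dataset sizes are illustrative choices, not part of scikit-learn. It tries dual=True with two solvers and reports whether fitting succeeds or is rejected with a ValueError.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small illustrative dataset with more features than samples (assumed sizes)
X_small, y_small = make_classification(n_samples=60, n_features=200, n_informative=10,
                                       n_redundant=0, random_state=0)


def fit_outcome(solver):
    """Hypothetical helper: try to fit with dual=True and report what happened."""
    model = LogisticRegression(dual=True, solver=solver, max_iter=1000)
    try:
        model.fit(X_small, y_small)
    except ValueError as exc:
        # Solvers other than liblinear reject dual=True when fit() is called
        return f"rejected: {exc}"
    return "fitted"


outcomes = {solver: fit_outcome(solver) for solver in ("liblinear", "lbfgs")}
for solver, outcome in outcomes.items():
    print(f"{solver}: {outcome}")
```

Note that the check happens in fit(), not in the constructor, so an invalid combination only surfaces once training starts.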

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import time

# Generate synthetic dataset with a small number of samples and large number of features
X, y = make_classification(n_samples=100, n_features=1000, n_informative=50,
                           n_redundant=0, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train with dual=False (primal formulation).
# time.perf_counter() is a monotonic, high-resolution clock suited to short timings.
start_time = time.perf_counter()
lr_primal = LogisticRegression(dual=False, solver='liblinear', random_state=42)
lr_primal.fit(X_train, y_train)
primal_time = time.perf_counter() - start_time
y_pred_primal = lr_primal.predict(X_test)
primal_accuracy = accuracy_score(y_test, y_pred_primal)

# Train with dual=True (dual formulation)
start_time = time.perf_counter()
lr_dual = LogisticRegression(dual=True, solver='liblinear', random_state=42)
lr_dual.fit(X_train, y_train)
dual_time = time.perf_counter() - start_time
y_pred_dual = lr_dual.predict(X_test)
dual_accuracy = accuracy_score(y_test, y_pred_dual)

print(f"Primal form training time: {primal_time:.3f} seconds, Accuracy: {primal_accuracy:.3f}")
print(f"Dual form training time: {dual_time:.3f} seconds, Accuracy: {dual_accuracy:.3f}")

The output will look similar to the following; exact timings vary by machine and from run to run:

Primal form training time: 0.006 seconds, Accuracy: 0.750
Dual form training time: 0.003 seconds, Accuracy: 0.750
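The identical accuracy is expected: both settings optimize the same L2-regularized objective, so they should arrive at nearly the same model and differ mainly in how they get there. The sketch below illustrates this by comparing the fitted coefficients. The tight tol and large max_iter values are assumptions chosen so that both fits converge closely, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Same shape of problem as above: few samples, many features
X_wide, y_wide = make_classification(n_samples=100, n_features=1000, n_informative=50,
                                     n_redundant=0, random_state=42)

# Tight tolerance and a generous iteration cap (illustrative values)
common = dict(solver="liblinear", tol=1e-6, max_iter=10000, random_state=42)
primal_model = LogisticRegression(dual=False, **common).fit(X_wide, y_wide)
dual_model = LogisticRegression(dual=True, **common).fit(X_wide, y_wide)

# Relative distance between the two coefficient vectors
coef_gap = np.linalg.norm(primal_model.coef_ - dual_model.coef_)
relative_gap = coef_gap / np.linalg.norm(primal_model.coef_)

# Fraction of training samples on which both models predict the same class
agreement = np.mean(primal_model.predict(X_wide) == dual_model.predict(X_wide))

print(f"Relative coefficient gap: {relative_gap:.2e}")
print(f"Prediction agreement: {agreement:.3f}")
```

A small relative gap suggests that the choice of dual is a matter of training cost rather than model quality.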

The key steps in this example are:

1. Generate a synthetic binary classification dataset with a small number of samples and a large number of features.
2. Split the data into train and test sets.
3. Train LogisticRegression models with dual set to False and True.
4. Compare the training time and accuracy of the models.

Tips and heuristics for setting dual:

Consider setting dual to True when the number of features is greater than the number of samples; otherwise keep the default dual=False
The dual parameter is only supported with the 'liblinear' solver and an L2 penalty
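The shape-based heuristic can be written as a one-line check. This is a minimal sketch: choose_dual is a hypothetical helper, not a scikit-learn function, and the zero-filled arrays only stand in for real data to show the two cases.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def choose_dual(X):
    """Hypothetical helper: return True when features outnumber samples."""
    n_samples, n_features = X.shape
    return n_features > n_samples


# Placeholder arrays; only their shapes matter for the heuristic
X_wide = np.zeros((100, 1000))   # more features than samples
X_tall = np.zeros((1000, 100))   # more samples than features

wide_model = LogisticRegression(dual=choose_dual(X_wide), solver="liblinear")
tall_model = LogisticRegression(dual=choose_dual(X_tall), solver="liblinear")

print(wide_model.dual, tall_model.dual)  # → True False
```

Treat the result as a starting point and confirm it with a timing comparison on the actual data.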

Issues to consider:

Setting dual inappropriately can lead to longer training times without any improvement in accuracy
On datasets with more samples than features, dual=True offers little benefit and the scikit-learn documentation recommends dual=False
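Single timings of a few milliseconds are noisy, so it helps to repeat the measurement before deciding. The sketch below times both settings on a dataset with more samples than features and reports the median. The dataset sizes, the repeat count, max_iter=1000, and the median_fit_time helper are all illustrative assumptions.

```python
import time
from statistics import median

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Dataset with many more samples than features (assumed sizes)
X_tall, y_tall = make_classification(n_samples=2000, n_features=20, n_informative=10,
                                     n_redundant=0, random_state=42)


def median_fit_time(dual, repeats=5):
    """Hypothetical helper: median wall-clock fit time over several runs."""
    timings = []
    for _ in range(repeats):
        model = LogisticRegression(dual=dual, solver="liblinear",
                                   max_iter=1000, random_state=42)
        start = time.perf_counter()
        model.fit(X_tall, y_tall)
        timings.append(time.perf_counter() - start)
    return median(timings)


fit_times = {dual: median_fit_time(dual) for dual in (False, True)}
for dual, seconds in fit_times.items():
    print(f"dual={dual}: median fit time {seconds:.4f} seconds")
```

Which setting comes out ahead depends on the data and the machine, so no expected output is shown here.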

See Also