The var_smoothing parameter in scikit-learn’s GaussianNB controls the amount of variance smoothing applied to the data for numerical stability.
GaussianNB is a variant of the Naive Bayes classifier that assumes each feature follows a Gaussian distribution within each class, which makes it well suited to continuous data.
The var_smoothing parameter adds a portion of the largest feature variance to the variance of every feature, keeping the variance estimates away from zero and preventing division by zero or by very small numbers.
The default value for var_smoothing is 1e-9.
In practice, values between 1e-12 and 1e-5 are commonly used depending on the dataset’s properties.
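Concretely, at fit time scikit-learn computes an additive term epsilon_ = var_smoothing * (largest per-feature variance) and adds it to every per-class feature variance. The short sketch below makes this visible on a tiny made-up dataset; it assumes scikit-learn 1.0 or later, where the smoothed variances are exposed through the var_ attribute.
import numpy as np
from sklearn.naive_bayes import GaussianNB
# Tiny made-up dataset: two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 120.0], [4.0, 130.0]])
y = np.array([0, 0, 1, 1])
gnb = GaussianNB(var_smoothing=1e-9).fit(X, y)
# epsilon_ is var_smoothing times the largest per-feature variance of X
print("epsilon_:", gnb.epsilon_)
# var_ holds the per-class feature variances after epsilon_ has been added
print("smoothed variances:\n", gnb.var_)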
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different var_smoothing values
var_smoothing_values = [1e-12, 1e-9, 1e-6, 1e-3]
accuracies = []
for vs in var_smoothing_values:
    gnb = GaussianNB(var_smoothing=vs)
    gnb.fit(X_train, y_train)
    y_pred = gnb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"var_smoothing={vs}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
var_smoothing=1e-12, Accuracy: 0.775
var_smoothing=1e-09, Accuracy: 0.775
var_smoothing=1e-06, Accuracy: 0.775
var_smoothing=0.001, Accuracy: 0.775
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and noise features.
- Split the data into train and test sets.
- Train GaussianNB models with different var_smoothing values.
- Evaluate the accuracy of each model on the test set.
Some tips and heuristics for setting var_smoothing:
- Start with the default value and adjust based on model performance and dataset characteristics; a cross-validated grid search, sketched after this list, is a simple way to do so.
- Smaller values keep the fitted variances close to the raw estimates and may work well on clean, well-scaled data, while larger values add more smoothing and can help with noisy data or features with near-zero variance.
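The snippet below sketches the tuning tip above using a cross-validated grid search over a logarithmic range of var_smoothing values; the chosen range (1e-12 to 1e-3) and the synthetic dataset are illustrative assumptions, not recommendations for every problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
# Reuse a synthetic dataset similar to the example above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
# Search a logarithmic grid of candidate var_smoothing values
param_grid = {"var_smoothing": np.logspace(-12, -3, 10)}
search = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("best var_smoothing:", search.best_params_["var_smoothing"])
print("best cross-validated accuracy:", round(search.best_score_, 3))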
Issues to consider:
- The optimal var_smoothing value depends on the scale and variance of the features and on how numerically stable the raw variance estimates are.
- Extremely small values can reintroduce the numerical problems that smoothing is meant to prevent, while extremely large values flatten the learned variances and can degrade accuracy, as illustrated in the sketch below.
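To see the over-smoothing failure mode, this hypothetical sketch compares the default setting with an exaggerated value of 10.0 (chosen arbitrarily for illustration): with the large setting, epsilon_ dwarfs the true feature variances, so the fitted variances become nearly identical across features and classes.
import numpy as np
from sklearn.naive_bayes import GaussianNB
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
for vs in (1e-9, 10.0):
    gnb = GaussianNB(var_smoothing=vs).fit(X, y)
    # With vs=10.0 the additive term dominates, washing out the per-feature variances
    print(f"var_smoothing={vs}: epsilon_={gnb.epsilon_:.3g}, "
          f"mean fitted variance={gnb.var_.mean():.3g}")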