The var_smoothing parameter in scikit-learn’s GaussianNB controls the amount of variance smoothing applied to the data for numerical stability.
GaussianNB is a variant of the Naive Bayes classifier that assumes each feature follows a Gaussian distribution within each class, which makes it well suited to continuous data.
The var_smoothing parameter adds a portion of the largest feature variance to the variance of every feature, keeping the variance estimates away from zero and preventing division by zero or by very small numbers.
The default value for var_smoothing is 1e-9.
In practice, values between 1e-12 and 1e-5 are commonly used depending on the dataset’s properties.
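Concretely, at fit time scikit-learn computes an additive term epsilon_ = var_smoothing * (largest per-feature variance) and adds it to every per-class feature variance. The short sketch below makes this visible on a tiny made-up dataset; it assumes scikit-learn 1.0 or later, where the smoothed variances are exposed through the var_ attribute.
import numpy as np
from sklearn.naive_bayes import GaussianNB
# Tiny made-up dataset: two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 110.0], [3.0, 120.0], [4.0, 130.0]])
y = np.array([0, 0, 1, 1])
gnb = GaussianNB(var_smoothing=1e-9).fit(X, y)
# epsilon_ is var_smoothing times the largest per-feature variance of X
print("epsilon_:", gnb.epsilon_)
# var_ holds the per-class feature variances after epsilon_ has been added
print("smoothed variances:\n", gnb.var_)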
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train with different var_smoothing values
var_smoothing_values = [1e-12, 1e-9, 1e-6, 1e-3]
accuracies = []
for vs in var_smoothing_values:
    gnb = GaussianNB(var_smoothing=vs)
    gnb.fit(X_train, y_train)
    y_pred = gnb.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"var_smoothing={vs}, Accuracy: {accuracy:.3f}")
Running the example gives an output like:
var_smoothing=1e-12, Accuracy: 0.775
var_smoothing=1e-09, Accuracy: 0.775
var_smoothing=1e-06, Accuracy: 0.775
var_smoothing=0.001, Accuracy: 0.775
The key steps in this example are:
- Generate a synthetic binary classification dataset with informative and noise features.
- Split the data into train and test sets.
- Train GaussianNB models with different var_smoothing values.
- Evaluate the accuracy of each model on the test set.
Some tips and heuristics for setting var_smoothing:
- Start with the default value and adjust based on model performance and dataset characteristics; a cross-validated grid search, sketched after this list, is a simple way to do so.
- Smaller values keep the fitted variances close to the raw estimates and may work well on clean, well-scaled data, while larger values add more smoothing and can help with noisy data or features with near-zero variance.
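The snippet below sketches the tuning tip above using a cross-validated grid search over a logarithmic range of var_smoothing values; the chosen range (1e-12 to 1e-3) and the synthetic dataset are illustrative assumptions, not recommendations for every problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
# Reuse a synthetic dataset similar to the example above
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_redundant=0, random_state=42)
# Search a logarithmic grid of candidate var_smoothing values
param_grid = {"var_smoothing": np.logspace(-12, -3, 10)}
search = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("best var_smoothing:", search.best_params_["var_smoothing"])
print("best cross-validated accuracy:", round(search.best_score_, 3))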
Issues to consider:
- The optimal var_smoothing value depends on the scale and variance of the features and on how numerically stable the raw variance estimates are.
- Extremely small values can reintroduce the numerical problems that smoothing is meant to prevent, while extremely large values flatten the learned variances and can degrade accuracy, as illustrated in the sketch below.
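To see the over-smoothing failure mode, this hypothetical sketch compares the default setting with an exaggerated value of 10.0 (chosen arbitrarily for illustration): with the large setting, epsilon_ dwarfs the true feature variances, so the fitted variances become nearly identical across features and classes.
import numpy as np
from sklearn.naive_bayes import GaussianNB
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
for vs in (1e-9, 10.0):
    gnb = GaussianNB(var_smoothing=vs).fit(X, y)
    # With vs=10.0 the additive term dominates, washing out the per-feature variances
    print(f"var_smoothing={vs}: epsilon_={gnb.epsilon_:.3g}, "
          f"mean fitted variance={gnb.var_.mean():.3g}")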