# Naïve Bayesian classification

Recall from the Bayesian inference notes that Bayes’ theorem allows us to take a known probabilistic relationship, such as $P(\text{toothache}|\text{gum disease},\neg\text{cavity}) = 0.3$, and reason backwards. For example, we can determine $P(\text{gum disease}|\text{toothache})$, read “the probability that the patient has gum disease given that they have a toothache,” assuming we have knowledge of the various “conditionals” such as $P(\text{toothache}|\text{gum disease},\text{cavity})$ and so on, and the “priors” $P(\text{gum disease})$ and $P(\text{cavity})$.

We can call “toothache” the *evidence* and “gum disease” the *prediction*. In other words, we can use Bayesian inference to perform classification. The “model” in this case is the known probabilities (the conditionals and priors). Given a new unclassified instance, we can treat the instance as evidence and perform Bayesian inference to predict the class: gum disease or no gum disease.

As with other classification techniques, we want to *learn* the probabilities from training data rather than code them by hand. Doing so efficiently and in a way that yields an efficient Bayesian inference procedure requires that we make the “naïve assumption.”

## Without the naïve assumption

Suppose we have some training examples:

Coat Color | Hat Color | Gentry? |
---|---|---|
Black | Black | Yes |
Black | Brown | No |
Blue | Black | No |
Blue | Brown | No |
Brown | Black | Yes |
Brown | Brown | No |

Our “class” (predicted) column will be Gentry and the evidence columns will be Coat Color and Hat Color. We saw previously how a decision tree would handle this data. Bayesian inference would treat the data as a probabilistic inference problem: $P(\text{Gentry}|\text{Coat=Black}, \text{Hat=Brown}) = ?$.

In order to solve that probability, we need to apply Bayes’ rule:

$$P(\text{Gentry}|\text{Coat=Black},\text{Hat=Brown}) = \frac{P(\text{Coat=Black},\text{Hat=Brown}|\text{Gentry}) \, P(\text{Gentry})}{P(\text{Coat=Black},\text{Hat=Brown})}$$

So far so good. We can easily learn $P(\text{Gentry})$ from the data by counting rows:

$$P(\text{Gentry}) = 2/6 = 1/3$$

But $P(\text{Coat=Black},\text{Hat=Brown}|\text{Gentry})$ and $P(\text{Coat=Black},\text{Hat=Brown})$ start to spell trouble. In this simple example, they are easy to compute. We can just count the number of rows in the table. But notice that $P(\text{Coat},\text{Hat}|\text{Gentry})$ represents a table containing $3 \times 2 \times 2 = 12$ probabilities for all the variations of coat, hat, and gentry. First, consider the table for Gentry=Yes:

Coat Color | Hat Color | $P(\text{Coat},\text{Hat}|\text{Gentry=Yes})$ |
---|---|---|
Black | Black | 1/2 |
Black | Brown | 0/2 |
Blue | Black | 0/2 |
Blue | Brown | 0/2 |
Brown | Black | 1/2 |
Brown | Brown | 0/2 |

And now the table for Gentry=No:

Coat Color | Hat Color | $P(\text{Coat},\text{Hat}|\text{Gentry=No})$ |
---|---|---|
Black | Black | 0/4 |
Black | Brown | 1/4 |
Blue | Black | 1/4 |
Blue | Brown | 1/4 |
Brown | Black | 0/4 |
Brown | Brown | 1/4 |

We have a final table for the probability of Coat and Hat Color co-occurring:

Coat Color | Hat Color | $P(\text{Coat},\text{Hat})$ |
---|---|---|
Black | Black | 1/6 |
Black | Brown | 1/6 |
Blue | Black | 1/6 |
Blue | Brown | 1/6 |
Brown | Black | 1/6 |
Brown | Brown | 1/6 |

**Now imagine we had $n$ different attributes** $a_1, a_2, \dots, a_n$, ignoring the predicted class attribute. Bayes’ rule gives us:

$$P(\text{class}|a_1,a_2,\dots,a_n) = \frac{P(a_1,a_2,\dots,a_n|\text{class}) \, P(\text{class})}{P(a_1,a_2,\dots,a_n)}$$

The tables for $P(a_1,a_2,\dots,a_n|\text{class})$ and $P(a_1,a_2,\dots,a_n)$ will have $2^n$ rows for each class, assuming binary attributes. In other words, a simplistic Bayesian approach to the problem yields exponential models. That is unacceptable.
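To make the blow-up concrete, here is a short Python sketch (the variable names are my own) that learns the full joint conditional table from the six training rows above by counting. Even for this tiny dataset the table has $3 \times 2 \times 2 = 12$ entries, and each additional attribute multiplies the size:

```python
from itertools import product
from collections import Counter

# Training data from the table above: (coat, hat, gentry)
rows = [
    ("Black", "Black", "Yes"),
    ("Black", "Brown", "No"),
    ("Blue",  "Black", "No"),
    ("Blue",  "Brown", "No"),
    ("Brown", "Black", "Yes"),
    ("Brown", "Brown", "No"),
]

coats = ["Black", "Blue", "Brown"]
hats = ["Black", "Brown"]
classes = ["Yes", "No"]

# Count each full (coat, hat, gentry) combination and each class.
counts = Counter(rows)
class_counts = Counter(g for _, _, g in rows)

# P(Coat, Hat | Gentry): one entry per combination -- 3 * 2 * 2 = 12 entries.
# With n binary attributes this dictionary would need 2^n entries per class.
joint_conditional = {
    (c, h, g): counts[(c, h, g)] / class_counts[g]
    for c, h, g in product(coats, hats, classes)
}

print(len(joint_conditional))  # 12
```

Most of the 12 entries are zero because the six training rows cannot possibly cover every combination, which is another symptom of the same exponential problem.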

## With the naïve assumption

In order to reduce the complexity of the model, we’ll make a “naïve assumption.” We’ll assume all the evidence attributes (e.g., Coat Color and Hat Color) are **independent events**. This is probably not true at all: most gentry will intentionally coordinate their coat and hat color choices. But we make this assumption for expedience, even if it does not match reality.

By making this assumption, we transform the exponential model into a linear model:

$$P(\text{class}|a_1,a_2,\dots,a_n) = \frac{P(a_1|\text{class}) \, P(a_2|\text{class}) \cdots P(a_n|\text{class}) \, P(\text{class})}{P(a_1) \, P(a_2) \cdots P(a_n)}$$

or stated more compactly,

$$P(\text{class}|a_1,a_2,\dots,a_n) = \frac{P(\text{class}) \prod_{i=1}^{n} P(a_i|\text{class})}{\prod_{i=1}^{n} P(a_i)}$$

Now our probability tables are very small. First, the probability of a coat color if you know the gentry status:

Coat Color | $P(\text{Coat}|\text{Gentry=Yes})$ |
---|---|
Black | 1/2 |
Blue | 0/2 |
Brown | 1/2 |

And the other gentry status:

Coat Color | $P(\text{Coat}|\text{Gentry=No})$ |
---|---|
Black | 1/4 |
Blue | 2/4 |
Brown | 1/4 |

Now the probability of hat color if you know the gentry status:

Hat Color | $P(\text{Hat}|\text{Gentry=Yes})$ |
---|---|
Black | 2/2 |
Brown | 0/2 |

And the other gentry status:

Hat Color | $P(\text{Hat}|\text{Gentry=No})$ |
---|---|
Black | 1/4 |
Brown | 3/4 |

Now the “prior” probability of coat color:

Coat Color | $P(\text{Coat})$ |
---|---|
Black | 2/6 |
Blue | 2/6 |
Brown | 2/6 |

And the “prior” probability of hat color:

Hat Color | $P(\text{Hat})$ |
---|---|
Black | 3/6 |
Brown | 3/6 |

The resulting Bayesian network, with the naïve assumption in place, has a single arrow from the Gentry node to each of the Coat Color and Hat Color nodes, and no arrows between the attribute nodes themselves.
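With the per-attribute tables above, classifying an instance is just a product of table lookups. A minimal Python sketch (the function and variable names are my own) of the naïve formula applied to $P(\text{Gentry}|\text{Coat=Black},\text{Hat=Brown})$:

```python
# Per-attribute tables learned above under the naive assumption.
p_class = {"Yes": 2/6, "No": 4/6}
p_coat_given = {"Yes": {"Black": 1/2, "Blue": 0/2, "Brown": 1/2},
                "No":  {"Black": 1/4, "Blue": 2/4, "Brown": 1/4}}
p_hat_given = {"Yes": {"Black": 2/2, "Brown": 0/2},
               "No":  {"Black": 1/4, "Brown": 3/4}}
p_coat = {"Black": 2/6, "Blue": 2/6, "Brown": 2/6}
p_hat = {"Black": 3/6, "Brown": 3/6}

def posterior(coat, hat):
    """P(Gentry | coat, hat) for both classes, by the naive formula."""
    evidence = p_coat[coat] * p_hat[hat]
    scores = {g: p_class[g] * p_coat_given[g][coat] * p_hat_given[g][hat] / evidence
              for g in p_class}
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}  # normalize to probabilities

print(posterior("Black", "Brown"))  # Gentry=Yes gets probability 0
```

Note that the zero count for $P(\text{Hat=Brown}|\text{Gentry=Yes})$ forces the Yes probability to exactly zero; in practice one often smooths the counts to avoid this, but that is beyond this example.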

## A larger example

Consider the Iris dataset. In this case, the four attributes are continuous rather than discrete, so we’ll model them with Gaussian distributions. Thus, we need to compute the mean and standard deviation for each attribute, both as priors (across all the instances) and conditionals (for each subset of iris species).

Class (or “Prior”) | Attribute | Mean, standard deviation |
---|---|---|
(Prior) | Sepal length | m=5.843, sd=0.825 |
(Prior) | Sepal width | m=3.054, sd=0.432 |
(Prior) | Petal length | m=3.759, sd=1.759 |
(Prior) | Petal width | m=1.199, sd=0.761 |
Iris-setosa | Sepal length | m=5.006, sd=0.349 |
Iris-setosa | Sepal width | m=3.418, sd=0.377 |
Iris-setosa | Petal length | m=1.464, sd=0.172 |
Iris-setosa | Petal width | m=0.244, sd=0.106 |
Iris-versicolor | Sepal length | m=5.936, sd=0.511 |
Iris-versicolor | Sepal width | m=2.770, sd=0.311 |
Iris-versicolor | Petal length | m=4.260, sd=0.465 |
Iris-versicolor | Petal width | m=1.326, sd=0.196 |
Iris-virginica | Sepal length | m=6.588, sd=0.629 |
Iris-virginica | Sepal width | m=2.974, sd=0.319 |
Iris-virginica | Petal length | m=5.552, sd=0.546 |
Iris-virginica | Petal width | m=2.026, sd=0.272 |

We can compute probability densities using the mean and standard deviation with the “probability density function,” known as `dnorm` in the R language and denoted $P_D$ in the calculations below. The probability density function gives the y-axis value of a plot of the probability distribution. A probability density is not a probability, but we can get probabilities at the end by normalizing (dividing each result, as a density, by the sum of densities).
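For readers not using R, the Gaussian density is easy to compute directly. A sketch of a `dnorm` equivalent in Python (using only the standard library):

```python
import math

def dnorm(x, mean, sd):
    """Gaussian probability density function, matching R's dnorm(x, mean, sd)."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# A density is not a probability: with a small sd, the peak exceeds 1.
# E.g., the Iris-setosa petal width density at its own mean (m=0.244, sd=0.106):
print(dnorm(0.244, 0.244, 0.106))  # greater than 1
```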

The resulting Bayesian network representing the relationships of the attributes and class again has an arrow from the class (species) node to each of the four attribute nodes. This network illustrates the “naïve assumption,” namely that the attributes are independent of each other (there are no arrows between attributes).

Now, suppose we have measurements of a new flower:

Sepal length | Sepal width | Petal length | Petal width |
---|---|---|---|
5.8 | 2.8 | 4.0 | 1.4 |

We can perform naïve Bayesian classification by computing the probability of each class given these measurements. Judging from the class means above, the flower is pretty clearly an example of Iris-versicolor.

First, the probability of Iris-setosa (IS), taking $P(\text{IS}) = 1/3$:

$$P(\text{IS}|5.8, 2.8, 4.0, 1.4) = \frac{P(\text{IS}) \, P_D(5.8; 5.006, 0.349) \, P_D(2.8; 3.418, 0.377) \, P_D(4.0; 1.464, 0.172) \, P_D(1.4; 0.244, 0.106)}{P_D(5.8; 5.843, 0.825) \, P_D(2.8; 3.054, 0.432) \, P_D(4.0; 3.759, 1.759) \, P_D(1.4; 1.199, 0.761)} \approx 0.0$$

Due to the much larger petal length and width than is typical for Iris-setosa, the probability density is virtually 0.0. Now, consider Iris-versicolor (IV):

$$P(\text{IV}|5.8, 2.8, 4.0, 1.4) = \frac{P(\text{IV}) \, P_D(5.8; 5.936, 0.511) \, P_D(2.8; 2.770, 0.311) \, P_D(4.0; 4.260, 0.465) \, P_D(1.4; 1.326, 0.196)}{P_D(5.8; 5.843, 0.825) \, P_D(2.8; 3.054, 0.432) \, P_D(4.0; 3.759, 1.759) \, P_D(1.4; 1.199, 0.761)} \approx 10.4$$

This result is >1 because we are using $P_D$, the probability density function, which is not the same as a probability. But we can normalize these results at the end to get actual probabilities.

Finally, we’ll check on Iris-virginica (IVG):

$$P(\text{IVG}|5.8, 2.8, 4.0, 1.4) = \frac{P(\text{IVG}) \, P_D(5.8; 6.588, 0.629) \, P_D(2.8; 2.974, 0.319) \, P_D(4.0; 5.552, 0.546) \, P_D(1.4; 2.026, 0.272)}{P_D(5.8; 5.843, 0.825) \, P_D(2.8; 3.054, 0.432) \, P_D(4.0; 3.759, 1.759) \, P_D(1.4; 1.199, 0.761)} \approx 0.003$$

We can already see that Iris-versicolor received by far the largest score. To get actual probabilities, we divide each result by the sum of the three scores, giving Iris-versicolor a probability near 1.0.
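The whole Gaussian naïve Bayes calculation can be sketched in a few lines of Python. The structure and names below are my own; the means and standard deviations come from the table above, and a uniform prior of 1/3 per species is assumed:

```python
import math

def dnorm(x, mean, sd):
    """Gaussian probability density function (R's dnorm, the P_D above)."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# (mean, sd) per attribute: sepal length, sepal width, petal length, petal width
params = {
    "Iris-setosa":     [(5.006, 0.349), (3.418, 0.377), (1.464, 0.172), (0.244, 0.106)],
    "Iris-versicolor": [(5.936, 0.511), (2.770, 0.311), (4.260, 0.465), (1.326, 0.196)],
    "Iris-virginica":  [(6.588, 0.629), (2.974, 0.319), (5.552, 0.546), (2.026, 0.272)],
}
prior_params = [(5.843, 0.825), (3.054, 0.432), (3.759, 1.759), (1.199, 0.761)]

def classify(x):
    """Return normalized class probabilities for measurement vector x."""
    # Product of prior densities (the denominator); it cancels in the
    # normalization but is kept to match the hand calculation.
    evidence = math.prod(dnorm(v, m, s) for v, (m, s) in zip(x, prior_params))
    scores = {c: (1/3) * math.prod(dnorm(v, m, s) for v, (m, s) in zip(x, ps)) / evidence
              for c, ps in params.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

probs = classify([5.8, 2.8, 4.0, 1.4])
print(max(probs, key=probs.get))  # Iris-versicolor
```

Because the evidence term is the same for every class, dropping it (and comparing only the numerators) gives the same ranking; many implementations do exactly that, and work with log-densities to avoid underflow.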