# Naïve Bayesian inference

Recall from the Bayesian inference notes that Bayes’ theorem allows us to take a known probabilistic relationship, such as $P(\text{toothache}|\text{gum disease},\neg\text{cavity}) = 0.3$, and reason backwards. For example, we can determine $P(\text{gum disease}|\text{toothache})$, read “the probability that the patient has gum disease given that they have a toothache,” assuming we have knowledge of the various “conditionals” such as $P(\text{toothache}|\text{gum disease},\text{cavity})$ and so on, and the “priors” $P(\text{gum disease})$ and $P(\text{cavity})$.

We can call “toothache” the *evidence* and “gum disease” the *prediction*. Naturally, we want to *learn* the probabilities from training data rather than code them by hand. Doing so efficiently and in a way that yields an efficient Bayesian inference procedure requires that we make the “naïve assumption.”

## Without the naïve assumption

Suppose we have some training examples:

Coat Color | Hat Color | Gentry? |
---|---|---|
Black | Black | Yes |
Black | Black | No |
Black | Brown | Yes |
Blue | Black | No |
Blue | Brown | No |
Blue | Brown | No |
Brown | Black | Yes |
Brown | Brown | No |

Our predicted column will be Gentry and the evidence columns will be Coat and Hat Color. Bayesian inference treats this as a probabilistic inference problem: $P(\text{Gentry}|\text{Coat=Black}, \text{Hat=Brown}) = ?$.

To compute that probability, we apply Bayes’ rule:

$$P(\text{Gentry}|\text{Coat=Black},\text{Hat=Brown}) = \frac{P(\text{Coat=Black},\text{Hat=Brown}|\text{Gentry}) \cdot P(\text{Gentry})}{P(\text{Coat=Black},\text{Hat=Brown})}$$

So far so good. We can easily learn $P(\text{Gentry})$ from the data:

$$P(\text{Gentry=Yes}) = 3/8 = 0.375, \qquad P(\text{Gentry=No}) = 5/8 = 0.625$$

But $P(\text{Coat=Black},\text{Hat=Brown}|\text{Gentry})$ and $P(\text{Coat=Black},\text{Hat=Brown})$ start to spell trouble. In this simple example, they are easy to compute: we can just count rows in the table. But notice that $P(\text{Coat},\text{Hat}|\text{Gentry})$ represents a table containing $3 \times 2 \times 2 = 12$ probabilities, one for every combination of coat color, hat color, and gentry status. Consider the table for Gentry=Yes (each denominator is 3 because there are 3 cases of Gentry=Yes):

Coat Color | Hat Color | $P(\text{Coat},\text{Hat}|\text{Gentry=Yes})$ |
---|---|---|
Black | Black | 1/3 = 0.33 |
Black | Brown | 1/3 = 0.33 |
Blue | Black | 0/3 = 0.0 |
Blue | Brown | 0/3 = 0.0 |
Brown | Black | 1/3 = 0.33 |
Brown | Brown | 0/3 = 0.0 |

We also have a table for the probability of Coat and Hat Color co-occurring (each denominator is 8 because there are 8 training examples in total):

Coat Color | Hat Color | $P(\text{Coat},\text{Hat})$ |
---|---|---|
Black | Black | 2/8 = 0.25 |
Black | Brown | 1/8 = 0.125 |
Blue | Black | 1/8 = 0.125 |
Blue | Brown | 2/8 = 0.25 |
Brown | Black | 1/8 = 0.125 |
Brown | Brown | 1/8 = 0.125 |
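To make the counting concrete, here is a small Python sketch (the data encoding is mine) that recovers these joint probabilities from the eight training rows and applies Bayes’ rule to compute $P(\text{Gentry=Yes}|\text{Coat=Black},\text{Hat=Brown})$:

```python
# Each training example: (coat color, hat color, gentry?)
data = [
    ("black", "black", True),  ("black", "black", False),
    ("black", "brown", True),  ("blue",  "black", False),
    ("blue",  "brown", False), ("blue",  "brown", False),
    ("brown", "black", True),  ("brown", "brown", False),
]

n = len(data)
gentry_rows = [row for row in data if row[2]]

# Prior: P(Gentry=Yes) = 3/8
prior = len(gentry_rows) / n

# From the joint table: P(Coat=Black, Hat=Brown | Gentry=Yes) = 1/3
likelihood = sum(1 for c, h, _ in gentry_rows
                 if c == "black" and h == "brown") / len(gentry_rows)

# Evidence: P(Coat=Black, Hat=Brown) = 1/8
evidence = sum(1 for c, h, _ in data
               if c == "black" and h == "brown") / n

# Bayes' rule
posterior = likelihood * prior / evidence
print(posterior)  # (1/3 * 3/8) / (1/8) ≈ 1.0
```

With the full joint table, the answer is exactly 1.0: the only black-coat, brown-hat wearer in the training data is gentry.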

**Now imagine we had $n$ different attributes**, ignoring the predicted attribute. Bayes’ rule gives us:

$$P(\text{pred}|a_1,a_2,\dots,a_n) = \frac{P(a_1,a_2,\dots,a_n|\text{pred}) \cdot P(\text{pred})}{P(a_1,a_2,\dots,a_n)}$$

The tables for $P(a_1,a_2,\dots,a_n|\text{pred})$ and $P(a_1,a_2,\dots,a_n)$ will have $2^n$ rows for each value of pred, assuming binary attributes. In other words, a simplistic Bayesian approach to the problem yields exponentially large models. That is unacceptable.

## With the naïve assumption

In order to reduce the complexity of the model, we’ll make a “naïve assumption.” We’ll assume all the evidence attributes (e.g., Coat Color and Hat Color) are **independent events**. This is probably not true at all: most gentry will intentionally coordinate their coat and hat color choices. But we make this assumption for expedience, even if it does not match reality.

By making this assumption, we transform the exponential model into a linear model:

$$P(\text{pred}|a_1,a_2,\dots,a_n) = \frac{P(a_1|\text{pred}) \cdot P(a_2|\text{pred}) \cdots P(a_n|\text{pred}) \cdot P(\text{pred})}{P(a_1) \cdot P(a_2) \cdots P(a_n)}$$

or, stated more compactly,

$$P(\text{pred}|a_1,a_2,\dots,a_n) = \frac{P(\text{pred}) \prod_{i=1}^{n} P(a_i|\text{pred})}{\prod_{i=1}^{n} P(a_i)}$$
Now our probability tables are very small. First, the probability of a coat color if you know the gentry status:

Coat Color | $P(\text{Coat}|\text{Gentry=Yes})$ |
---|---|
Black | 2/3 = 0.66 |
Blue | 0/3 = 0.0 |
Brown | 1/3 = 0.33 |

Now the probability of hat color if you know the gentry status:

Hat Color | $P(\text{Hat}|\text{Gentry=Yes})$ |
---|---|
Black | 2/3 = 0.66 |
Brown | 1/3 = 0.33 |

Now the “prior” probability of coat color:

Coat Color | $P(\text{Coat})$ |
---|---|
Black | 3/8 = 0.375 |
Blue | 3/8 = 0.375 |
Brown | 2/8 = 0.25 |

And the “prior” probability of hat color:

Hat Color | $P(\text{Hat})$ |
---|---|
Black | 4/8 = 0.5 |
Brown | 4/8 = 0.5 |
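Using these small per-attribute tables, the naïve computation can be sketched in Python (variable and function names are mine); it multiplies the per-attribute conditionals and priors rather than consulting a joint table:

```python
# Each training example: (coat color, hat color, gentry?)
data = [
    ("black", "black", True),  ("black", "black", False),
    ("black", "brown", True),  ("blue",  "black", False),
    ("blue",  "brown", False), ("blue",  "brown", False),
    ("brown", "black", True),  ("brown", "brown", False),
]

def naive_score(coat, hat, gentry):
    """P(gentry) * P(coat|gentry) * P(hat|gentry) / (P(coat) * P(hat))."""
    n = len(data)
    rows = [r for r in data if r[2] == gentry]
    prior = len(rows) / n
    p_coat_given = sum(1 for c, _, _ in rows if c == coat) / len(rows)
    p_hat_given = sum(1 for _, h, _ in rows if h == hat) / len(rows)
    p_coat = sum(1 for c, _, _ in data if c == coat) / n
    p_hat = sum(1 for _, h, _ in data if h == hat) / n
    return prior * p_coat_given * p_hat_given / (p_coat * p_hat)

yes = naive_score("black", "brown", True)   # (3/8)(2/3)(1/3) / ((3/8)(1/2)) ≈ 0.444
no = naive_score("black", "brown", False)   # (5/8)(1/5)(3/5) / ((3/8)(1/2)) = 0.4
print(yes / (yes + no))  # normalized probability of gentry ≈ 0.526
```

Note that the naïve estimate need not match what counting rows in the full joint table would give for the same evidence; that discrepancy is the price of the independence assumption.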

The resulting Bayesian network, with the naïve assumption in place, looks like this:

## Gentry example in ProbLog

ProbLog can learn probabilities from data. Rather than give a fixed probability in front of a predicate, e.g., `0.5::coat(black)`, we can indicate the probability is “tunable” with `t()`: `t(_)::coat(black)`. Alternatively, we can start with an initial probability that will be updated during training: `t(0.5)::coat(black)`.

Here is our domain logic:

```
coatColor(black).
coatColor(blue).
coatColor(brown).
hatColor(black).
hatColor(brown).
t(_, CoatColor)::coat(CoatColor) :- coatColor(CoatColor).
t(_, HatColor)::hat(HatColor) :- hatColor(HatColor).
t(_, CoatColor, HatColor)::gentry :-
    coatColor(CoatColor), hatColor(HatColor),
    coat(CoatColor), hat(HatColor).
```

And here is our evidence file, with different cases separated by dashes:

```
evidence(gentry).
evidence(coat(black)).
evidence(coat(blue), false).
evidence(coat(brown), false).
evidence(hat(black)).
evidence(hat(brown), false).
evidence(hat(blue), false).
-----
evidence(gentry, false).
evidence(coat(black)).
evidence(coat(blue), false).
evidence(coat(brown), false).
evidence(hat(black)).
evidence(hat(brown), false).
evidence(hat(blue), false).
-----
evidence(gentry).
evidence(coat(black)).
evidence(coat(blue), false).
evidence(coat(brown), false).
evidence(hat(brown)).
evidence(hat(black), false).
evidence(hat(blue), false).
-----
evidence(gentry, false).
evidence(coat(blue)).
evidence(coat(black), false).
evidence(coat(brown), false).
evidence(hat(black)).
evidence(hat(blue), false).
evidence(hat(brown), false).
-----
evidence(gentry, false).
evidence(coat(blue)).
evidence(coat(black), false).
evidence(coat(brown), false).
evidence(hat(brown)).
evidence(hat(black), false).
evidence(hat(blue), false).
-----
evidence(gentry, false).
evidence(coat(blue)).
evidence(coat(black), false).
evidence(coat(brown), false).
evidence(hat(brown)).
evidence(hat(black), false).
evidence(hat(blue), false).
-----
evidence(gentry).
evidence(coat(brown)).
evidence(coat(black), false).
evidence(coat(blue), false).
evidence(hat(black)).
evidence(hat(brown), false).
evidence(hat(blue), false).
-----
evidence(gentry, false).
evidence(coat(brown)).
evidence(coat(black), false).
evidence(coat(blue), false).
evidence(hat(brown)).
evidence(hat(black), false).
evidence(hat(blue), false).
```

Now we run the ProbLog command to learn the tunable probabilities from the evidence:

```
problog lfi -O gentry.model gentry.pl gentry-evidence.pl
```

The resulting model is:

```
coatColor(black).
coatColor(blue).
coatColor(brown).
hatColor(black).
hatColor(brown).
0.375::coat(blue) :- coatColor(blue).
0.25::coat(brown) :- coatColor(brown).
0.375::coat(black) :- coatColor(black).
0.5::hat(black) :- hatColor(black).
0.5::hat(brown) :- hatColor(brown).
0.0::gentry :- coatColor(brown), hatColor(brown), coat(brown), hat(brown).
0.0::gentry :- coatColor(blue), hatColor(brown), coat(blue), hat(brown).
0.0::gentry :- coatColor(blue), hatColor(black), coat(blue), hat(black).
0.999999999702::gentry :- coatColor(black), hatColor(brown), coat(black), hat(brown).
0.5::gentry :- coatColor(black), hatColor(black), coat(black), hat(black).
0.999999999702::gentry :- coatColor(brown), hatColor(black), coat(brown), hat(black).
```
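Notice that the learned `gentry` probabilities are just the conditional frequencies of Gentry=Yes for each coat/hat combination in the training data. A quick Python sketch (the data encoding is mine) verifies this:

```python
from collections import defaultdict

# Each training example: (coat color, hat color, gentry?)
data = [
    ("black", "black", True),  ("black", "black", False),
    ("black", "brown", True),  ("blue",  "black", False),
    ("blue",  "brown", False), ("blue",  "brown", False),
    ("brown", "black", True),  ("brown", "brown", False),
]

# (coat, hat) -> [count of Gentry=Yes, total count]
counts = defaultdict(lambda: [0, 0])
for coat, hat, gentry in data:
    counts[(coat, hat)][0] += int(gentry)
    counts[(coat, hat)][1] += 1

for (coat, hat), (yes, total) in sorted(counts.items()):
    print(coat, hat, yes / total)
```

For example, black coat and black hat occurs twice, once with Gentry=Yes, giving the learned 0.5; black coat and brown hat occurs once, with Gentry=Yes, matching the learned 0.999999999702 (LFI’s numerical approximation of 1.0).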

## A larger example

Consider the Iris dataset. In this case, the four attributes are continuous rather than discrete, so we’ll model them with Gaussian distributions. Thus, we need to compute the mean and standard deviation for each attribute, both as priors (across all the instances) and conditionals (for each subset of iris species).

Class (or “Prior”) | Attribute | Mean, standard deviation |
---|---|---|
(Prior) | Sepal length | m=5.843, sd=0.825 |
(Prior) | Sepal width | m=3.054, sd=0.432 |
(Prior) | Petal length | m=3.759, sd=1.759 |
(Prior) | Petal width | m=1.199, sd=0.761 |
Iris-setosa | Sepal length | m=5.006, sd=0.349 |
Iris-setosa | Sepal width | m=3.418, sd=0.377 |
Iris-setosa | Petal length | m=1.464, sd=0.172 |
Iris-setosa | Petal width | m=0.244, sd=0.106 |
Iris-versicolor | Sepal length | m=5.936, sd=0.511 |
Iris-versicolor | Sepal width | m=2.770, sd=0.311 |
Iris-versicolor | Petal length | m=4.260, sd=0.465 |
Iris-versicolor | Petal width | m=1.326, sd=0.196 |
Iris-virginica | Sepal length | m=6.588, sd=0.629 |
Iris-virginica | Sepal width | m=2.974, sd=0.319 |
Iris-virginica | Petal length | m=5.552, sd=0.546 |
Iris-virginica | Petal width | m=2.026, sd=0.272 |

We can compute probability densities from the mean and standard deviation with the “probability density function,” known as `dnorm` in the R language, and denoted $P_D$ in the calculations below. The probability density function gives the y-axis value of a plot of the probability distribution. A probability density is not a probability, but we can obtain probabilities at the end by normalizing (dividing each result, as a density, by the sum of the densities).

The resulting Bayesian network representing the relationships of the attributes and class (species) is shown below. This network illustrates the “naïve assumption,” namely that the attributes are independent of each other (there are no arrows between attributes).

Now, suppose we have measurements of a new flower:

Sepal length | Sepal width | Petal length | Petal width |
---|---|---|---|
5.8 | 2.8 | 4.0 | 1.4 |

We can perform naïve Bayesian classification by computing the probability of each class given these measurements. Comparing the measurements to the class means above, it is pretty clearly an example of Iris-versicolor.

First, the probability of Iris-setosa (IS). We write $P_D(x; m, sd)$ for the Gaussian density with the given mean and standard deviation, and use a class prior of $1/3$ since the Iris dataset has 50 examples of each species:

$$P(\text{IS}|5.8, 2.8, 4.0, 1.4) = \frac{P(\text{IS}) \cdot P_D(5.8; 5.006, 0.349) \cdot P_D(2.8; 3.418, 0.377) \cdot P_D(4.0; 1.464, 0.172) \cdot P_D(1.4; 0.244, 0.106)}{P_D(5.8; 5.843, 0.825) \cdot P_D(2.8; 3.054, 0.432) \cdot P_D(4.0; 3.759, 1.759) \cdot P_D(1.4; 1.199, 0.761)} \approx 0.0$$

Due to the much larger petal length and width than is typical for Iris-setosa, the probability density is virtually 0.0. Now, consider Iris-versicolor (IVe):

$$P(\text{IVe}|5.8, 2.8, 4.0, 1.4) = \frac{P(\text{IVe}) \cdot P_D(5.8; 5.936, 0.511) \cdot P_D(2.8; 2.770, 0.311) \cdot P_D(4.0; 4.260, 0.465) \cdot P_D(1.4; 1.326, 0.196)}{P_D(5.8; 5.843, 0.825) \cdot P_D(2.8; 3.054, 0.432) \cdot P_D(4.0; 3.759, 1.759) \cdot P_D(1.4; 1.199, 0.761)} \approx 10.45$$

This result is greater than 1 because we are using $P_D$, the probability density function, which is not the same as a probability. But we can normalize these results at the end to get actual probabilities.

Finally, we’ll check on Iris-virginica (IVi):

$$P(\text{IVi}|5.8, 2.8, 4.0, 1.4) = \frac{P(\text{IVi}) \cdot P_D(5.8; 6.588, 0.629) \cdot P_D(2.8; 2.974, 0.319) \cdot P_D(4.0; 5.552, 0.546) \cdot P_D(1.4; 2.026, 0.272)}{P_D(5.8; 5.843, 0.825) \cdot P_D(2.8; 3.054, 0.432) \cdot P_D(4.0; 3.759, 1.759) \cdot P_D(1.4; 1.199, 0.761)} \approx 0.003$$

We can already see that Iris-versicolor got the highest score. To get actual probabilities, we divide each result by the sum of the three, giving Iris-versicolor a probability near 1.0.
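The whole calculation can be sketched in Python. The means and standard deviations come from the table above; the class prior of 1/3 assumes the standard balanced Iris dataset (50 examples per species):

```python
import math

def dnorm(x, mean, sd):
    """Gaussian probability density, like R's dnorm."""
    return math.exp(-(x - mean) ** 2 / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# (mean, sd) per attribute: sepal length, sepal width, petal length, petal width
params = {
    "Iris-setosa":     [(5.006, 0.349), (3.418, 0.377), (1.464, 0.172), (0.244, 0.106)],
    "Iris-versicolor": [(5.936, 0.511), (2.770, 0.311), (4.260, 0.465), (1.326, 0.196)],
    "Iris-virginica":  [(6.588, 0.629), (2.974, 0.319), (5.552, 0.546), (2.026, 0.272)],
}
prior_params = [(5.843, 0.825), (3.054, 0.432), (3.759, 1.759), (1.199, 0.761)]

x = [5.8, 2.8, 4.0, 1.4]  # the new flower's measurements

# Density of the evidence under the attribute priors (the denominator)
evidence = math.prod(dnorm(xi, m, s) for xi, (m, s) in zip(x, prior_params))

scores = {}
for species, attrs in params.items():
    likelihood = math.prod(dnorm(xi, m, s) for xi, (m, s) in zip(x, attrs))
    scores[species] = (1 / 3) * likelihood / evidence  # class prior assumed 1/3

# Normalize the density-based scores into probabilities
total = sum(scores.values())
for species, score in scores.items():
    print(species, round(score, 4), round(score / total, 4))
```

Running this shows Iris-versicolor’s raw score is greater than 1 (it is a density ratio, not a probability), and after normalizing it receives nearly all of the probability mass.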