Bayes Rule underpins many important models in Machine Learning. Here I will try to summarize the Bayesian classification models that I know of. I won't pretend to cover any meaningful fraction of the models that use Bayes Rule; instead, I will delve into some of the Machine Learning and Deep Learning concepts that rely on it, and I will strive to update this article as I expand my reading on ML.

For the sake of completeness, and out of respect for every reader of this article, let us start with the obvious: what is Bayes Rule? In the literal sense, Bayes Rule finds the likelihood of a hypothesis based on evidence (the posterior likelihood) without neglecting the prior odds. That sounds like a lot of jargon and some obscure words, but "prior" and "posterior" lend a lot of brevity when discussing Bayesian concepts.

What are prior odds? Let me explain with an example. A friend recently saw a UFO-like object from her rooftop. Does she start believing in extraterrestrial beings now that she has seen a UFO? This brings up another important question: did she believe in aliens before seeing the object? People tend to hold strong opinions about aliens, so this single piece of new evidence should hardly change her opinion about their existence. If she did not believe that aliens existed before seeing the UFO, this new evidence should not necessarily point towards an alien presence. Here, the likelihood of the existence of aliens before she spotted the UFO is the prior probability. The likelihood of the existence of aliens given that she spotted the UFO is the posterior probability. Julia Galef, the host of the Rationally Speaking podcast, has a pretty amazing take on this.

In technical terms, Bayes Rule can be stated as follows:

P(A|B) = P(B|A) · P(A) / P(B)

In our example: A is the event that aliens exist (a continuum, rather). B is the event of my friend spotting a UFO. P(A|B) is the probability that aliens exist given that my friend spotted the UFO. P(B|A) is the probability that aliens would fly over my friend's roof, given that they exist. P(A) is the probability that aliens exist. P(B) is the probability of my friend spotting a UFO. Basically, the conclusion depends a lot (it cannot be quantified precisely, but consider the extreme swing in values in this case) on the prior probability.

A word of caution, more like a bailout, for my usage of the terms "probability" and "likelihood". The nice thing about English is that it isn't that arcane, and in everyday speech probability and likelihood can be used interchangeably. Mathematicians would have me ousted from their houses on a dinner night for making this mistake, since the two have distinct technical meanings. But, for my love of the English language, I will be using likelihood and probability interchangeably.

Now, let's get back on track: the use of Bayes Rule in Machine Learning concepts.

## Naive Bayesian Classifier:

This, as the name suggests, is a classification model based on Bayes Rule. The idea is to predict a class given multiple features, under the assumption that all features are conditionally independent of each other given the class, which is why it is called Naive. This reduces a larger Bayes Rule expression into a smaller one:

P(C | x1, …, xn) ∝ P(C) · P(x1|C) · P(x2|C) · ⋯ · P(xn|C)

We estimate the probabilities in the above expression from the training dataset and evaluate the model on the test dataset. To classify a new example, we predict the class with the maximum probability given its features.
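The steps above can be sketched from scratch in a few lines. The toy spam/ham corpus is made up for illustration; the classifier counts words per class, then picks the class maximizing log P(C) plus the sum of log P(word|C). (It quietly uses add-one smoothing so that unseen words don't send a log probability to minus infinity; more on that in the next section.)

```python
import math
from collections import Counter, defaultdict

# Tiny made-up training corpus of (text, label) pairs.
docs = [("buy cheap pills", "spam"),
        ("cheap pills now", "spam"),
        ("meeting at noon", "ham"),
        ("lunch meeting today", "ham")]

class_counts = Counter(label for _, label in docs)
word_counts = defaultdict(Counter)        # word_counts[label][word] = count
for text, label in docs:
    word_counts[label].update(text.split())
vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    # argmax over classes of log P(C) + sum_i log P(x_i | C)
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(docs))
        for w in text.split():
            # add-one smoothing keeps a single unseen word from zeroing the product
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("cheap pills now"))   # → spam
print(predict("lunch today"))       # → ham
```

Working in log space is the standard trick here: a product of many small probabilities underflows quickly, while the sum of their logs does not.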

## Laplace Smoothing:

The product of terms in Naive Bayes causes a problem when any one of the conditional probability (likelihood) terms is 0: the whole product collapses to 0. To solve this, a pseudo-count is added to the observations. To explain with our previous example, suppose the probability of aliens flying over my friend's roof is considered to be 0. It's quite reasonable to assert that, but for the quote, "Of all the gin joints, in all the towns, in all the universe". Even if we assume it to be 0, we cannot neglect the prior probability and knock it out by multiplying by 0. This example is a bit silly; a count-based example is more prudent here, for instance, classifying an email as spam or ham. If our training data doesn't contain a spam email with the word "Area51", that doesn't mean an email with the word "Area51" cannot be spam.

## Bernoulli Naive Bayes:

So far, we haven't really talked about the distribution of the conditional probability terms. But for the sake of this heading, we will.