
Bayes Rule underlies a lot of important models in Machine Learning. Here I will try to summarize the Bayesian classification models that I know of. I do not claim to cover more than a small fraction of the models that use Bayes Rule; instead, I will delve into some of the Machine Learning and Deep Learning concepts that rely on it, and I will strive to update the article as I read more of the ML literature.
For the sake of completeness, and to be respectful of the whole audience of this article, let us start with the obvious: what is Bayes Rule? Put simply, Bayes Rule is a way of finding the likelihood of a hypothesis based on evidence (the posterior likelihood) without neglecting the prior odds. That sounds like a lot of jargon and obscure words, but "prior" and "posterior" buy us a lot of brevity when discussing Bayesian concepts.
What are prior odds? Let me try to explain this with an example. A friend recently saw a UFO-like object from her roof-top. Does she start believing in extraterrestrial beings now that she has seen a UFO? This brings up another important question: did she believe in aliens before seeing the object? Since people tend to have strong opinions about aliens, this new evidence should hardly change her opinion about their existence. If she did not believe that aliens existed before seeing the UFO, this new evidence should not necessarily point towards an alien presence. Here, the likelihood of the existence of aliens before she spotted the UFO is the prior probability; the likelihood of the existence of aliens given that she spotted the UFO is the posterior probability. Julia Galef, the host of the Rationally Speaking podcast, has a pretty amazing take on this.
In technical terms, Bayes Rule can be stated as the following:
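P(A|B) = P(B|A) × P(A) / P(B)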
In our example: A is the event that aliens exist (really more of a continuum than a yes/no event). B is the event of my friend spotting a UFO. P(A|B) is the probability that aliens exist given that my friend spotted a UFO. P(B|A) is the probability that aliens fly over my friend's roof, given that they exist. P(A) is the probability that aliens exist. P(B) is the probability of my friend spotting a UFO. Basically, the conclusion depends a lot (this cannot be quantified in general, but here the swing in values is extreme) on the prior probability.
A word of caution, or more of a bailout for my usage of terminology: probability versus likelihood. The nice thing about English is that it is not that arcane, and in everyday speech probability and likelihood can be used interchangeably. Mathematicians would have me ousted from their houses on a dinner night for making this mistake, but, for my love of the English language, I will be using likelihood and probability interchangeably.
Now, let’s get back on the track of the use of Bayesian rule in Machine Learning concepts.
Naive Bayesian Classifier:
This, as the name suggests, is a classification model based on Bayes Rule. The idea is to predict a class given multiple features, under the assumption that all features are conditionally independent of each other given the class, which is why it is called naive. This assumption reduces a larger Bayes Rule expression into a smaller one:
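P(C | x1, …, xn) ∝ P(C) × P(x1 | C) × P(x2 | C) × … × P(xn | C)
where C is the class and x1, …, xn are the features; conditional independence is what lets the joint likelihood factor into a product of per-feature likelihoods.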
We estimate the probabilities in the above expression from the training dataset and evaluate our model on the test dataset. To classify an example, we predict the class with the maximum probability given its features.
Laplace Smoothing:
The product of terms in Naive Bayes can cause a problem when one of the conditional probability (likelihood) terms is 0. To solve this, a pseudo count is added to the observations. To explain this with our previous example, let us say that the probability of aliens flying over my friend's roof is considered to be 0. It is quite reasonable to assert that, but for the quote, "Of all the gin joints, in all the towns, in all the universe." Even if we assume it to be essentially 0, we cannot neglect the prior probability and knock it out by multiplying by 0. That example is admittedly a stretch; a count-based example is more apt here, for example classifying an email as spam or ham. Just because our training data contains no spam email with the word "Area51" does not mean that an email containing "Area51" cannot be spam.
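To make the pseudo count concrete, here is a minimal from-scratch sketch of a count-based Naive Bayes spam classifier with add-one (Laplace) smoothing. The toy emails, the word lists, and the choice of alpha = 1 are purely illustrative.

```python
import math
from collections import Counter

# Toy training corpus of (tokenized email, label) pairs; purely illustrative.
train = [
    (["win", "money", "now"], "spam"),
    (["area51", "footage", "leaked"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "money", "tomorrow"], "ham"),
]

alpha = 1  # the Laplace pseudo count

# Count words per class and build the vocabulary.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
vocab = set()
for words, label in train:
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def word_likelihood(word, label):
    # P(word | class) with Laplace smoothing: a word never seen with this
    # class gets a small non-zero probability instead of zeroing the product.
    return (word_counts[label][word] + alpha) / (
        sum(word_counts[label].values()) + alpha * len(vocab)
    )

def classify(words):
    scores = {}
    for label in class_counts:
        prior = class_counts[label] / sum(class_counts.values())
        scores[label] = math.log(prior) + sum(
            math.log(word_likelihood(w, label)) for w in words
        )
    return max(scores, key=scores.get)

# "area51" never appears in a ham email, yet ham is not knocked out to zero.
print(classify(["area51", "money"]))  # -> 'spam'
```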
Bernoulli Naive Bayes:
So far, we haven’t really talked about the conditional probability terms’ distribution. But for the sake of this heading, we will.
If our random variable x|C follows a Bernoulli Distribution, that is, each feature can take one of two possible outcomes (success/failure), we use Bernoulli Naive Bayes. For example, the presence or absence of the word "Area51" is such a feature when classifying a document as spam.
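As a quick sketch (the toy dataset below is made up for illustration), scikit-learn's BernoulliNB works directly on such binary presence/absence features:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Each row is a document and each column marks the presence (1) or absence (0)
# of a word; say the columns are ["area51", "money", "meeting"].
X = np.array([
    [1, 1, 0],   # spam
    [1, 0, 0],   # spam
    [0, 0, 1],   # ham
    [0, 1, 1],   # ham
])
y = np.array(["spam", "spam", "ham", "ham"])

model = BernoulliNB(alpha=1.0)  # alpha is the Laplace smoothing count
model.fit(X, y)

# A new document that contains only "area51".
print(model.predict([[1, 0, 0]]))  # -> ['spam']
```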
Multinomial Naive Bayes:
Similar to Bernoulli Naive Bayes, except here all our random variables xi|C follow a Multinomial Distribution. The multinomial distribution is a generalization of the binomial distribution: the binomial has just two possible outcomes in each trial, and our random variable is the number of successes (or failures), while the multinomial allows more than two outcomes per trial. Using combinatorics, its probability mass function can be written as the following:
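P(x1, …, xk) = n! / (x1! × … × xk!) × p1^x1 × … × pk^xk, where n = x1 + … + xk is the total number of trials, xi is the number of times outcome i occurs, and pi is the probability of outcome i.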
Substituting the above into the product of independent conditional likelihoods times our prior probability, we get:
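P(C | x1, …, xk) ∝ P(C) × (x1 + … + xk)! / (x1! × … × xk!) × p1^x1 × … × pk^xk, where the pi are now the outcome (say, word) probabilities under class C. Note that the combinatorial factor does not depend on the class, so it plays no role when we pick the class with the maximum probability.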
That is all: Multinomial Naive Bayes is just a special case of Naive Bayes. An interesting thing happens if we take the logarithm of both sides of the above equation: the model becomes a linear classifier, because the logarithm has the amazing property of turning products into summations. This property of logarithms is used a lot when computing loss functions in ML models. Similarly, we can assume other discrete or continuous distributions for the likelihood terms in Naive Bayes, for example the Poisson distribution or the Gaussian distribution.
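To see the linearity concretely, here is a small sketch using scikit-learn's MultinomialNB on made-up word counts. In log space, the per-class score is just log P(C) plus a dot product between the count vector and the per-class log probabilities, i.e. a linear function of the features.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Word-count features, e.g. counts of ["area51", "money", "meeting"].
X = np.array([
    [3, 2, 0],   # spam
    [1, 1, 0],   # spam
    [0, 0, 2],   # ham
    [0, 1, 3],   # ham
])
y = np.array(["spam", "spam", "ham", "ham"])

model = MultinomialNB(alpha=1.0).fit(X, y)

# Taking logs turns the product of likelihoods into a sum, so the per-class
# score is linear in the count vector:
#   score(C) = log P(C) + sum_i x_i * log p_i|C
x_new = np.array([2, 1, 0])
scores = model.class_log_prior_ + x_new @ model.feature_log_prob_.T
print(model.classes_[np.argmax(scores)])  # linear score computed by hand
print(model.predict([x_new])[0])          # the same answer from sklearn
```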
Bayesian Methods for Neural Networks:
To prevent overfitting in neural networks, we use various regularization techniques. One such technique was presented in a seminal 2015 paper, Weight Uncertainty in Neural Networks. In this technique, weights and biases are represented by probability distributions over possible values rather than by single point estimates; the idea is to learn how much each weight can vary given the training data. We assume a Normal distribution for each weight and bias (the learning parameters) of the network and then try to learn its probability density, which in practice means learning the mean and the standard deviation of every weight and bias.
This gives us an ensemble of neural networks, where each network has its weights and biases sampled from the learnt probability distributions. The number of learning parameters doubles in this technique. Because the loss function does not have a closed-form integral, Monte Carlo estimates of the gradient are used to approximate the Bayesian inference. Libraries such as PyTorch or PyMC3 (https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-is-dead-long-live-theano-d8005f8a0e9b) make it fairly simple to train a Bayesian neural network.
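As a rough illustration rather than the paper's full recipe, here is a minimal PyTorch sketch of a single Bayesian linear layer in the spirit of Bayes by Backprop: each weight gets a mean and a rho parameter, the standard deviation is derived from rho via a softplus, and weights are sampled with the reparameterization trick on every forward pass. The KL (complexity cost) term against the prior, which the paper adds to the loss, is omitted for brevity, and the initial values are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Linear layer whose weights are Normal distributions (mean, rho)
    rather than point estimates. Note that the parameter count doubles:
    one mean and one rho (which parameterizes the standard deviation)
    per weight and bias."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_rho = nn.Parameter(torch.full((out_features,), -3.0))

    def forward(self, x):
        # sigma = log(1 + exp(rho)) keeps the standard deviation positive.
        w_sigma = F.softplus(self.w_rho)
        b_sigma = F.softplus(self.b_rho)
        # Reparameterization trick: sample weight = mu + sigma * eps so that
        # gradients flow to mu and rho (a Monte Carlo estimate of the gradient).
        w = self.w_mu + w_sigma * torch.randn_like(w_sigma)
        b = self.b_mu + b_sigma * torch.randn_like(b_sigma)
        return F.linear(x, w, b)

# Every forward pass samples a fresh set of weights, so repeated calls behave
# like an ensemble of networks drawn from the learnt distributions.
layer = BayesianLinear(4, 2)
x = torch.randn(8, 4)
print(layer(x).shape)        # torch.Size([8, 2])
print(layer(x) - layer(x))   # nonzero: two different weight samples
```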
References:
- Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, Daan Wierstra. Weight Uncertainty in Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning. 21-May-2015.
- Eric J. M. An Attempt At Demystifying Bayesian Deep Learning. On YouTube, uploaded by PyData. 22-Dec-2017. https://youtu.be/s0S6HFdPtlA
- Difference between naive Bayes & multinomial naive Bayes. On StackExchange, answered by jlund3. 9-Aug-2012. https://stats.stackexchange.com/a/34002
- Introduction to the Multinomial Distribution. On YouTube, uploaded by jbstatistics. 19-Dec-2012. https://youtu.be/syVW7DgvUaY
- Naive Bayes classifier. On Wikipedia. 18-Sep-2002. https://en.wikipedia.org/wiki/Naive_Bayes_classifier
- Carl Doersch. Tutorial on Variational Autoencoders. arXiv preprint arXiv:1606.05908. 19-Jun-2016. https://arxiv.org/abs/1606.05908
- Kullback-Leibler Divergence Explained. On the Count Bayesie blog. 10-May-2017. https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained