To understand Naïve Bayes, we need some basic knowledge of probability and Bayes Theorem. Let's consider an example:
A die is rolled and a coin is tossed. What is the probability of getting a head and a 4?
We know that the Probability P of an Event E can be calculated as P(E) = Number of favorable Outcomes / Total Number of Outcomes. Since the coin toss and the die roll are independent events, their probabilities multiply: P(head and 4) = (1/2) * (1/6) = 1/12.
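A minimal sketch verifying this arithmetic with Python's fractions module (the variable names are just illustrative):

```python
from fractions import Fraction

p_head = Fraction(1, 2)  # 1 favorable outcome (head) out of 2
p_four = Fraction(1, 6)  # 1 favorable outcome (a 4) out of 6

# Independent events: the joint probability is the product
p_head_and_four = p_head * p_four
print(p_head_and_four)  # 1/12
```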
There are a few different types of probabilities that we need to understand before starting with the Naïve Bayes algorithm (a short sketch after this list makes them concrete):
- Conditional Probability: the probability of an Event A occurring given that another Event B has occurred.
- Joint Probability: the probability of two or more events happening simultaneously.
- Marginal Probability: the probability of an Event irrespective of the outcomes of other Events.
- Proportionality: the relationship between two quantities that are multiplicatively connected to a constant.
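To make the first three ideas concrete, here is a small sketch over a hypothetical joint distribution of two binary Events A and B; the numbers are made up purely for illustration:

```python
from fractions import Fraction

# Hypothetical joint distribution P(A, B) over two binary events
joint = {
    (True, True):   Fraction(3, 10),
    (True, False):  Fraction(1, 10),
    (False, True):  Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

# Joint Probability: P(A=True, B=True)
p_a_and_b = joint[(True, True)]

# Marginal Probability: P(A=True), irrespective of B
p_a = sum(p for (a, _), p in joint.items() if a)

# Marginal Probability: P(B=True)
p_b = sum(p for (_, b), p in joint.items() if b)

# Conditional Probability: P(A=True | B=True) = P(A, B) / P(B)
p_a_given_b = p_a_and_b / p_b

print(p_a_and_b, p_a, p_a_given_b)  # 3/10 2/5 3/5
```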
Bayes Theorem
The Naïve Bayes Classifier is based on Bayes Theorem with the “Naïve Assumption” that the features are independent of each other. To understand this assumption, we shall look into what Bayes Theorem states:
- Posterior Probability: the probability of Event A occurring given that Event B has already occurred.
- Prior Probability: the probability of Event A occurring before any relevant evidence is taken into account.
- Likelihood: the probability of Event B occurring given that Event A has already occurred.
- Marginalization: the probability of Event B occurring.
Given a feature vector X = (x₁, x₂, …, xₙ) and a class variable Cₖ, Bayes Theorem states that:
P(Cₖ|X) = P(X|Cₖ) * P(Cₖ) / P(X)
Here, P(Cₖ|X) is the Posterior Probability, P(X|Cₖ) the Likelihood, P(Cₖ) the Prior Probability, and P(X) the Marginalization factor, also called the Prior Probability of the predictor.
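As a quick numeric illustration (the values below are made up, not from any dataset), the posterior is just the likelihood times the prior, divided by the evidence:

```python
# Hypothetical numbers, purely to illustrate the formula
prior = 0.3        # P(Cₖ)
likelihood = 0.5   # P(X|Cₖ)
evidence = 0.25    # P(X)

posterior = likelihood * prior / evidence  # P(Cₖ|X)
print(posterior)  # 0.6
```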
Using the chain rule, we can expand the likelihood P(X|Cₖ) as:
P(X|Cₖ) = P(x₁,…,xₙ|Cₖ) = P(x₁|x₂,…,xₙ, Cₖ)*P(x₂|x₃,…,xₙ, Cₖ)*…*P(xₙ₋₁|xₙ, Cₖ)*P(xₙ|Cₖ)
The Naïve Assumption
Calculating P(X|Cₖ) with the formula above is difficult and expensive, because each factor is conditioned on many features at once. Instead, Naïve Bayes assumes that the features are conditionally independent given the class, which makes the Posterior Probability much easier to compute:
Assuming: P(xᵢ|xᵢ₊₁,…,xₙ, Cₖ) = P(xᵢ|Cₖ) for every i,
we can conclude: P(X|Cₖ) = P(x₁,…,xₙ|Cₖ) = ∏ᵢ₌₁ⁿ P(xᵢ|Cₖ)
Thus, the Posterior Probability can be written as:
P(Cₖ|X) = P(Cₖ) * ∏ᵢ₌₁ⁿ P(xᵢ|Cₖ) / P(X)
Here, P(X) is the Prior Probability of the predictor, which is constant for any given input, so we can drop it and say:
P(Cₖ|X) ∝ P(Cₖ) * ∏ᵢ₌₁ⁿ P(xᵢ|Cₖ)
Finally, we classify a sample by choosing the class Cₖ that maximizes this product of the prior and the likelihoods:
Ĉ = argmaxₖ P(Cₖ) * ∏ᵢ₌₁ⁿ P(xᵢ|Cₖ)
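To tie the derivation together, here is a minimal sketch of this decision rule for categorical features, estimating priors and likelihoods by simple counting. The function names and the toy weather data are illustrative assumptions, not part of any standard library:

```python
from collections import Counter, defaultdict

def train(samples, labels):
    """Estimate priors P(Cₖ) and likelihoods P(xᵢ|Cₖ) by counting."""
    n = len(labels)
    priors = {k: c / n for k, c in Counter(labels).items()}
    # counts[k][i][v] = number of class-k samples whose feature i equals v
    counts = defaultdict(lambda: defaultdict(Counter))
    for x, k in zip(samples, labels):
        for i, v in enumerate(x):
            counts[k][i][v] += 1
    return priors, counts

def predict(x, priors, counts):
    """Pick the class maximizing P(Cₖ) * ∏ᵢ P(xᵢ|Cₖ)."""
    best_class, best_score = None, -1.0
    for k, prior in priors.items():
        class_total = sum(counts[k][0].values())  # samples in class k
        score = prior
        for i, v in enumerate(x):
            score *= counts[k][i][v] / class_total  # P(xᵢ|Cₖ)
        if score > best_score:
            best_class, best_score = k, score
    return best_class

# Hypothetical toy data: features = (outlook, windy), class = play / stay
X = [("sunny", "no"), ("sunny", "yes"), ("rainy", "yes"), ("rainy", "no")]
y = ["play", "play", "stay", "stay"]

priors, counts = train(X, y)
print(predict(("sunny", "no"), priors, counts))  # -> play
```

In practice, implementations work with sums of log-probabilities instead of raw products to avoid floating-point underflow, and apply Laplace smoothing so that an unseen feature value does not zero out the whole product.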
Types of Naïve Bayes
There are three types of Naïve Bayes algorithm commonly used in practice (a brief scikit-learn sketch follows this list):
- Multinomial Naïve Bayes: assumes that each P(xᵢ|Cₖ) follows a multinomial distribution. It is mainly used in document classification problems and looks at the frequency of words.
- Bernoulli Naïve Bayes: similar to Multinomial Naïve Bayes, except that the predictors are Boolean (a word is present or absent).
- Gaussian Naïve Bayes: assumes that continuous feature values are sampled from a Gaussian distribution.
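All three variants are implemented in scikit-learn under sklearn.naive_bayes (MultinomialNB, BernoulliNB, and GaussianNB) and share the same fit/predict interface. Assuming scikit-learn is installed, a minimal Gaussian example with made-up data looks like this:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny made-up dataset: two continuous features, two well-separated classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 4.0], [4.1, 3.7]])
y = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X, y)
print(model.predict([[1.1, 2.0], [4.0, 3.9]]))  # -> [0 1]
```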