**Naive Bayes** is a probabilistic machine learning algorithm based on **Bayes' Theorem**, used in a wide variety of classification tasks.

*In this article, we will go through the Naive Bayes algorithm and its essential concepts so that there is no room for doubt in understanding it.*

*Naive Bayes is a simple but surprisingly powerful probabilistic machine learning algorithm used for predictive modeling and classification tasks.*

Some typical applications of Naive Bayes are **spam filtering**, **sentiment prediction**, and **classification of documents**. It is a popular algorithm mainly because it can be easily written in code and predictions can be made quickly, which in turn increases the scalability of the solution.

The Naive Bayes algorithm is traditionally considered the algorithm of choice for practical applications, mostly in cases where instantaneous responses are required for users' requests.

It is based on the works of the Rev. Thomas Bayes, hence the name. Before starting off with Naive Bayes, it is important to learn about Bayesian learning, 'Conditional Probability', and 'Bayes' Rule'.

Bayesian learning is a *supervised learning technique* where the goal is to build a model of the distribution of class labels given a concrete definition of the target attribute. Naive Bayes is based on applying Bayes' theorem with the **naive** assumption of independence between each and every pair of features.

*Let us start with the primitives by understanding Conditional Probability with some examples:*

## Example-I

*Consider you have a coin and fair dice. When you flip a coin, there is an equal chance of getting either a head or a tail. So you can say that the probability of getting heads or the probability of getting tails is 50%.*

Now if you roll the fair dice, the probability of getting 1 out of the 6 numbers would be 1/6 = 0.166. The probability will also be the same for other numbers on the dice.

## Example-II

*Consider another example of playing cards. You are asked to pick a card from the deck. Can you guess the probability of getting a king given the card is a heart?*

The given condition here is that the card is a heart, so the denominator has to be 13 *(there are 13 hearts in a deck of cards)* and not 52. Since there is only one king in hearts, so the probability that the card is a king given it is a heart is **1/13 = 0.077**.
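This conditional probability can be checked by brute-force enumeration of a deck. A minimal sketch, where the rank and suit labels are just illustrative names:

```python
from itertools import product

# Enumerate a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

# P(king | heart) = P(king and heart) / P(heart)
hearts = [card for card in deck if card[1] == "hearts"]
kings_of_hearts = [card for card in hearts if card[0] == "K"]

p_heart = len(hearts) / len(deck)                    # 13/52
p_king_and_heart = len(kings_of_hearts) / len(deck)  # 1/52
p_king_given_heart = p_king_and_heart / p_heart

print(round(p_king_given_heart, 3))  # → 0.077
```

The enumeration recovers the same 1/13 ≈ 0.077 obtained by counting hearts directly.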

*So the **conditional probability** of **A given B** refers to the probability of the occurrence of A given that B has already occurred. This is a typical example of conditional probability.*

*Mathematically*, the *conditional probability* of A given B can be defined as:

`P(A|B) = P(A and B) / P(B)`

*Bayes' Theorem helps you find the probability of an event based on prior knowledge of conditions related to that event. Its uses are mainly found in probability theory and statistics.*

Consider, for example, the probability that the price of a house is high: it can be estimated better with some prior information, such as the facilities around it, compared to an assessment made without any knowledge of the house's location.

`P(A|B) = [P(B|A) P(A)] / P(B)`

*The equation above shows the basic representation of the Bayes’ theorem where A and B are two events and:*

**P(A|B)**: The conditional probability that event A occurs, given that B has occurred. This is termed the *posterior probability*.

**P(A) and P(B)**: The probabilities of A and B without any correspondence with each other.

**P(B|A)**: The conditional probability of the occurrence of event B, given that A has occurred.

*The Bayes’ Theorem can be reformulated in correspondence with the machine learning algorithm as:*

`posterior = (prior x likelihood) / (evidence)`
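As a quick sketch of this formula, here is the posterior computed from assumed toy numbers for a hypothetical spam-filtering setup (none of these values come from the article):

```python
# Illustrative (assumed) numbers for a spam filter:
p_spam = 0.3             # prior P(spam)
p_word_given_spam = 0.6  # likelihood P(word appears | spam)
p_word = 0.25            # evidence P(word appears)

# posterior = (prior x likelihood) / evidence
posterior = (p_spam * p_word_given_spam) / p_word
print(round(posterior, 2))  # → 0.72
```

Seeing the word raises the probability of spam from the prior 0.3 to a posterior of 0.72.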

Consider a situation where the number of attributes is n and the response is a Boolean value, i.e. either True or False. The attributes are categorical (2 categories in this case). You need to train the classifier for all the values in the instance and response space.

This is practically not possible in most machine learning settings since you need to compute **2∗(2^n − 1)** parameters to learn this model. *This means that for 30 Boolean attributes, you would need to learn more than 2 billion parameters, which is unrealistic.*
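The parameter count can be checked directly; a small sketch of the 2∗(2^n − 1) formula above:

```python
# Parameters needed for a full joint model over n Boolean attributes
# and a Boolean response: 2 * (2**n - 1).
def full_joint_params(n: int) -> int:
    return 2 * (2 ** n - 1)

print(full_joint_params(30))  # → 2147483646
```

The count grows exponentially in n, which is exactly why the full joint model is intractable and an independence assumption is needed.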

Naive Bayes classifiers in machine learning are a family of simple probabilistic machine learning models that are based on Bayes’ Theorem. *In simple words, it is a classification technique with an assumption of independence among predictors.*

The Naive Bayes classifier reduces the complexity of the Bayesian classifier by making an assumption of **conditional independence** over the training dataset.

Consider you are given variables X, Y, and Z. X is conditionally independent of Y given Z if and only if the probability distribution of X is independent of the value of Y given Z. This is the assumption of conditional independence.

In other words, you can also say that X and Y are conditionally independent given Z if and only if, given that Z occurs, the knowledge of the occurrence of X provides no information on the likelihood of the occurrence of Y, and vice versa. This assumption is the reason behind the term *naive* in Naive Bayes.

*The likelihood can be written considering n different attributes as:*

`P(X₁, ..., Xₙ | Y) = ∏ᵢ₌₁ⁿ P(Xᵢ | Y)`

*In the mathematical expression, **X** represents the attributes and **Y** represents the response variable. So, P(X|Y) becomes equal to the product of the probability distributions of each attribute given Y.*
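Under the independence assumption, the likelihood is just a running product over the per-attribute probabilities. A minimal sketch with assumed values:

```python
# Assumed per-attribute probabilities P(X1|Y), P(X2|Y), P(X3|Y)
# for one observed instance, given a fixed class Y.
p_xi_given_y = [0.8, 0.5, 0.9]

# Naive independence: P(X1..Xn | Y) = product of P(Xi | Y)
likelihood = 1.0
for p in p_xi_given_y:
    likelihood *= p

print(round(likelihood, 2))  # → 0.36
```

Each attribute contributes one factor, so only n parameters per class are needed instead of one per joint configuration.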

**Maximizing a Posteriori**

If you want to find the posterior probability of **P(Y|X)** for multiple values of Y, you need to calculate the expression for all the different values of Y.

Let us assume a new instance variable X_NEW. You need to calculate the probability that Y will take any value given the observed attributes of X_NEW and given the distributions P(Y) and P(X|Y), which are estimated from the training dataset.

In order to predict the response variable depending on the different values obtained for **P(Y|X)**, you need to consider the *most probable value*, i.e. the *maximum of the values*.

Hence, this method is known as *maximizing a posteriori*.
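A minimal sketch of this idea with assumed toy priors and likelihoods: the predicted class is the one with the largest prior × likelihood product, since the evidence P(X) is the same for every class and can be dropped from the comparison.

```python
# Assumed class priors P(Y) and per-attribute likelihoods P(Xi | Y)
# for one observed instance (illustrative values only).
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": [0.6, 0.4], "ham": [0.1, 0.3]}

def unnormalized_posterior(y):
    # prior x likelihood; the evidence P(X) is omitted because it is
    # constant across classes and does not affect the argmax.
    score = priors[y]
    for p in likelihoods[y]:
        score *= p
    return score

prediction = max(priors, key=unnormalized_posterior)
print(prediction)  # → spam
```

Here spam scores 0.3 × 0.6 × 0.4 = 0.072 against ham's 0.7 × 0.1 × 0.3 = 0.021, so spam is the maximum-a-posteriori class despite its smaller prior.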

**Maximizing Likelihood**

You can simplify the Naive Bayes algorithm if you assume that the response variable is *uniformly distributed* which means that it is equally likely to get any response.

The advantage of this assumption is that the prior, P(Y), becomes a constant value.

Since the prior and the evidence are now independent of the response variable, they can be removed from the equation.

So, maximizing the posteriori reduces to a maximizing-the-likelihood problem.

Consider a situation where you have 1000 fruits which are either 'banana', 'apple', or 'other'. These will be the possible classes of the variable Y.

*The data consists of the following X variables, all of which are binary (0 or 1):*

*The training dataset will look like this:*