Today, we will see how capable a very “naive” algorithm known as Naive Bayes really is. We will also discuss its implementation (from scratch) on a dataset containing both categorical and continuous values.
Naive Bayes is a supervised learning algorithm used for classification problems. It comes from the family of “probabilistic classifiers” and applies Bayes theorem to classify objects under the assumption that all the features are independent of each other. This assumption is what makes it “naive”, because in real life it is highly unlikely to find a dataset with no multicollinearity or zero correlation among the predictors.
Despite this assumption, it does a pretty good job of classifying objects.
So before we implement our Naive Bayes algorithm, we should first understand what Bayes theorem is all about.
Before we learn about Bayes theorem, we need to understand the concept of conditional probability.
Conditional Probability:
Conditional probability means the probability of an event occurring given that some condition is true. It is represented as P(event|condition).
Let’s understand it using a “very sophisticated” example.
Suppose a person wants to punch a wall (I don’t know why, but moving on). What is the probability that it will hurt, given that he punches the wall? Here, the event whose probability we want is “hurt hands” and the condition is “punching the wall”.
Our example would be represented as P(“hurt hands”|“punching a wall”).
It should be noted that,
P(“hurt hands”|“punching a wall”) ≠ P(“punching a wall”|“hurt hands”)
The probability of a person hurting his hand after punching a wall is not the same as the probability of him punching a wall given that his hand is hurting. (Just read it a few times and it will make sense.)
Coming back to Bayes Theorem…
This theorem was introduced by Reverend Thomas Bayes.
It gives us a way of calculating conditional probability. Mathematically, it can be described as:
P(A|B) = P(B|A)*P(A)/P(B)
Many times, we are not given the value of the denominator P(B) (this term is also known as the evidence). We can calculate it using the following equation:
P(B) = P(B|A)*P(A) + P(B|not A)*P(not A)
Note that:
- P(not A) = 1-P(A)
- P(B|not A) = 1-P(not B|not A)
Using the above equation, we have a new formulation of the Bayes theorem as,
P(A|B) = P(B|A)*P(A)/(P(B|A)*P(A) + P(B|not A)*P(not A))
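To make the formula concrete, here is a tiny Python sketch of this expanded form. The numbers in the usage example are made up purely for illustration:

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) via Bayes theorem, expanding the evidence P(B)."""
    evidence = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)  # P(B)
    return p_b_given_a * p_a / evidence

# Hypothetical numbers, just to show the call:
print(bayes_posterior(p_b_given_a=0.9, p_a=0.1, p_b_given_not_a=0.2))  # ≈ 0.333
```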
Now let’s apply this to a small example dataset, where we try to predict whether a person has the flu from four symptoms: “chills”, “runny nose”, “headache” and “fever”. The steps are:
- Find the different classes/groups in our response variable. In our example, the response variable is “flu?” and it is divided into 2 classes: “Yes” and “No”.
- For each predictor variable, we determine the different values it can take. For example, the “chills”, “fever” and “runny nose” columns take the values “Yes” and “No”, while the “headache” column takes the values “Mild”, “No” and “Strong”.
- For each value of a particular predictor, we find its “likelihood” of being classified into a certain class of the response variable. For example,
given that someone has the flu, what is the probability that they have a mild headache, i.e. P(headaches= “mild”|flu = “yes”)?
The total number of rows with headaches = “mild” is 3. Out of these three rows, 2 are classified into the “yes” class of the “flu?” column.
Hence, P(headaches= “mild”|flu = “yes”) = 2/3
Similarly, we calculate the others as follows:
Given that flu = “yes” (likelihoods):
- P(headaches= “mild”|flu = “yes”) = 2/3,
- P(headaches= “no” |flu = “yes”) = 1/2
- P(headaches= “strong”|flu = “yes”)=2/3
- P(chills= “yes”|flu = “yes”) = 3/4 and P(chills= “no”|flu = “yes”) = 2/4
- P(runny nose= “yes”|flu = “yes”) = 4/5 and P(runny nose= “no”|flu = “yes”) = 1/3
- P(fever= “yes”|flu = “yes”) = 4/5 and P(fever= “no”|flu = “yes”) = 1/3
Given that flu = “no” (likelihoods):
- P(headaches= “mild”|flu = “no”) = 1/3
- P(headaches= “no”|flu = “no”) = 1/2
- P(headaches= “strong”|flu = “no”)=1/3
- P(chills= “yes” |flu = “no”) = 1/4 and P(chills= “no”|flu = “no”) = 2/4
- P(runny nose= “yes” |flu = “no”) = 1/5 and P(runny nose= “no”|flu = “no”) = 2/3
- P(fever= “yes”|flu = “no”) = 1/5 and P(fever= “no”|flu = “no”) = 2/3
Other probabilities:
- P(flu = “yes”) = 5/8 and P(flu = “no”) = 3/8
- P(chills = “yes”) = 4/8 and P(chills= “no”) = 4/8
- P(headache= “mild”) = 3/8 ; P(headache= “strong”) = 3/8 and P(headache= “no”) = 2/8
- P(runny nose = “yes”) = 5/8 and P(runny nose = “no”) = 3/8
- P(fever = “yes”) = 5/8 and P(fever= “no”) = 3/8
Example: finding the probability of having the flu given a runny nose (the posterior) using Bayes theorem:
- P(flu= “yes”|runny nose= “yes”) =
P(runny nose= “yes”|flu= “yes”)*P(flu= “yes”) / P(runny nose= “yes”)
= ((4/5)*(5/8))/(5/8) = 4/5
Now, we will predict whether a person with the following conditions will have the flu: chills = “yes”, runny nose = “no”, headache = “mild” and fever = “no”.
Let this set of conditions be X. We need to calculate:
- P(flu = “yes”|X) = P(X|flu = “yes”)* P(flu= “yes”)/P(X)
- P(flu = “no”|X) = P(X|flu = “no”)* P(flu= “no”)/P(X)
Since P(X) is common to both, we can ignore it and simply compare the numerators.
Calculating P(flu= “yes”|X) ∝ 5/8 * (3/4 * 1/3 * 2/3 * 1/3) ≈ 0.0347, by multiplying the values given below:
- P(flu = “yes”) = 5/8
- P(chills= “yes”|flu = “yes”) = 3/4
- P(runny nose= “no”|flu = “yes”) = 1/3
- P(headaches= “mild”|flu = “yes”) = 2/3
- P(fever= “no”|flu = “yes”) = 1/3
The joint likelihood P(X|flu = “yes”) above was calculated using the naive independence assumption:
P(X₁, X₂, X₃,…Xₙ| “yes”) = P(X₁|“yes”)*P(X₂|“yes”)*P(X₃|“yes”)*…*P(Xₙ|“yes”)
Similarly, we calculate for flu = “no”:
Calculating P(flu= “no”|X) ∝ (3/8)*(1/4)*(2/3)*(1/3)*(2/3) ≈ 0.0139, by multiplying the values given below:
- P(flu = “no”) = 3/8
- P(chills= “yes” |flu = “no”) = 1/4
- P(runny nose= “no”|flu = “no”) = 2/3
- P(headaches= “mild”|flu = “no”) = 1/3
- P(fever= “no”|flu = “no”) = 2/3
Since P(flu= “no”|X) < P(flu= “yes”|X) (0.0139 < 0.0347), it is highly likely that the person has the flu for the given set of conditions.
Oof that is a lot of calculations.
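If you would rather let Python keep track of the arithmetic, here is a minimal sketch that multiplies the same likelihoods listed above and compares the two unnormalized posteriors:

```python
# Likelihoods taken from the tables above, for
# X = (chills="yes", runny nose="no", headache="mild", fever="no")
score_yes = (5/8) * (3/4) * (1/3) * (2/3) * (1/3)   # P(flu="yes") * P(X|flu="yes")
score_no  = (3/8) * (1/4) * (2/3) * (1/3) * (2/3)   # P(flu="no")  * P(X|flu="no")

print(round(score_yes, 4), round(score_no, 4))      # 0.0347 0.0139
print("flu = yes" if score_yes > score_no else "flu = no")   # flu = yes
```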
All the conditional-probability calculations above were done for categorical data. They won’t work for continuous values, since we cannot count occurrences of every possible real number. This is where Gaussian Naive Bayes comes in: we assume that each continuous feature follows a normal (Gaussian) distribution within each class.
If we want to calculate the likelihood of a data point given a condition y = c:
- First, find the rows which satisfy y = c.
- Calculate the mean and standard deviation of each feature/column over this set of rows.
- For each data point in this set of rows, plug the feature value x, together with that column’s mean μ and standard deviation σ, into the Gaussian density: P(x|y = c) = (1/(σ*√(2π))) * exp(-(x-μ)²/(2σ²)).
- These per-column values are then multiplied to find the likelihood of that data point (see the sketch after this list), by applying P(X|Y=c) = P(X₁, X₂, X₃,…Xₙ| Y = c) = P(X₁|Y = c)*P(X₂|Y = c)*P(X₃|Y = c)*…*P(Xₙ|Y = c)
where, X={X₁, X₂, X₃,…Xₙ} are the features/independent variables.
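As a quick illustration of these four steps, here is a minimal NumPy sketch. The tiny arrays are made-up values, there only to make it runnable:

```python
import numpy as np

def gaussian_density(x, mean, std):
    # Gaussian (normal) probability density for a value x
    return np.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (np.sqrt(2 * np.pi) * std)

# Made-up continuous features and labels, purely for illustration
X = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [6.5, 3.0]])
y = np.array(["A", "A", "B", "B"])

c = "A"
rows = X[y == c]                                     # step 1: rows with y = c
means, stds = rows.mean(axis=0), rows.std(axis=0)    # step 2: per-column mean/std

point = np.array([5.0, 3.2])
densities = gaussian_density(point, means, stds)     # step 3: per-column densities
likelihood = densities.prod()                        # step 4: P(X|y=c) under independence
print(likelihood)
```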
- Import the numpy, math and seaborn libraries.
- Load the “penguins” dataset using seaborn.load_dataset(“penguins”), as in the snippet below.
- For this dataset, our response variable is “species” and the rest of the columns are the features.
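The setup looks roughly like this (seaborn.load_dataset returns a pandas DataFrame):

```python
import math
import numpy as np
import seaborn as sns

# Load the penguins dataset as a pandas DataFrame
df = sns.load_dataset("penguins")
print(df.head())
print(df.shape)   # response variable: "species"; the rest are features
```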
Before implementing Naive Bayes on the dataset, I checked for null values:
Some rows had 5 null values, while some rows only had a null under the “sex” column.
- First, I found the mode of the “sex” column and replaced the null values in this column with it. In my case, the mode was “Male”.
- Any rows that still contained null values were then dropped (see the sketch below).
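A minimal sketch of that null handling, assuming the DataFrame is called df as above:

```python
print(df.isnull().sum())                    # inspect null counts per column

mode_sex = df["sex"].mode()[0]              # most frequent value in the "sex" column
df["sex"] = df["sex"].fillna(mode_sex)      # fill nulls in "sex" with the mode

df = df.dropna().reset_index(drop=True)     # drop any rows that still contain nulls
```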
I split the dataset into training and testing sets in an 80:20 ratio, after converting it into a NumPy array (I just find it easier to work with):
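Here is one way such a split could look, continuing from the snippets above and assuming “species” is the first column (as it is in the seaborn penguins dataset); the random seed is just an illustrative choice:

```python
data = df.to_numpy()                     # convert the DataFrame to a NumPy array

x = np.delete(data, 0, axis=1)           # features: everything except "species"
y = data[:, 0]                           # response variable: "species"

np.random.seed(0)                        # hypothetical seed, for reproducibility
idx = np.random.permutation(len(x))
split = int(0.8 * len(x))                # 80:20 split
train_idx, test_idx = idx[:split], idx[split:]

x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]
```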
On the training data, we find the mean and standard deviation for continuous columns and the likelihoods for categorical columns. These values are then used on the testing data to find the predicted values.
stats_by_class(x, y):
divides the dataset by the classes of the response variable, then returns the mean and standard deviation for continuous columns and the log-likelihoods for categorical columns (a sketch follows this list).
arguments: the x_train and y_train sets.
- It first finds all the unique classes of our response variable.
- It initializes a dictionary “groups” that maps each class to all the rows with y = <Class_name>.
- “stats” stores, for each class, the mean and standard deviation of each continuous column, the column indices (to identify the columns with continuous values) and the number of rows.
- “category_prob” stores the log-likelihoods for the columns containing categorical data.
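Here is a minimal sketch of how stats_by_class() could be put together from the helpers described below. The exact data layout is my assumption: stats[cls] holds (means, stds, continuous column indices, row count).

```python
def stats_by_class(x, y):
    """Group the training rows by class and summarize each group."""
    classes = np.unique(y)
    groups = {cls: x[y == cls] for cls in classes}            # rows belonging to each class

    stats, category_prob = {}, {}
    for cls, rows in groups.items():
        means, stds, cont_cols = find_stats(rows)             # continuous-column summaries
        stats[cls] = (means, stds, cont_cols, len(rows))      # stats[cls][3] = row count
        category_prob[cls] = find_category_prob(x, y, cls)    # categorical log-likelihoods
    return stats, category_prob
```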
find_stats(x):
argument: x is the subset of the dataset whose response variable belongs to a particular class.
It checks whether each column contains string data. If a column is not of string type, we find its mean and standard deviation.
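A sketch of what this could look like, following the description above:

```python
def find_stats(rows):
    """Mean/std of every non-string (continuous) column in this class's rows."""
    means, stds, cont_cols = [], [], []
    for col in range(rows.shape[1]):
        values = rows[:, col]
        if not isinstance(values[0], str):        # skip categorical (string) columns
            values = values.astype(float)
            means.append(values.mean())
            stds.append(values.std())
            cont_cols.append(col)                 # remember which columns are continuous
    return means, stds, cont_cols
```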
gaussian_prob(x_i, mean, std):
finds the probability density of x_i using the Gaussian distribution expression, where x_i is equivalent to x[row][column], and the mean and standard deviation passed in are those of that particular column.
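In code, this is just the Gaussian density formula from earlier:

```python
def gaussian_prob(x_i, mean, std):
    """Gaussian probability density of x_i given a column's mean and std."""
    exponent = math.exp(-((float(x_i) - mean) ** 2) / (2 * std ** 2))
    return exponent / (math.sqrt(2 * math.pi) * std)
```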
find_category_prob(x, y, cls):
argument: here, cls is the class of the response variable.
finds the log-likelihood of each categorical predictor value given that the row is classified under class cls (sketched below).
Again, we check the data type of each column; if it contains string data, then we find:
- the unique values in that column,
- for each value, the number of rows with that value where y_train[row] = cls,
- divide the above count by the total count of that value in the column,
- then take its log and store it in a probabilities dictionary whose key is the column index.
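A sketch following those steps (the small epsilon for empty counts is my own addition, to avoid taking log of zero):

```python
def find_category_prob(x, y, cls):
    """Log-likelihoods for categorical columns, computed as in the flu example:
    count(value and class) / count(value), then log."""
    probabilities = {}
    for col in range(x.shape[1]):
        if isinstance(x[0][col], str):                    # categorical column
            probabilities[col] = {}
            for value in np.unique(x[:, col]):
                in_value = x[:, col] == value
                both = np.sum(in_value & (y == cls))      # rows with this value and y = cls
                total = np.sum(in_value)                  # rows with this value overall
                # epsilon guard against log(0) is an assumption, not in the post
                probabilities[col][value] = math.log(both / total) if both else math.log(1e-9)
    return probabilities                                  # {column index: {value: log-likelihood}}
```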
find_prob_for_class(stats, test_row, category_prob=None):
This function finds the posterior values for P(Y=cls|X), where X is the set of conditions presented by our test row.
We call gaussian_prob() to find the likelihoods of the continuous columns and take their log values. We add all these log values to the log-likelihood values provided by the category_prob dictionary and to the log of the class probability.
We find the probability of a class by dividing stats[cls][3] (the number of rows of that class) by the total number of rows in the dataset.
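A sketch of that logic, using the stats layout assumed earlier:

```python
def find_prob_for_class(stats, test_row, category_prob=None):
    """Log-posterior (up to the shared evidence term P(X)) for every class."""
    total_rows = sum(stats[cls][3] for cls in stats)
    log_posteriors = {}
    for cls, (means, stds, cont_cols, n_rows) in stats.items():
        log_p = math.log(n_rows / total_rows)             # log of the class probability
        for i, col in enumerate(cont_cols):               # continuous columns
            log_p += math.log(gaussian_prob(test_row[col], means[i], stds[i]))
        if category_prob is not None:                     # categorical columns
            for col, value_probs in category_prob[cls].items():
                # unseen categories fall back to a tiny probability (my assumption)
                log_p += value_probs.get(test_row[col], math.log(1e-9))
        log_posteriors[cls] = log_p
    return log_posteriors
```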
predict(stats, test_row, category_prob=None):
It gets the log-posterior values from the find_prob_for_class() function and returns the class with the maximum posterior value.
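And a sketch of the final step, plus how the pieces could be tied together on the test set:

```python
def predict(stats, test_row, category_prob=None):
    """Return the class with the highest log-posterior for this test row."""
    log_posteriors = find_prob_for_class(stats, test_row, category_prob)
    return max(log_posteriors, key=log_posteriors.get)

# Putting it together (sketch):
stats, category_prob = stats_by_class(x_train, y_train)
predictions = [predict(stats, row, category_prob) for row in x_test]
accuracy = np.mean(np.array(predictions) == y_test)
print(f"Accuracy: {accuracy:.3f}")
```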
Anyways, that wraps up our blog. I hope it gives you a bit of a head start on this topic. Thank you for taking the time to read this!!!
Have a good day 😄!!!
P.S. I am all ears for all the corrections and suggestions. Feel free to point out if I did anything wrong and how I can do better.