Taking the Naive Approach to Build a Spam Classifier

Have you ever wondered how your email service provider classifies a mail as spam or not spam almost immediately after you receive it? Or how the recommendations on online e-commerce platforms change so quickly in response to real-time user actions? These are some of the real-life scenarios where the Naïve Bayes classifier is put into action.
Naive Bayes is a supervised classification algorithm used primarily for binary and multi-class classification problems, though with some modifications it can also be applied to regression. It is one of the simplest algorithms for classification, and works especially well when the dataset has few data points.
In this article, we will first look at the mathematical concepts behind Naïve Bayes, then at the different types of Bayes classifiers, and once we have the gist of what the Naïve Bayes classifier actually is, we will build our very own classifier. Also, do read my previous articles to get a feel for the other classification algorithms.
Naïve Bayes is one of the simplest and most widely employed supervised classification algorithms. It is an intuitive classification algorithm based on the principles of Bayes' theorem, named after Reverend Thomas Bayes, a statistician.
A Naive Bayes model is easy to build and is particularly useful for smaller datasets. Apart from its simplicity, Naive Bayes is also known to sometimes outperform far more intricate predictive models, owing to its speed and accuracy. The algorithm performs exceptionally well on text projects such as sentiment analysis, spam detection and document categorization. The three main types of Naive Bayes algorithms are:
- Gaussian Naive Bayes: commonly used when the features follow a Gaussian (normal) distribution. It requires calculating the mean and standard deviation of each feature.
- Multinomial Naive Bayes: Used for multinomially distributed data. This is suitable for classification with discrete features.
- Bernoulli Naive Bayes: Used for multivariate Bernoulli distributions. It requires the data points to be treated as binary-valued feature vectors.
Another variant of Naïve Bayes is Complement Naïve Bayes (CNB), which tends to work better than its counterpart when the classes in the training set are imbalanced. The Complement Naive Bayes classifier addresses this weakness of the standard Naive Bayes classifier by estimating parameters from the data of all classes except the one being evaluated.
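For reference, all four variants are available in scikit-learn. The snippet below is a minimal sketch showing how each one is instantiated; the tiny arrays are made-up toy data, purely for illustration.
# Minimal sketch: the four Naive Bayes variants in scikit-learn (toy data for illustration only)
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB

X_cont = np.array([[1.2, 3.4], [2.1, 0.5], [0.3, 1.8], [4.0, 2.2]])  # continuous features
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4], [0, 2, 1]])    # discrete word counts
X_binary = (X_counts > 0).astype(int)                                # word present / absent
y_toy = np.array([0, 1, 0, 1])

GaussianNB().fit(X_cont, y_toy)       # features assumed to be normally distributed
MultinomialNB().fit(X_counts, y_toy)  # discrete counts, e.g. word frequencies
BernoulliNB().fit(X_binary, y_toy)    # binary-valued feature vectors
ComplementNB().fit(X_counts, y_toy)   # better suited to imbalanced classes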
A Naive Bayes classifier can be trained faster than its sister algorithms, and it also makes faster predictions. It can be updated with new training data without having to rebuild the model from scratch.
As mentioned earlier, this algorithm is based on the principles of Bayes' theorem, which describes the probability of an event based on prior knowledge of conditions related to that event. The equation for Bayes' theorem is of the form:
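P(A|B) = [ P(B|A) × P(A) ] / P(B)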
where,
- P(A|B) is the probability of A given B. This is called the posterior probability.
- P(B|A) is the probability of data B given A.
- P(A) is the probability of A. This is called the prior probability of A.
- P(B) is the probability of the data.
Here, P(A|B) and P(B|A) are conditional probabilities: the probability of event A occurring given that B has already occurred, and vice versa. In situations where there can be more than one outcome, the formula used is:
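P(Aᵢ|B) = [ P(B|Aᵢ) × P(Aᵢ) ] / [ P(B|A₁) × P(A₁) + P(B|A₂) × P(A₂) + … + P(B|Aₙ) × P(Aₙ) ]
where A₁, A₂, …, Aₙ are the n possible mutually exclusive outcomes (for us, the classes).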
Training a Naïve Bayes classifier is faster than most of its sister algorithms, because only the probability of each class and the conditional probability of each input value given each class need to be calculated.
This algorithm is referred to as "naive" because it assumes that each attribute is independent of the other attributes, which may not be true in real life. In simple terms, a Naive Bayes classifier assumes that the presence or absence of a particular attribute (also referred to as a feature) of a class is unrelated to the presence of any other attribute of that class.
This assumption can be easily understood by means of an example. A flower may be considered a Lotus if it is pink, grows in water, and has a horizontal spread of 3 ft. Under Naïve Bayes, all of these features contribute independently to the probability that the flower is a Lotus, even if they actually depend on each other.
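Concretely, the classifier scores the Lotus class by simply multiplying the individual contributions:
P(Lotus | pink, aquatic, 3 ft spread) ∝ P(Lotus) × P(pink | Lotus) × P(aquatic | Lotus) × P(3 ft spread | Lotus)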
Despite this, the classifier works extremely well in many real-world situations, especially with small datasets. In most use cases it has performance comparable to neural networks and SVMs. And if the dataset's attributes really are independent, it may even produce better results than the Logistic Regression algorithm.
Now that you are comfortable with the concepts of Naïve Bayes, we can try and build our own Naïve Bayes Classifier. The code and other resources used for building this model can be found on my GitHub repository.
Step 1: Importing the Required Libraries and Datasets
In order to build the model, our first step is to import the required libraries. Jupyter Notebook and Python give us the flexibility to import these libraries at any point in our code. We need to import the Pandas and NumPy libraries to start building the model.
#Import the Libraries and read the data into a Pandas DataFrame
import pandas as pd
import numpy as np

# Note: file name and encoding below are assumed; point this at the spam dataset you downloaded
data = pd.read_csv("spam.csv", encoding="latin-1")
data.head()
Pandas is a fast and easy-to-use data analysis and manipulation tool built on top of NumPy. The read_csv function loads the dataset into our notebook as a pandas DataFrame, a two-dimensional data structure similar to a table.
In this example we will be trying to build a spam classifier that will be able to classify a given email as spam or not spam. The dataset for building the model can be downloaded from here.
Step 2: Exploring the Datasets and Text Analytics
After loading the dataset, our next step is to explore the insights hidden in it; the dataset we use to build any machine learning model is full of them. Missing values can be detected using the isnull function. Records with missing values are either deleted or have the gaps filled with the mean value of the corresponding column.
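A minimal sketch of that check is shown below (generic pandas calls; in our text dataset, mean-imputation would only apply to numeric columns):
# Sketch: detect and handle missing values (generic; column names depend on your dataset)
data.isnull().sum()                                   # count missing values per column
data = data.dropna()                                  # option 1: drop records with missing values
# data = data.fillna(data.mean(numeric_only=True))    # option 2: fill numeric gaps with the column mean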
#Exploring the Dataset
from collections import Counter

count1 = Counter(" ".join(data[data['v1']=='ham']["v2"]).split()).most_common(20)
df1 = pd.DataFrame.from_dict(count1)
df1 = df1.rename(columns={0: "words in non-spam", 1: "count"})

count2 = Counter(" ".join(data[data['v1']=='spam']["v2"]).split()).most_common(20)
df2 = pd.DataFrame.from_dict(count2)
df2 = df2.rename(columns={0: "words in spam", 1: "count_"})
The data is only as good as the question it is trying to answer; here, since we are trying to build a spam classifier, text analytics plays a major role. The data itself can be visualized using a variety of plots, such as pie charts and bar plots from the Matplotlib package, as in the sketch below.
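This minimal sketch plots the word counts computed in the previous snippet, using the df1 columns created there:
# Sketch: bar plot of the 20 most common words in non-spam messages
import matplotlib.pyplot as plt

df1.plot.bar(x="words in non-spam", y="count", legend=False)
plt.ylabel("count")
plt.title("Most common words in non-spam messages")
plt.tight_layout()
plt.show()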
Text analytics refers to the process of uncovering trends and patterns from data that is in the form of text. In this case, since we are classifying mail as spam or not spam, we will take the words present in the mails as the model's features.
Step 3. Data Pre-processing and Feature Engineering
Now that we have prepared our dataset, the next thing we need to do is select the features to be included when making predictions. Text preprocessing, tokenizing and filtering of stopwords are done in this step.
#Feature Engineering
from sklearn import feature_extraction

f = feature_extraction.text.CountVectorizer(stop_words='english')
X = f.fit_transform(data["v2"])
np.shape(X)
Stopwords are commonly used words that are usually ignored during text processing. Some of the most common stopwords include "the", "a", "an" and "in". These words carry little significance for classification but take up space in the database, so removing them is a crucial step for improving the analysis.
Using feature_extraction.text.CountVectorizer(stop_words='english') converts the documents into a matrix of token counts. When stop_words='english' is passed, a built-in English stop-word list is used. After removing the stopwords, the next step is to transform the variables into binary variables.
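One common way to do this, assuming the labels live in the v1 column as in the snippets above, is to map the ham/spam labels to 0 and 1 (the count matrix itself could also be binarised if a Bernoulli model were used):
# Map the ham/spam labels to binary values (column name as used in the snippets above)
data["v1"] = data["v1"].map({'ham': 0, 'spam': 1})
y = data["v1"].values

# If a Bernoulli model were used instead, the counts could be binarised as well:
# X_bin = (X > 0).astype(int)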
Step 4: Predictive Analysis and Building the Model
Now that we have selected the desired features and cleaned up the dataset, the next task is to split it into training and testing data. We will divide the dataset into 77% training data and 23% testing data with a random state of 42.
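The split itself is straightforward with scikit-learn. The variables list_alpha, models and best_index used in the next snippet are not defined in the text, so the sketch below shows one plausible way to construct them: a simple grid search over the smoothing parameter alpha, keeping the value with the best test precision. Treat it as an illustration rather than the exact procedure of the original notebook.
# Split the data: 77% training, 23% testing, random_state = 42
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes, metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.23, random_state=42)

# Sketch of a simple grid search over the smoothing parameter alpha
list_alpha = np.arange(1/100000, 20, 0.11)
scores = []
for alpha in list_alpha:
    clf = naive_bayes.MultinomialNB(alpha=alpha)
    clf.fit(X_train, y_train)
    scores.append([alpha,
                   clf.score(X_train, y_train),
                   clf.score(X_test, y_test),
                   metrics.precision_score(y_test, clf.predict(X_test))])
models = pd.DataFrame(scores, columns=['alpha', 'train accuracy', 'test accuracy', 'test precision'])
best_index = models['test precision'].idxmax()   # index of the alpha with the best test precision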
#Fitting the Model
bayes = naive_bayes.MultinomialNB(alpha=list_alpha[best_index])
bayes.fit(X_train, y_train)
models.iloc[best_index, :]
Here, our model has 100% test precision, i.e. it does not produce any false positives. From the confusion matrix, however, we find that it misclassified 56 spam messages as non-spam (false negatives).
This simple algorithm performed surprisingly well. Even though our classifier has decent accuracy, there is always room for improvement. If the continuous features are not normally distributed, we should first transform them using suitable techniques.
This way the Naïve Bayes classifier’s performance can be boosted significantly. Ensemble techniques such as bagging and boosting generally won’t have much effect on Naïve Bayes classifier, as there is no variance to be minimized.
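For example, if we were using Gaussian Naive Bayes on skewed continuous features (not the case in our count-based spam example), one possible sketch of such a transformation would be:
# Generic sketch: transform skewed continuous features before a Gaussian Naive Bayes model
from sklearn.preprocessing import PowerTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(PowerTransformer(), GaussianNB())
# model.fit(X_continuous, y)   # X_continuous, y are placeholder names for a numeric dataset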
Naïve Bayes is one of the most widely used classification algorithms. Some of the features that make it so popular are listed below:
- Naïve Bayes Classifier can be trained quickly and can generate predictions faster than other classifiers.
- It works well while handling multi-class problems.
- It requires much less training data than most classifiers and can perform exceptionally well when the assumption of feature independence holds true for the dataset.
Though widely used, there are certain limitations to the algorithm's performance. Some of them are mentioned below:
- The assumption of all features being independent of each other rarely holds true.
- If a feature value appears in the test data but was never observed together with a given class during training, the frequency-based probability estimate for that combination will be zero. This is known as the Zero-Frequency problem. It can, however, be avoided by using smoothing techniques, as illustrated after this list.
- Data scarcity can lead the algorithm into numerical instabilities, resulting in unreliable predictions.
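The smoothing mentioned in the second point is typically Laplace (additive) smoothing: a small constant is added to every count so that no conditional probability is ever exactly zero. With add-one smoothing and a vocabulary of n distinct words, the estimate becomes:
P(word | class) = (count(word, class) + 1) / (total words in class + n)
In scikit-learn this corresponds to the alpha parameter of MultinomialNB that we used earlier (alpha=1.0 gives add-one smoothing).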
To summarize what we have learned in this article: first we discussed the mathematical concepts behind Naïve Bayes and how it can be used to build classification models. We then discussed the different types of Bayes classifiers and why the algorithm is named the way it is.
We then continued our learning by building our very own classification model. To round off our learning, we discussed the advantages and limitations of this classification technique and how the performance of the algorithm can be improved.
With that, we have reached the end of this article. I hope it has helped you get a sense of how to solve classification problems using Naïve Bayes and when to use it in your Machine Learning journey. If you have any questions or if you believe I have made a mistake, feel free to contact me! You can get in touch with me via E-Mail or LinkedIn. Happy Learning!