Build a naive spam classifier using naive Bayes
In our daily lives, we get lots of emails. Some are useful and some are not. An unsolicited email sent in bulk is a spam email. We generally do not want spam, so spam classifiers move these emails to the spam folder before they reach our inbox.
According to Statista, around 29% of the emails sent in 2019 were spam. Studies suggest that spam impedes economic growth and costs billions of dollars of GDP. Rao and Reiley estimate that the economic loss would exceed $1 trillion if firms were not investing in anti-spam technology.
These statistics underscore the importance of spam filters. As machine learning and deep learning have advanced, spam filters have adopted them to protect customers, and they have been successful to a large extent. From saving email reading time to protecting customers from fraud, deceit, and phishing, spam filters have done excellent work in preventing losses and increasing efficiency.
Today, let’s scratch the surface of spam email classification using one of the simplest techniques, naive Bayes classification. Naive Bayes classifiers are based on Bayes’ theorem, which gives the probability of an event based on prior knowledge of conditions related to the event. They are called naive because they assume that the features (here, the words of an email) are independent of each other given the class. This is enough to build a naive but good enough spam classifier, and we will see how using scikit-learn (sklearn), a Python machine learning library.
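Written out for this problem, with w_1, ..., w_n standing for the words of an email (notation introduced here only for illustration), Bayes’ theorem combined with the naive independence assumption reads:

$$
P(\text{spam} \mid w_1, \dots, w_n) = \frac{P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})}{P(w_1, \dots, w_n)}
$$

The denominator is the same for both classes, so the classifier labels an email as spam when

$$
P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}) > P(\text{ham}) \prod_{i=1}^{n} P(w_i \mid \text{ham})
$$

where "ham" is shorthand for non-spam.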
At first, let’s import relevant libraries, sub-packages, modules, and classes.
In addition, let’s import some functions and classes from scikit-learn (sklearn), one of the most widely used libraries in data science.
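The exact import list is not shown here. A plausible set for the steps that follow, assuming pandas is used for loading the data, might be:

```python
# General-purpose library for loading and inspecting the dataset (assumed).
import pandas as pd

# scikit-learn pieces used in the rest of the walkthrough.
from sklearn.feature_extraction.text import CountVectorizer  # text -> counts
from sklearn.metrics import accuracy_score                    # evaluation
from sklearn.model_selection import train_test_split          # data split
from sklearn.naive_bayes import MultinomialNB                 # the classifier
```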
Now, let’s download the email dataset (around 5,500 rows) from the dataset URL, which I got from AIDevNepal’s GitHub repository. The dataset contains both non-spam and spam emails. Also, let’s convert the labels to numerical values: 1 for spam and 0 for non-spam.
Shall we peek into the data?
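The loading, label conversion, and peek might look like the sketch below. It uses a tiny hand-built DataFrame as a stand-in for the downloaded file, and the column names `label` and `message` and the `ham`/`spam` label strings are assumptions, not taken from the real dataset. With the actual data, a `pd.read_csv` call on the dataset URL would take the place of the hand-built frame.

```python
import pandas as pd

# Toy stand-in for the downloaded dataset (column names are assumed).
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam"],
    "message": [
        "see you at lunch tomorrow",
        "win a free prize now",
        "project report attached for review",
        "free cash offer click now",
    ],
})

# Convert text labels to numbers: 1 for spam, 0 for non-spam.
df["label"] = df["label"].map({"ham": 0, "spam": 1})

# Peek at the first rows and at the class balance.
print(df.head())
print(df["label"].value_counts())
```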
Before training, let’s divide the dataset into training and testing sets. By default, sklearn’s train_test_split keeps 75% of the data for training and holds out 25% for testing.
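A minimal sketch of the split on a toy list of eight messages (made up for illustration). When `test_size` is omitted, `train_test_split` holds out 25% of the rows; the `random_state` value is an arbitrary choice to make the shuffle repeatable.

```python
from sklearn.model_selection import train_test_split

# Eight made-up messages with labels (1 = spam, 0 = non-spam).
messages = [
    "win a free prize now", "meeting at noon tomorrow",
    "free cash offer click now", "see you at lunch tomorrow",
    "claim your free prize today", "project report attached for review",
    "cheap loans approved instantly", "can you call me back tonight",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# No test_size given, so 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, random_state=42
)

print(len(X_train), len(X_test))  # 6 2
```

Passing `test_size=0.3` instead would give a 70:30 split.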
Our raw dataset consists of email messages. We cannot feed raw text to machine learning algorithms, because they train models through computation, and computation needs numerical values. So, let’s extract features from the raw dataset for training. To do that, we transform all the email messages into vectors of token counts using the CountVectorizer class. Here, we use unigrams and bigrams, and fit the vectorizer on the training examples only.
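As an illustration of what the vectorizer produces, here is a small sketch on two made-up training messages. `ngram_range=(1, 2)` asks for unigrams and bigrams; the messages and variable names are placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_messages = ["free prize now", "meeting at noon"]
test_messages = ["free meeting"]

# Unigrams and bigrams; the vocabulary is learned from training data only.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_counts = vectorizer.fit_transform(train_messages)

# The test data is only transformed, reusing the training vocabulary.
X_test_counts = vectorizer.transform(test_messages)

# The learned features include both single words and word pairs.
print(sorted(vectorizer.vocabulary_))
print(X_train_counts.shape, X_test_counts.shape)
```

Each row is one message and each column is one unigram or bigram seen during fitting. N-grams that appear only in the test data (such as "free meeting" here) are simply ignored.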
Now, we create a multinomial naive Bayes model using the sklearn API and train it on the features we extracted. Naive Bayes performs well on small datasets: it can generalize from a small number of training examples, where more complex models such as neural networks tend to overfit.
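A minimal sketch of the training step on six made-up messages (the data and variable names are placeholders, and `MultinomialNB` is left at its default settings, which is an assumption about the original code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training set: 1 = spam, 0 = non-spam.
train_messages = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free prize today",
    "meeting at noon tomorrow",
    "see you at lunch tomorrow",
    "project report attached for review",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Turn the text into unigram and bigram counts.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_counts = vectorizer.fit_transform(train_messages)

# Fit the multinomial naive Bayes model on the count features.
model = MultinomialNB()
model.fit(X_train_counts, train_labels)

# The classes the model learned and how many examples it saw of each.
print(model.classes_, model.class_count_)
```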
Let’s test the model by making predictions on the testing set. We transform the raw test data using the vectorizer we fitted earlier.
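The evaluation step can be sketched as follows, again on made-up data. The accuracy printed here says nothing about the real dataset; the point is only the order of calls: `transform` (not `fit_transform`) on the test messages, then `predict`, then `accuracy_score`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy training and test sets: 1 = spam, 0 = non-spam.
train_messages = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free prize today",
    "meeting at noon tomorrow",
    "see you at lunch tomorrow",
    "project report attached for review",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_messages = ["free prize offer", "lunch meeting tomorrow"]
test_labels = [1, 0]

vectorizer = CountVectorizer(ngram_range=(1, 2))
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_messages), train_labels)

# Reuse the fitted vectorizer so test features match the training columns.
X_test_counts = vectorizer.transform(test_messages)
predictions = model.predict(X_test_counts)

accuracy = accuracy_score(test_labels, predictions)
print(predictions, accuracy)
```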
The accuracy of our model on the testing data is a whopping 98.99%. WOW!!!
Now, let’s test our model with real-life emails and see what it predicts.
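Predicting on new emails follows the same pattern as testing. The sketch below wraps it in a small helper, `predict_emails`, which is a hypothetical name introduced here, and uses made-up messages in place of real emails.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def predict_emails(model, vectorizer, emails):
    """Return 1 (spam) or 0 (non-spam) for each raw email string."""
    features = vectorizer.transform(emails)  # reuse the fitted vocabulary
    return model.predict(features).tolist()


# Toy training data standing in for the real dataset.
train_messages = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free prize today",
    "meeting at noon tomorrow",
    "see you at lunch tomorrow",
    "project report attached for review",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer(ngram_range=(1, 2))
model = MultinomialNB()
model.fit(vectorizer.fit_transform(train_messages), train_labels)

# Made-up emails to classify.
new_emails = [
    "claim your free cash prize",
    "see you at the project meeting tomorrow",
]
results = predict_emails(model, vectorizer, new_emails)
print(results)
```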
Are you eager to see what our model predicts? Okay, here it is.
The model predicts 0 for all three emails. And as we previously defined, 0 means non-spam. That’s right! I tested it with emails I received from my employers, colleagues, and friends.
Okay, what about spam emails in my spam folder? Let’s test them.
The output of the above example is:
Nailed it! It predicts everything as spam. You are a savior ❤