Udacity Data Scientist Program
NLP (Natural Language Processing) is a subfield of artificial intelligence that allows machines to understand and manipulate human language. NLP has been applied in many fields. In this blog, we will go through the steps of classifying an email as spam or not using NLP techniques. But first, let's define the problem we are going to solve.
Most of us are familiar with spam emails. Cisco defines spam as unwanted junk email sent out in bulk to an indiscriminate recipient list. Typically, spam is sent for commercial purposes, and it can be sent in massive volume by botnets, networks of infected computers. Spam filtering is therefore an essential feature of email services such as Outlook and Gmail, and service providers rely extensively on machine learning techniques to filter and classify these messages.
The dataset we are using is a public text dataset. It contains two columns: text (the email body) and spam (the label). An email is labeled 1 if it is spam and 0 otherwise.
You can download it from Kaggle.
In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion [1].
An ETL (Extract, Transform, Load) pipeline refers to a set of steps that allow us to prepare our data for a machine learning model. Extracting is the process of downloading or collecting data from a source (in our case, Kaggle). Transforming the data allows us to clean and process it, and loading is the step where we save it into a database or file for further analysis or to use it as an input for building a machine learning model.
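To make this concrete, here is a minimal sketch of the ETL step using pandas and SQLAlchemy. The file name emails.csv and the choice of a local SQLite database are assumptions about how the downloaded Kaggle file is stored; only the text and spam columns described above are required.

```python
# Minimal ETL sketch (assumed file name: emails.csv with columns "text" and "spam")
import pandas as pd
from sqlalchemy import create_engine

# Extract: read the raw CSV downloaded from Kaggle
df = pd.read_csv("emails.csv")

# Transform: drop duplicate rows and rows with missing email text
df = df.drop_duplicates()
df = df.dropna(subset=["text"])

# Load: store the cleaned table in a local SQLite database for later use
engine = create_engine("sqlite:///emails.db")
df.to_sql("emails", engine, index=False, if_exists="replace")
```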
In this step, we are going to build an NLP pipeline, which will include:
1. Text Processing
This step follows the approach taught in the Udacity Data Scientist Program [2].
Tokenize
Given a plain text, we first normalize it by converting it to lowercase and removing punctuation, and finally split it up into words; these words are called tokens.
Clean
Remove stop words to reduce the vocabulary.
Normalize
In order to further simplify our text data, we can lemmatize or stem it in this step. Lemmatization refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma, whereas stemming chops off word endings with a simpler heuristic [3].
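Putting the three sub-steps together, a tokenize function could look like the sketch below. It uses NLTK and assumes the punkt, stopwords, and wordnet corpora have already been downloaded.

```python
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def tokenize(text):
    # Normalize: lowercase and replace punctuation with spaces
    text = re.sub(r"[^a-z0-9]", " ", text.lower())
    # Tokenize: split the text into word tokens
    tokens = word_tokenize(text)
    # Clean: remove English stop words
    tokens = [t for t in tokens if t not in stop_words]
    # Normalize further: reduce each word to its lemma
    return [lemmatizer.lemmatize(t) for t in tokens]
```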
2. SKLearn Pipeline
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters [4].
The components of sklearn pipeline are the following:
- The CountVectorizer(), where we pass in our tokenize function to build the vocabulary and encode new documents using that vocabulary.
- The TfidfTransformer(), which transforms a count matrix to a normalized tf or tf-idf representation. Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval that has also found good use in document classification.
- The machine learning classifier. In this case I am using a simple algorithm, Naive Bayes, as it is known as one of the best text classifiers. A sketch of the assembled pipeline follows this list.
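Here is a sketch of how these components can be assembled, reusing the tokenize function and the DataFrame df from the earlier sketches; the 70/30 train/test split and the column names are assumptions.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

pipeline = Pipeline([
    ("vect", CountVectorizer(tokenizer=tokenize)),   # bag-of-words counts
    ("tfidf", TfidfTransformer()),                   # tf-idf weighting
    ("clf", MultinomialNB()),                        # Naive Bayes classifier
])

# Hold out part of the data for evaluation (assumed 70/30 split)
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["spam"], test_size=0.3, random_state=42
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```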
3. Evaluation
In order to evaluate the performance of our machine learning model, I am using the following metrics:
Accuracy: the fraction of predictions the model classified correctly.
Confusion Matrix: a summary table for evaluating a classification model; it breaks down the number of correct and incorrect predictions by each class. A sketch of computing both metrics follows.
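A minimal sketch of this evaluation with sklearn.metrics, assuming the y_test and y_pred arrays from the pipeline sketch above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Fraction of correct predictions on the held-out test set
print("Accuracy:", accuracy_score(y_test, y_pred))

# Breakdown of correct and incorrect predictions per class
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```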
We can see that we have 0 false negatives and 0 true negatives, while true positives = 1398 and false positives = 493, which is not bad.
In this post we have built an ETL pipeline that prepares the data for building a machine learning model that classifies emails.
We implemented basic NLP and machine learning techniques. The results are not bad: the accuracy of this model, without hyper-parameter tuning or adding other features, is 74%, which is reasonable given that we have a very small dataset and are only using basic NLP and ML techniques.
Next step: I will try to improve the model using different NLP techniques.
[1] ETL Pipeline
[2] Udacity Data Scientist Program
[3] Stemming-and-Lemmatization
[4] sklearn.pipeline
Happy learning!