
4.1 Train Test Split
Using the scikit-learn library, we can split the data into train and test sets. Here I have split the data into 67% (training data) and 33% (testing data).
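The split described above can be sketched as follows. The variable names X and y and the toy messages are assumptions for illustration; test_size=0.33 holds out 33% of the data for testing.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real message texts and labels (assumed names).
X = ["free prize now", "see you at lunch", "win cash today", "meeting at 5pm"]
y = ["spam", "ham", "spam", "ham"]

# Hold out 33% of the samples for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
```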
4.2 Dealing with Text (Natural Language) Data
Now it’s time to talk about how to deal with text data. We can’t pass raw text directly to a machine learning model, as models only understand numerical input.
To solve this problem, we will use a TF-IDF Vectorizer (Term Frequency–Inverse Document Frequency). It is a standard technique for transforming text into a meaningful numerical representation, which can then be used to fit a machine learning algorithm for prediction.
We could also use a Count Vectorizer (Bag of Words), but it only records raw counts; unlike TF-IDF, it does not down-weight words that appear in many documents.
We can use the TF-IDF vectorizer from the scikit-learn library. Next, create a TfidfVectorizer object and call fit_transform on the data, which converts it into a document–term matrix.
Here, 3733 is the number of messages in X_train, and 5772 is the number of unique words (the vocabulary) extracted from those messages.
4.3 Pipelining
We use a Pipeline because the same preprocessing steps must be repeated on the test data before making predictions, and doing that manually is tedious.
What is convenient about the Pipeline object is that it performs all of these steps for you: you can provide the raw data directly, and it will both vectorize it and run the classifier in a single step.
Note: when we predict on custom text later, we can pass the raw text directly to the Pipeline, and it will predict the label for us.
If you are not familiar with Pipeline, it takes a list of tuples, where each tuple contains a name of your choice and the transformer or estimator to apply at that step.
from sklearn.pipeline import Pipeline
We will import the MultinomialNB model from the scikit-learn library. Next, we create a model named “text_mnb” using Pipeline, providing first a TfidfVectorizer() object and then a MultinomialNB() object. The order matters: TfidfVectorizer runs first, its output feeds into the model, and finally we fit the pipeline on X_train and y_train.
Every internal step is now handled by the Pipeline, which executes them in sequence.
To make predictions, we simply pass the X_test data; the Pipeline object automatically vectorizes it and produces predictions for us.
“y_preds_mnb” contains the predictions our model made on X_test, reaching an accuracy of approximately 97%, far better than random chance.
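The pipeline described above can be sketched as below. The step names "tfidf" and "mnb" and the toy training data are assumptions; with the real dataset, X_train and y_train would come from the earlier train/test split.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the real training data (assumed for illustration).
X_train = ["win a free prize now", "lunch at noon?",
           "claim cash prize today", "see you tomorrow"]
y_train = ["spam", "ham", "spam", "ham"]

text_mnb = Pipeline([
    ("tfidf", TfidfVectorizer()),   # step 1: turn raw text into TF-IDF vectors
    ("mnb", MultinomialNB()),       # step 2: classify the vectors
])

# Fitting the pipeline fits the vectorizer and the classifier together.
text_mnb.fit(X_train, y_train)

# Raw text can be passed straight in; the pipeline vectorizes it internally.
y_preds_mnb = text_mnb.predict(["free prize waiting for you"])
print(y_preds_mnb)
```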
Accuracy alone cannot tell us whether the model is working well, especially on an imbalanced dataset. We will use the scikit-learn library to generate a confusion_matrix and a classification_report.
Here we can see that the “ham” label is predicted well, but “spam” predictions are weaker, so we can’t call the model excellent yet: it is lacking in predicting spam accurately.
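A minimal sketch of evaluating beyond accuracy. The y_test labels and predictions here are made up for illustration (one spam message is misclassified as ham, mirroring the weakness described above).

```python
from sklearn.metrics import confusion_matrix, classification_report

# Assumed toy labels and predictions (not the real model's output).
y_test  = ["ham", "ham", "spam", "ham", "spam"]
y_preds = ["ham", "ham", "ham",  "ham", "spam"]  # one spam missed

# Rows = true labels, columns = predicted labels, in the order given.
cm = confusion_matrix(y_test, y_preds, labels=["ham", "spam"])
print(cm)

# Per-class precision, recall, and F1 expose the weak spam recall.
print(classification_report(y_test, y_preds, labels=["ham", "spam"]))
```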
Let’s try the same problem with an SVM (Support Vector Machine).
Linear SVC (Support Vector Classifier)
The same steps are performed as above; the only difference is that we import LinearSVC from the scikit-learn library.
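Swapping in LinearSVC looks like this sketch. The pipeline step name "svc", the variable name text_svc, and the toy data are assumptions; only the classifier changes relative to the Naive Bayes pipeline.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the real training data (assumed for illustration).
X_train = ["win a free prize now", "lunch at noon?",
           "claim cash prize today", "see you tomorrow"]
y_train = ["spam", "ham", "spam", "ham"]

text_svc = Pipeline([
    ("tfidf", TfidfVectorizer()),   # same vectorization step as before
    ("svc", LinearSVC()),           # only the classifier is swapped
])

text_svc.fit(X_train, y_train)
preds = text_svc.predict(["free cash prize"])
print(preds)
```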