NEO Share

Sharing The Latest Tech News
SMS Spam Classifier (Natural Language Processing)

February 17, 2021 by systems

4.1 Train Test Split

Using the scikit-learn library, we can split the data into training and test sets. Here I have split the data into 67% (training data) and 33% (testing data).
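A minimal sketch of this split, using a toy list of messages in place of the full SMS dataset (the column/variable names here are illustrative, not taken from the article's code):

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the SMS data: X holds the message text, y the ham/spam labels.
X = ["Free entry to win a prize", "Are we still on for lunch?",
     "URGENT! Claim your reward now", "See you at the meeting"] * 25
y = ["spam", "ham", "spam", "ham"] * 25

# test_size=0.33 gives the 67% / 33% split described above;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

print(len(X_train), len(X_test))  # 67 and 33 of the 100 toy messages
```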

4.2 Dealing with Text (Natural Language) Data

Now it’s time to talk about how to deal with the text data. We can’t pass raw text directly to a machine learning model, as the model only understands data in numeric form.

To solve this problem, we will use a TF-IDF Vectorizer (Term Frequency–Inverse Document Frequency). It is a standard algorithm for transforming text into a meaningful numeric representation that can then be used to fit a machine learning model for prediction.

We could also use a Count Vectorizer (bag of words), but unlike the TF-IDF vectorizer it does not weight words by how informative they are; it only counts them.

We can use the TF-IDF vectorizer from the scikit-learn library. Next, create a TfidfVectorizer object and call fit_transform on the data, which converts it into a matrix of sentences and words.

Here, 3733 is the number of sentences in X_train, and 5772 is the total number of unique words obtained from those sentences.

4.3 Pipelining

We use pipelining because the same preprocessing steps must be applied to the test data before making predictions, and repeating them by hand is tiresome.

What is convenient about the Pipeline object is that it performs all of these steps for you in a single cell: you can provide the raw data directly, and it will be vectorized and run through the classifier in a single step.

Note: when we predict on custom text later, we can pass the raw text directly to the Pipeline, and it will predict the label for us.

If you are not familiar with Pipeline, it takes a list of tuples, where each tuple contains a name you choose and the transformer or estimator to run at that step.

from sklearn.pipeline import Pipeline

We will import the MultinomialNB model from the scikit-learn library. Next, we create a model named “text_mnb” using Pipeline, providing the TfidfVectorizer() object first and then the MultinomialNB() object. The order matters: TfidfVectorizer must run first so that its output is fed to the model. Finally, we fit the pipeline with X_train and y_train.
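A sketch of that pipeline, using the article's "text_mnb" name but toy training data in place of the real X_train and y_train:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Each step is a (name, estimator) tuple; steps run in order,
# so the text is vectorized before it reaches the classifier.
text_mnb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('mnb', MultinomialNB())])

# Toy training data standing in for X_train / y_train.
X_train = ["win a free prize now", "lunch tomorrow?",
           "urgent claim your reward", "meeting moved to 3pm"]
y_train = ["spam", "ham", "spam", "ham"]

# Fitting the pipeline vectorizes and trains in one call.
text_mnb.fit(X_train, y_train)

# As noted above, raw custom text can be passed straight to the pipeline.
print(text_mnb.predict(["Congratulations, you won a free prize!"]))
```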

Every internal step is now handled by the Pipeline, which performs them in order.

To make predictions, we simply pass the X_test data to the Pipeline object, which automatically vectorizes it and predicts for us.

“y_preds_mnb” contains the model’s predictions on X_test, reaching an accuracy of approximately 97%, which is far better than random chance.

We must remember that accuracy alone cannot show that the model is working well. We will use the scikit-learn library to generate a confusion_matrix and a classification_report.
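These metrics can be computed as sketched below, with toy labels standing in for the real y_test and y_preds_mnb (the values here are illustrative, not the article's results):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report)

# Toy true labels and predictions standing in for y_test / y_preds_mnb.
y_test = ["ham", "spam", "ham", "ham", "spam", "ham"]
y_preds = ["ham", "ham", "ham", "ham", "spam", "ham"]

print(accuracy_score(y_test, y_preds))         # overall accuracy alone can mislead
print(confusion_matrix(y_test, y_preds))       # rows = true labels, cols = predicted
print(classification_report(y_test, y_preds))  # per-class precision/recall/F1
```

Note how the confusion matrix exposes the missed spam message even though the overall accuracy looks decent.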

Here we can see that the “ham” label is predicted well, but the “spam” predictions are weaker, so we cannot call the model excellent: it struggles to identify spam accurately.

Let’s try the same problem with an SVM (Support Vector Machine).

Linear SVC (Support Vector Classifier)

The same steps as above will be performed; the only difference is that we import LinearSVC from the scikit-learn library.
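The swap can be sketched as follows, again with toy data; only the classifier step of the pipeline changes (the "text_svc" name is illustrative, not from the article):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Identical pipeline to the Naive Bayes one, with LinearSVC as the final step.
text_svc = Pipeline([('tfidf', TfidfVectorizer()),
                     ('svc', LinearSVC())])

# Toy training data standing in for X_train / y_train.
X_train = ["win a free prize now", "lunch tomorrow?",
           "urgent claim your reward", "meeting moved to 3pm"]
y_train = ["spam", "ham", "spam", "ham"]

text_svc.fit(X_train, y_train)
print(text_svc.predict(["free reward waiting"]))
```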

Filed Under: Machine Learning
