In this article, we’ll go through the process of building a sentiment analysis model using Python. Specifically, we’ll create a bag-of-words model using an SVM and then interpret that model to understand how it works. Along the way, you will learn the basics of text processing. We’ll go over key pieces of code, and you can find the full project on GitHub. Before we dive into all of that, let’s start by explaining what sentiment analysis is.
Sentiment is an idea or feeling that someone expresses in words. With that in mind, sentiment analysis is the process of predicting/extracting these ideas or feelings. We want to know if the sentiment of a piece of writing is positive, negative or neutral. Exactly what we mean by positive/negative sentiment depends on the problem we’re trying to solve.
For the BTS example, we are trying to predict a listener’s opinion. Positive sentiment would mean the listener enjoyed the song. Alternatively, we could use sentiment analysis to flag potential hate speech on our platform. In this case, negative sentiment would mean the text contained racist/sexist opinions. Other examples include predicting irony/sarcasm or even a person’s intentions (i.e. are they planning to buy a product?).
So, there are many types of sentiment analysis models and many ways to do sentiment analysis. We’ll concentrate on applying one of these methods: creating a bag-of-words model using an SVM. Let’s start with the Python packages that will help us do this.
Packages
In the code below, we have some standard packages: Pandas/NumPy to handle our data and Matplotlib/Seaborn to visualise it. For modelling, we use the svm package from scikit-learn, along with its metrics package to measure the performance of our model. The last set of packages is used for text processing. These will help us clean our text data and create model features.
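The snippet below is a minimal sketch of these imports; the specific text-processing choices (re, NLTK’s stopwords and scikit-learn’s CountVectorizer) are assumptions that match the steps described later:

```python
import pandas as pd                      # data handling
import numpy as np
import matplotlib.pyplot as plt          # visualisation
import seaborn as sns

from sklearn import svm                  # modelling
from sklearn import metrics              # model performance

# text processing (assumed choices: regex, NLTK stopwords, scikit-learn vectoriser)
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
```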
Dataset
To train our sentiment analysis model, we use a sample of tweets from the sentiment140 dataset. This dataset contains 1.6 million tweets that have been classified as having either a positive or negative sentiment. You can see some examples in Table 1.
Using the code below, we load the entire dataset. 1,600,000 rows is a lot of data, especially considering we will have to clean the text and create features from it. So, to make things more manageable, we take a random sample of 50,000 tweets.
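The snippet below is a rough sketch of this step; the file name, column names and random seed are assumptions (the raw sentiment140 CSV is assumed to have no header row):

```python
# Load the sentiment140 dataset (file name and column names are assumptions)
cols = ['sentiment', 'id', 'date', 'query', 'user', 'text']
tweets = pd.read_csv('sentiment140.csv', encoding='latin-1', names=cols)

# 1.6 million rows is a lot, so take a random sample of 50,000 tweets
tweets = tweets.sample(n=50000, random_state=42)
```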
Text cleaning
The next step is to clean the text. We do this to remove aspects of the text that are not important and, hopefully, make our models more accurate. Specifically, we will make our text lower case and remove punctuation. We will also remove very common words, known as stopwords, from the text.
To do this, we have created the function below, which takes a piece of text, performs the above cleaning and returns the cleaned text. We then apply this function to every tweet in our dataset. We can see some examples of how this function cleans text in Table 2.
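The sketch below shows one way to write this cleaning function, assuming NLTK’s English stopword list and a regex for punctuation (the column names are assumptions):

```python
# NLTK's English stopword list (may require nltk.download('stopwords') once)
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Lower-case the text, strip punctuation and remove stopwords."""
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    words = [word for word in text.split() if word not in stop_words]
    return ' '.join(words)

# Apply the cleaning function to every tweet
tweets['clean_text'] = tweets['text'].apply(clean_text)
```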
Notice how all the cleaned tweets are lower case and have no punctuation. The words ‘there’, ‘is’, ‘on’, ‘and’, ‘so’, ‘to’, ‘be’ and ‘how’ have all been removed from the 1st tweet. These are all examples of stopwords. We would expect these words to be common in both positive and negative tweets. In other words, they will not tell us anything about the sentiment of a tweet. So, by removing them we will hopefully be left with words that do convey sentiment.
It is important to think about how text cleaning will impact your model. For some problems, things like stopwords and punctuation may be important. For example, angry customers may be more likely to use exclamation marks!!! If you are not sure, you can always treat text cleaning as a hyperparameter. You could train models both with and without stopwords and see the impact on accuracy.
Feature engineering (bag-of-words)
Even after cleaning, like all ML models, SVMs cannot understand text. What we mean by this is that our model cannot take in the raw text as an input. We have to first represent the text in a mathematical way. In other words, we must transform the tweets into model features. One way to do this is by using N-grams.
N-grams are sets of N consecutive words. In Figure 2, we see an example of how a sentence is broken down into 1-grams (unigrams) and 2-grams (bigrams). Unigrams are just the individual words in the sentence. Bigrams are the set of all two consecutive words. Trigrams (3-grams) would be the set of all 3 consecutive words and so on. You can represent text mathematically by simply counting the number of times certain N-grams occur.
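To make this concrete, here is a small illustrative example (the sentence is made up):

```python
sentence = "there is nothing better than music"
words = sentence.split()

unigrams = words
bigrams = [' '.join(pair) for pair in zip(words, words[1:])]

print(unigrams)  # ['there', 'is', 'nothing', 'better', 'than', 'music']
print(bigrams)   # ['there is', 'is nothing', 'nothing better', 'better than', 'than music']
```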
For our problem, we take the 1000 most common unigrams/bigrams from our tweets. That is, we count the number of times these N-grams occur in our corpus of cleaned tweets and take the top 1000. To create model features, we then count the number of times these N-grams occur in each of the tweets. This approach is known as bag-of-words.
Table 3 gives an example of a feature matrix created using this approach. The top row gives each of the 1000 N-grams. There is a numbered row for each of the tweets. The numbers within the matrix give the number of times that N-gram occurs within the tweet. For example, “sorry” occurs once in tweet 2. Essentially, we are representing each tweet as a vector. In other words, we are vectorising our tweets using counts of N-grams.
The code below is used to create one of these feature matrices. We start by splitting our dataset into a training (80%) and testing (20%) set. We then define a CountVectorizer that will vectorise our tweets using the top 1000 unigrams/bigrams and apply it to our training set. The .fit_transform() function will first obtain the 1000 most common N-grams and then count the number of times they occur in each tweet.
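The snippet below is a minimal sketch of this step, assuming the cleaned tweets and sentiment labels from earlier (the random seed is an assumption):

```python
from sklearn.model_selection import train_test_split

# 80% training / 20% testing split
X_train_text, X_test_text, y_train, y_test = train_test_split(
    tweets['clean_text'], tweets['sentiment'],
    test_size=0.2, random_state=42)

# Count the top 1000 unigrams and bigrams in the training tweets
vectoriser = CountVectorizer(max_features=1000, ngram_range=(1, 2))
X_train = vectoriser.fit_transform(X_train_text)
```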
We follow a similar process to vectorise our testing set. In this case, we use the .transform() function. This will count the number of times each N-gram occurs using the same list as the training set. It is important to use the same list of N-grams to vectorise each set. Using a different list for the testing set would mean the features no longer correspond to the ones the model was trained on, causing it to make incorrect predictions.
Finally, we scale the feature matrix using min-max scaling. This ensures all our features are in the same range. This is important as SVMs can be influenced by features with large values. As with the list of N-grams, we scale both sets in the same way (i.e. using the max and min values from the training set).
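Continuing the sketch, the testing set is vectorised with the same fitted CountVectorizer and both sets are scaled using the training set’s values (converting to dense arrays is an implementation choice here, since MinMaxScaler does not accept sparse input):

```python
from sklearn.preprocessing import MinMaxScaler

# Vectorise the testing set with the N-gram list learned from the training set
X_test = vectoriser.transform(X_test_text)

# Min-max scale both sets using the training set's min and max values
# (dense conversion is manageable at 50,000 tweets x 1000 features)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train.toarray())
X_test = scaler.transform(X_test.toarray())
```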
We have transformed our testing set using the N-gram list and scaling parameters obtained from the training set. As mentioned, this is done so that both sets are vectorised in the same way. It is also done to avoid data leakage. In practice, our model will be used on new/unseen tweets. These tweets, along with their N-grams and weights, would not be available during training. So, to get a better indication of future performance, our model should be tested on a set that has been treated as unseen.
Modelling
With our training and testing sets ready, we can now train our model. We do this in the code below, where we train an SVM on our training set. Specifically, we use an SVM with a linear kernel and set the penalty parameter to 1. We then use this model to make predictions on the testing set and calculate the accuracy of those predictions.
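The snippet below is a minimal sketch of this step, using scikit-learn’s SVC with a linear kernel:

```python
# Train an SVM with a linear kernel and penalty parameter C=1
model = svm.SVC(kernel='linear', C=1)
model.fit(X_train, y_train)

# Predict on the testing set and measure accuracy
y_pred = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)
```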
In the end, the model had an accuracy of 73.4% on the testing set. We can dive a bit deeper into the model’s performance by looking at the confusion matrix in Figure 2. There are 915 false negatives compared to 1747 false positives. In other words, most of the errors come from the model incorrectly predicting tweets with a negative sentiment as having a positive sentiment. So, for a first draft, our model is not too bad, but there is a lot of room for improvement.
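The confusion matrix itself can be computed and plotted along these lines; the heatmap styling and label ordering are assumptions:

```python
# Confusion matrix of the test-set predictions
conf_matrix = metrics.confusion_matrix(y_test, y_pred)

sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['negative', 'positive'],
            yticklabels=['negative', 'positive'])
plt.xlabel('Predicted sentiment')
plt.ylabel('Actual sentiment')
plt.show()
```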
We can improve our model’s performance in a few ways. We can spend more time tuning the model’s hyperparameters. As mentioned above, we have set the penalty parameter to 1. This was actually chosen after testing a few different values (i.e. 0.001, 0.01, 0.1, 1 and 10) and seeing which one had the highest k-fold cross-validation accuracy. Other hyperparameters, such as the kernel and text cleaning steps, can be tuned in the same way. We could also interpret our model, figure out how it works and make changes based on these findings.
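As an illustrative sketch, the penalty parameter could be tuned with cross-validation roughly as follows (5 folds is an assumption):

```python
from sklearn.model_selection import cross_val_score

# Compare candidate penalty values using k-fold cross-validation on the training set
for C in [0.001, 0.01, 0.1, 1, 10]:
    candidate = svm.SVC(kernel='linear', C=C)
    scores = cross_val_score(candidate, X_train, y_train, cv=5)
    print(C, scores.mean())
```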
Interpreting our model
One way of interpreting an SVM is by looking at the model weights/coefficients. Through the process of training the SVM, each of the 1000 N-gram features is given a weight. N-grams with positive weights are associated with positive sentiment. Similarly, those with negative weights are associated with negative sentiment.
In Figure 3, we visualise the coefficients of 15 of the 1000 N-grams. The first 5 all have high positive coefficients. This makes sense, as you would probably expect tweets with words like ‘happy’ or ‘smile’ to have a positive sentiment. Similarly, the words with negative coefficients, ‘bored’, ‘hate’, etc., would be associated with negative sentiment. Notice that there are also N-grams that have coefficients close to 0.
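A sketch of how these coefficients can be pulled out of the fitted model and plotted (get_feature_names_out requires a recent scikit-learn version; which N-grams to display is a choice):

```python
# Pair each N-gram with its coefficient from the linear SVM
coefficients = pd.DataFrame({
    'ngram': vectoriser.get_feature_names_out(),
    'coef': model.coef_[0]
}).sort_values('coef')

# Show, for example, the 5 most negative and 5 most positive N-grams
selection = pd.concat([coefficients.head(5), coefficients.tail(5)])
sns.barplot(data=selection, x='coef', y='ngram')
plt.show()
```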
N-grams with small coefficients would not have much impact on our model’s predictions. The coefficients could be small because the N-grams tend to occur in tweets with both positive and negative sentiment. In other words, they do not tell us anything about a tweet’s sentiment. Like with stopwords, we could remove these words and hopefully improve the performance of our model.
Hyperparameter tuning and model interpretation are some of the many ways we can improve accuracy. You may also get better results by experimenting with different models, such as neural networks. Instead of bag-of-words, you could use more advanced techniques, like word embeddings, to vectorise the tweets. There are many options and, hopefully, this article has given you a good starting point.