In Sentiment Analysis with Logistic Regression (Part 1), we covered the overall approach to sentiment analysis with Logistic Regression. In this post, we are going to talk about how to do sentiment analysis with Naive Bayes.
For this topic, I’m going to talk about:
- Introduce probability and the Bayes’ Rule
- What is the Bayes’ Rule
- Naive Bayes for sentiment analysis
- Log Likelihood for dealing with numerical underflow
- Training Naive Bayes
- Inference and Testing Naive Bayes Model
- Naive Bayes Assumptions
- Optional: Naive Bayes Applications
- Optional: Error Analysis
Disclaimer: This post is based on week 2 of the Natural Language Processing with Classification and Vector Spaces course on Coursera. Credit for most of the figures below goes to the course.
Check out my final project here: Click Link
Introduction: Imagine we have an extensive corpus of tweets, each of which can be categorized as either positive or negative sentiment, but not both. Within that corpus, the word “happy” sometimes appears in tweets labeled positive and sometimes in tweets labeled negative.
How do we calculate the probability of positive tweets or negative tweets?
Probability: One way to think about probabilities is by counting how frequently events occur.
- The probability of an event (either positive or negative) = the number of times that specific event occurs / the total number of events.
- The sum of all probabilities by all classes has to equal 1.
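The counting view above can be sketched in a few lines of Python. The labels below are made up for illustration (13 positive and 7 negative tweets); they are an assumption, not data from the course.

```python
from collections import Counter

# Hypothetical labels for a tiny corpus of 20 tweets (illustrative only).
labels = ["positive"] * 13 + ["negative"] * 7

counts = Counter(labels)
total = sum(counts.values())

# P(class) = number of tweets in that class / total number of tweets
probabilities = {label: n / total for label, n in counts.items()}

print(probabilities)
# The class probabilities add up to 1 (up to floating-point rounding).
print(sum(probabilities.values()))
```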
Probability of the intersection
To compute the probability of 2 events happening, like “happy” and “positive” in the picture below, we would be looking at the intersection, or overlap of events. In this case red and blue boxes overlap in 3 boxes. So the answer is 3/20.
To understand the Bayes’ rule, we first talk about conditional probabilities.
Conditional probabilities help us reduce the sample search space. Instead of looking at the whole corpus, we only look at the part of the corpus that satisfies the condition.
- For example, to get P(Positive | “Happy”), we only count the positive tweets among the tweets that contain “happy”, not across the entire corpus, as illustrated below.
In summary, conditional probabilities can be interpreted as the probability of an outcome A knowing that event B already happened. In other words, looking at the elements of set B, the chance that one also belongs to set A.
As a formula, P(A|B) = P(A & B) / P(B).
For example, if we want to get the P(Positive| Happy), we would only need to search the blue circle.
- The numerator will be the red part where “positive” and “happy” intersect and
- The denominator will be the blue part, which contains every occurrence of “happy” in the corpus.
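Here is a minimal sketch of the intersection and the conditional probability using Python sets. The tweet indices are hypothetical; they are chosen only so that the overlap is 3 out of 20, matching the 3/20 example above. The number of “happy” tweets (4) is an assumption.

```python
# Hypothetical corpus of 20 tweets, identified by index (illustrative only).
all_tweets = set(range(20))
positive = set(range(13))      # tweets labeled positive (assumed)
happy = {10, 11, 12, 13}       # tweets containing the word "happy" (assumed)

both = positive & happy        # positive tweets that also contain "happy"

p_intersection = len(both) / len(all_tweets)   # P(Positive & "happy") = 3/20
p_happy = len(happy) / len(all_tweets)         # P("happy") = 4/20

# Conditional probability from the formula P(A|B) = P(A & B) / P(B)
p_positive_given_happy = p_intersection / p_happy

# Same quantity obtained by restricting the search space to "happy" tweets
p_restricted = len(both) / len(happy)

print(p_intersection, p_positive_given_happy, p_restricted)
```

Both routes give the same answer (up to floating-point rounding), which is the point of "reducing the search space".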
Derive Bayes’ rule
Given two equations:
- P(A|B) = P(A&B)/P(B)
- P(B|A) = P(A&B)/P(A)
Please note that the intersection probability P(A&B) appears in both formulas. Rearranging formula 2 gives P(A&B) = P(B|A)*P(A), and substituting this into formula 1 gives P(A|B) = P(B|A)*P(A)/P(B). In summary:
- Bayes’ rule is based on conditional probabilities
- P(X|Y) = P(Y|X)*P(X)/P(Y)
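As a quick sanity check of the rule, here is a small sketch. The function name `bayes_rule` and all the numbers are assumptions made up for illustration (20 tweets, 13 positive, 4 containing “happy”, 3 in both).

```python
def bayes_rule(p_y_given_x, p_x, p_y):
    """Return P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y

# Hypothetical counts (illustrative only).
p_positive = 13 / 20               # P(Positive)
p_happy = 4 / 20                   # P("happy")
p_happy_given_positive = 3 / 13    # P("happy" | Positive)

# P(Positive | "happy") obtained through Bayes' rule
p_positive_given_happy = bayes_rule(p_happy_given_positive, p_positive, p_happy)

print(p_positive_given_happy)
```

With these numbers the result agrees with the direct count, 3 positive tweets out of 4 “happy” tweets.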
Introduction: Naive Bayes is an example of supervised machine learning. It’s called naive because this method makes the assumption that the features we’re using for classification are all independent, which in reality is rarely the case. As we will see, however, it still works nicely as a simple method for sentiment analysis.
Step 1: We begin with two corpora, one of positive tweets and the other of negative tweets.
Step 2: Extract the vocabulary, that is, all the different words that appear in the corpora, along with their counts. In other words, we count how many times each word occurs in the positive corpus and in the negative corpus.
Step 3: Sum the word counts in the positive corpus and in the negative corpus. See below for an illustration of the first 3 steps.
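The first three steps can be sketched as follows. The four tweets are a toy corpus assumed for illustration, and splitting on whitespace stands in for real preprocessing.

```python
from collections import Counter

# Step 1: toy corpora (assumed for illustration only).
positive_tweets = ["I am happy because I am learning NLP", "I am happy not sad"]
negative_tweets = ["I am sad I am not learning NLP", "I am sad not happy"]

def count_words(tweets):
    """Step 2: count how many times each word occurs in a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.split())
    return counts

pos_counts = count_words(positive_tweets)
neg_counts = count_words(negative_tweets)

# The vocabulary is every distinct word seen in either corpus.
vocabulary = sorted(set(pos_counts) | set(neg_counts))

# Step 3: total number of words in each corpus.
n_pos = sum(pos_counts.values())
n_neg = sum(neg_counts.values())

for word in vocabulary:
    # A Counter returns 0 for words it has never seen.
    print(word, pos_counts[word], neg_counts[word])
print("totals:", n_pos, n_neg)
```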
Step 4: Compute the conditional probability of each word given each class, that is, the word’s count in that class divided by the total word count of that class. Within each class, the conditional probabilities sum to 1.
Some important notes for step 4 (please follow along with the picture below):
- Many words have a nearly identical conditional probability. Like I, am, learning, and NLP. The words that are equally probable don’t add anything to the sentiment. They are considered as neutral words.
- Some words have a significant difference between their probabilities in the two classes, such as “happy”, “sad”, “not”. These are considered power words that tend to express one sentiment or the other. They carry a lot of weight in determining a tweet’s sentiment.
- Some words only appear in one class, such as “because”. When this happens, the word’s probability in the other class is zero, so we have no way of comparing the positive and negative classes, which becomes a problem for the probability calculation later. To avoid this, we smooth the probability function using a technique called Laplacian smoothing, which is discussed in a later section.
Step 5: Apply the Naive Bayes inference rule for binary classification, using the conditional probability table from step 4. That is, for a given new tweet:
- For each word in the tweet, we take the ratio of its probability in the positive class to its probability in the negative class, and multiply these ratios together.
- If the product > 1, the tweet has a positive sentiment; otherwise it has a negative sentiment.
See picture below for an example:
Motivation: In step 4 of the previous section, we saw that a word can have a nonzero probability in one class but a zero probability in the other. This creates a problem when we later apply the Naive Bayes inference rule in step 5, because the ratio then involves a multiplication or a division by zero. To avoid this, we use a technique called Laplacian smoothing.
The Laplacian Smoothing Approach: Please follow along with the picture below
- Instead of just counting the number of times a word appears in a given class, we add 1 to that count in the numerator, so that no word has a count of 0.
- Adding 1 to all the frequencies (numerator) means the probabilities are no longer correctly normalized by the total word count of the class. Since there are V unique words in the whole vocabulary, we add V to the denominator to account for the V extra counts added across the numerators.
- After the adjustment, all the probabilities in each column (class) still sum to one.
Example: Using the previous conditional probability table, let’s see how Laplacian smoothing fixes the zero probability for “because” in the negative class.
- First calculate the # of unique words in the whole vocabulary. In this case, V = 8
- Apply the Laplacian Smoothing formula for each of the words in each of the classes. For example, P(I|”Po”) = (3+1)/(13+8) = 0.19
- Note that the normalization by adding V to the denominator for each class will result in the sum of probabilities = 1 in each class, which is what we want.
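The smoothing formula can be sketched like this. The count for “I” (3), the positive total (13) and V = 8 come from the example above; the remaining counts are assumed for illustration, and `smoothed_probabilities` is a hypothetical helper name.

```python
# Word counts per class. Only "I" = 3, the positive total of 13 and V = 8
# come from the example above; the other counts are assumed.
pos_counts = {"I": 3, "am": 3, "happy": 2, "because": 1,
              "learning": 1, "NLP": 1, "sad": 1, "not": 1}
neg_counts = {"I": 3, "am": 3, "happy": 1, "because": 0,
              "learning": 1, "NLP": 1, "sad": 2, "not": 2}

vocab_size = len(set(pos_counts) | set(neg_counts))   # V = 8

def smoothed_probabilities(counts, vocab_size):
    """P(w | class) = (count(w, class) + 1) / (N_class + V)."""
    n_class = sum(counts.values())
    return {w: (c + 1) / (n_class + vocab_size) for w, c in counts.items()}

p_pos = smoothed_probabilities(pos_counts, vocab_size)
p_neg = smoothed_probabilities(neg_counts, vocab_size)

print(round(p_pos["I"], 2))   # (3 + 1) / (13 + 8), rounded to 2 decimals
print(p_neg["because"])       # no longer zero after smoothing
print(sum(p_pos.values()), sum(p_neg.values()))
```

Each column still sums to 1 (up to floating-point rounding), because the V ones added in the numerators are matched by the V added to the denominator.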
Motivation: Words can have many shades of emotional meaning. But for the purpose of sentiment classification, we simplify them into three categories: neutral, positive, and negative. Each category can be identified using the conditional probability table we created in the previous section.
Ratio of probabilities
Taking the ratio of the probability of each word in the positive class to the probability of that word in the negative class, we can identify the word’s emotional meaning:
- The further the ratio is above 1, the more positive the word is.
- If the ratio = 1, the word is neutral.
- The closer the ratio is to 0, the more negative the word is.
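A small sketch of how the ratio sorts words into the three categories. The probability values below are assumed for illustration, and `describe` is a hypothetical helper.

```python
# Hypothetical conditional probabilities (illustrative values only).
p_word_pos = {"I": 0.19, "happy": 0.14, "sad": 0.10}
p_word_neg = {"I": 0.19, "happy": 0.10, "sad": 0.14}

def ratio(word):
    """P(word | positive) / P(word | negative)."""
    return p_word_pos[word] / p_word_neg[word]

def describe(word, tolerance=1e-9):
    """Label a word as neutral, positive or negative based on its ratio."""
    r = ratio(word)
    if abs(r - 1.0) < tolerance:
        return "neutral"
    return "positive" if r > 1.0 else "negative"

for word in p_word_pos:
    print(word, round(ratio(word), 2), describe(word))
```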
Naive Bayes’ Inference
This ratio (from the previous section) is essential to Naive Bayes’ Inference for binary classification. In a real application, the Naive Bayes’ Inference rule has two parts:
- The first part is the ratio of positive tweets to negative tweets in the entire corpus (the prior ratio). It becomes important when we have an unbalanced dataset, as often happens in real life.
- The second part is the product, over the words in the tweet, of each word’s probability ratio between the two classes.
- The two parts multiplied together give the full Naive Bayes’ formula for binary classification.
- Naive Bayes’ is a simple, fast and powerful method we can quickly use to establish a baseline model.
- It’s a probabilistic model for classification.
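Putting the two parts together, a minimal sketch of the inference rule might look like this. The probability tables and the tweet counts are assumed for illustration, and `naive_bayes_ratio` is a hypothetical function name.

```python
# Hypothetical smoothed probability tables (illustrative values only).
p_word_pos = {"i": 4 / 21, "am": 4 / 21, "happy": 3 / 21,
              "because": 2 / 21, "learning": 2 / 21}
p_word_neg = {"i": 4 / 21, "am": 4 / 21, "happy": 2 / 21,
              "because": 1 / 21, "learning": 2 / 21}

def naive_bayes_ratio(words, n_pos_tweets, n_neg_tweets):
    """Prior ratio times the product of the per-word probability ratios."""
    score = n_pos_tweets / n_neg_tweets          # first part: prior ratio
    for w in words:
        # Words missing from the tables are skipped (treated as neutral).
        if w in p_word_pos and w in p_word_neg:
            score *= p_word_pos[w] / p_word_neg[w]   # second part
    return score

tweet = "i am happy because i am learning".split()
score = naive_bayes_ratio(tweet, n_pos_tweets=2, n_neg_tweets=2)
sentiment = "positive" if score > 1 else "negative"

print(score, sentiment)
```

With a balanced corpus the prior ratio is 1, so only the word ratios move the score away from 1.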
Motivation: Sentiment probability calculation requires multiplying many numbers with values between zero and one. Carrying out such multiplications on a computer runs the risk of numerical underflow, where the result is so small that it can’t be stored on your device.
Approach: To avoid that, we take the log of the product of the probabilities. Because the log of a product is the sum of the logs, multiplication turns into addition, which helps solve the numerical underflow issue.
To illustrate with an example, take the tweet: “I am happy because I am learning”
Step 1: Calculate the log of the ratio of the probabilities for each word
Step 2: Once we have the lambda (the log of the ratio) for each of the words, we can infer the sentiment of a new tweet by summing the lambdas of its words to get the log likelihood of the sentence. See below for an example.
- Since the log likelihood = 3.3, which is greater than 0, we infer that this tweet is positive.
- Before we apply the logarithm, the ratio of the probabilities lies between 0 and positive infinity.
- Please note that, after the log, the ratio lies between negative infinity and positive infinity.
- Words are often emotionally ambiguous but usually we can simplify them into 3 categories: neutral, positive and negative.
- We can measure where the words fall within these 3 categories for binary classification.
- To do so, we can calculate the ratio of the conditional probabilities for each word in each category.
- To avoid numerical underflow from multiplying many probabilities between 0 and 1, we take the logarithm of each ratio, called lambda.
- We sum the lambdas of the words in a sentence to infer its sentiment. If the log likelihood > 0, the sentiment is positive; if < 0, negative; if = 0, neutral.
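To see the underflow problem and the log fix in action, here is a small sketch. The 400 probabilities of 0.01 and the word ratios are made-up values, so the numbers will not reproduce the 3.3 from the example above.

```python
import math

# 400 hypothetical word probabilities of 0.01 each (illustrative only).
probabilities = [0.01] * 400

# Multiplying them directly underflows: the true value, 1e-800, is far
# smaller than the smallest positive float, so the product becomes 0.0.
product = 1.0
for p in probabilities:
    product *= p
print(product)

# Summing the logs instead keeps the value representable.
log_sum = sum(math.log(p) for p in probabilities)
print(log_sum)

# Lambda is the log of a word's probability ratio (ratios assumed here).
ratios = {"happy": 1.5, "because": 2.0, "learning": 1.0}
lambdas = {word: math.log(r) for word, r in ratios.items()}
print(lambdas)
```

A ratio of 1 maps to a lambda of 0, ratios above 1 map to positive lambdas, and ratios below 1 map to negative lambdas.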
Unlike logistic regression or deep learning, there is no gradient descent involved in training the Naive Bayes model. Instead, we are just counting frequencies of words in a corpus.
6 steps to train Naive Bayes
Step 0: Collect and annotate the corpus. For sentiment analysis, this step means identifying positive and negative tweets.
Step 1: Preprocess the tweets/sentences.
Step 2: Compute the word frequency table for the vocabulary.
Step 3: Calculate the conditional probability of each word given each class, applying Laplacian smoothing.
Step 4: Get the lambda for each word, which is the log of the ratio of the word’s probabilities in the two classes.
Step 5: Get the estimation of log prior, which is the log of the ratio of # of positive documents / # of negative documents. For a balanced dataset, this log prior = 0. For an unbalanced dataset, this term will become important.
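The training steps above can be sketched as one function. This is a minimal sketch under stated assumptions: the four tweets are a toy corpus, lowercasing and whitespace splitting stand in for real preprocessing, labels are 1 for positive and 0 for negative, and `train_naive_bayes` is a hypothetical name. It also assumes both classes have at least one document.

```python
import math
from collections import Counter

def train_naive_bayes(tweets, labels):
    """Return (logprior, lambdas) from raw tweets and 0/1 labels."""
    pos_counts, neg_counts = Counter(), Counter()
    for tweet, label in zip(tweets, labels):
        tokens = tweet.lower().split()            # step 1 (simplified)
        target = pos_counts if label == 1 else neg_counts
        target.update(tokens)                     # step 2: frequency table

    vocabulary = set(pos_counts) | set(neg_counts)
    v = len(vocabulary)
    n_pos = sum(pos_counts.values())
    n_neg = sum(neg_counts.values())

    lambdas = {}
    for word in vocabulary:
        # Step 3: conditional probabilities with Laplacian smoothing.
        p_w_pos = (pos_counts[word] + 1) / (n_pos + v)
        p_w_neg = (neg_counts[word] + 1) / (n_neg + v)
        # Step 4: lambda is the log of the ratio.
        lambdas[word] = math.log(p_w_pos / p_w_neg)

    # Step 5: log prior from the number of documents in each class.
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = len(labels) - d_pos
    logprior = math.log(d_pos / d_neg)
    return logprior, lambdas

# Toy annotated corpus (step 0), assumed for illustration.
tweets = ["I am happy because I am learning NLP", "I am happy not sad",
          "I am sad I am not learning NLP", "I am sad not happy"]
labels = [1, 1, 0, 0]

logprior, lambdas = train_naive_bayes(tweets, labels)
print(logprior)
print(sorted(lambdas.items(), key=lambda item: item[1]))
```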
Predict using Naive Bayes
For any given new document, we calculate the log likelihood and add the log prior to get a score. Please note that words not found in the vocabulary are considered neutral and don’t add any weight to the calculation. We simply don’t add any number for those words.
- If the score > 0, then it’s positive
- If the score = 0, then it’s neutral
- If the score < 0, then it’s negative.
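Prediction then reduces to a lookup and a sum. In this sketch the lambda values and the log prior are assumed for illustration, and `score_tweet` and `classify` are hypothetical names.

```python
# Hypothetical trained parameters (illustrative values only).
logprior = 0.0
lambdas = {"happy": 0.405, "sad": -0.405, "because": 0.693,
           "i": 0.0, "am": 0.0, "learning": 0.0}

def score_tweet(tokens, logprior, lambdas):
    """Log prior plus the sum of the lambdas of the known words."""
    # Words missing from lambdas contribute 0, i.e. they are neutral.
    return logprior + sum(lambdas.get(token, 0.0) for token in tokens)

def classify(score):
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

tweet = "i am happy because i am learning".split()
score = score_tweet(tweet, logprior, lambdas)
print(score, classify(score))

# An out-of-vocabulary word does not change the score.
print(score_tweet(tweet + ["pizza"], logprior, lambdas) == score)
```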
Testing Naive Bayes
After predicting the sentiment for each document in the validation set, we can compare the prediction with the true value to calculate the accuracy metric for this Naive Bayes model.
- The accuracy metric = number of predictions that match the true label / number of documents in the validation set
- Predict the sentiments for unseen data in the validation set Xval, Yval using lambda and logprior.
- Compare the prediction with the true label Yval by calculating the accuracy metric.
- For words that do not appear in lambda(w), we consider them as neutral words and they don’t add any values to the log likelihood calculation.
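The accuracy metric itself is a one-liner. The predictions and labels below are made up for illustration, and `accuracy` is a hypothetical helper.

```python
def accuracy(predictions, true_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, y in zip(predictions, true_labels) if p == y)
    return correct / len(true_labels)

# Hypothetical validation results: 1 = positive, 0 = negative.
y_pred = [1, 0, 1, 1]
y_val = [1, 0, 0, 1]

print(accuracy(y_pred, y_val))   # 3 of the 4 predictions match
```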
Naive Bayes is a simple and fast model because it doesn’t require any custom parameters. It’s called Naive because of the simple assumptions it makes about the data.
Assumption 1: Independence
The first assumption is the independence between the predictors or features associated with each class. Naive Bayes assumes that the words in a piece of text are independent of one another, but this is typically not the case.
For example: It is sunny and hot in the Sahara desert.
- Here the words “sunny” and “hot” often appear together, so they are likely not independent.
- Also, “sunny” and “hot” often refer to a beach or a desert. So the words “sunny”, “hot”, and “desert” are likely not independent of each other in the real world.
- But Naive Bayes assumes they are independent.
Potential Issue: As a result, the model can under- or overestimate the conditional probabilities of individual words by class.
Assumption 2: Relative frequencies in the corpus, especially affecting the validation set (real-world data).
Another issue of Naive Bayes is that it relies on the distribution of the training dataset.
- A good data set will contain the same proportion of positive and negative documents as a random sample would. However, most of the available annotated corpora are artificially balanced.
- In the real world, positive documents occur more often than negative documents, for example non-spam appears more than spam email.
- Assuming the reality behaves as our “balanced” training dataset would result in a very optimistic or very pessimistic model.
- The assumption of independence is very difficult to guarantee in the real world. But despite that, the model works pretty well in certain situations.
- The relative frequency of classes affects the model. For example, most of the available annotated corpora are artificially balanced. However, in the real world, the data could be much noisier.
- When we use Naive Bayes to predict the sentiments of a document, what we are actually doing is estimating the probability for each class by using the joint probability of the words in classes.
- The Naive Bayes formula is just the ratio between these two probabilities, the products of the priors and the likelihoods.
- We can use this ratio between conditional probabilities for much more than sentiment analysis.
- Author identification. For example, let’s say we have a large corpus with documents written by different authors. We can train a Naive Bayes model to identify whether a document is written by one or the other author.
- Spam filtering. Using information from senders, subject and contents, we can decide whether an email is spam or not.
- Information Retrieval. One of the earliest uses of Naive Bayes was filtering between relevant and irrelevant documents in a database. Given the set of keywords in a query, we only need to calculate the likelihood of the documents given the query. We don’t know beforehand what a relevant or an irrelevant document looks like. So we can compute the likelihood for each document in the dataset and then sort the documents based on their likelihoods. We can choose to keep the first M results or the ones that have a likelihood larger than a certain threshold.
- Word disambiguation. For example, we can identify whether the word “bank” refers to a river or to money in a document by calculating the likelihood of the word “bank” in the different classes.
Motivation: No matter what NLP method we use, we will find ourselves faced with an error, for example, a misclassified sentence. Here we will go over some possible errors in the model predictions that can be caused by these issues.
1. Preprocessing as a source of errors: Double check the actual text vs. the preprocessed text.
- Removing punctuation can change the meaning. For example: “My beloved grandmother :(”. After we remove the punctuation, it is easy to classify this sentence as positive, but with the sad face it is in fact very negative.
- Removing words. For example: “This is not good, because your attitude is not even close to being nice”. After we remove some stop words, the sentence can be preprocessed as [good, attitude, close, nice] and be easily classified as positive. But in fact it is very negative.
2. Word order can affect a sentence’s sentiment. Consider the following 2 sentences: the position of the word “not” is very important for classifying them. The first sentence is very positive, and the second is very negative. A Naive Bayes model might classify them as having the same sentiment, because it ignores word order.
3. Adversarial attacks.
Adversarial attacks include sarcasm, irony and euphemisms. Sometimes, after we preprocess a document, we will get a list of mostly negative words, but in fact the real sentence might not be negative. For example:
- Actual sentence: It is a ridiculously powerful movie. The plot was gripping and I cried right through until the ending!
- Processed sentence: [ridicul, power, movi, plot, grip, cry, end]
As we can see, after preprocessing, the Naive Bayes model would classify this sentence as negative as it contains many negative words. However, this sentence is positive as it expresses how much the writer enjoys the movie.
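To make the preprocessing errors from point 1 concrete, here is a small sketch of how punctuation and stop-word removal can erase the sentiment signal. The stop-word list and the `preprocess` helper are assumptions made up for this example; a real pipeline would use a proper tokenizer and stop-word list.

```python
import string

# Hypothetical stop-word list, chosen for this example only.
STOP_WORDS = {"this", "is", "not", "because", "your", "even", "to", "being"}

def preprocess(text):
    """Strip punctuation, lowercase, split on whitespace, drop stop words."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.lower().split() if w not in STOP_WORDS]

print(preprocess("This is not good, because your attitude is not even close to being nice"))
# The sad face ":(" is made of punctuation, so it disappears entirely.
print(preprocess("My beloved grandmother :("))
```

Comparing the raw text with the processed tokens like this is a quick way to spot cases where the negation or the emoticon that carried the sentiment was removed.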
That’s it! We’ve covered how to use Naive Bayes to predict positive and negative sentiment for tweets. As always, please feel free to check out my final project to see how to put all of this together in code (Click Link)
Hope you enjoy this reading! 🙂