Natural Language Processing
Using the “Valence Aware Dictionary and sEntiment Reasoner” on the IMDB Reviews Dataset for Rule-based Sentiment Analysis
For a long time, I have been writing about statistical NLP topics and sharing tutorials. The sub-field of statistical NLP is responsible for several impressive advancements in natural language processing, and it has the highest potential among competing approaches. In some cases, however, classical rule-based natural language processing can still be the right tool.
When a team has deep pockets, plenty of talented researchers, and a general-purpose problem, statistical NLP is usually the preferred way to tackle it. But in the following cases, a rule-based approach might be fruitful:
1 — Domain-Specific Problem:
We have great pre-trained models such as GPT-3, BERT, and ELMo, which do wonders on generic language problems. However, when we try to apply them to domain-specific problems such as financial news sentiment analysis or legal text classification, they may not deliver the specificity these tasks require. Therefore, we either have to fine-tune these models with additional labeled data or rely on rule-based models.
2 — Lack of Labeled Data:
Even though we might want to fine-tune a model, it may not always be possible. If you work on a small team or lack the funds to hire annotators via platforms such as Amazon Mechanical Turk, you cannot generate the labeled data needed to fine-tune a pre-trained model, let alone build your own deep learning model. Finally, it may simply not be possible to collect a meaningful amount of data to train a deep learning model; statistical NLP models are, after all, very data-hungry.
3 — Limited Available Funding for Training:
Even if you have some labeled domain-specific data, training a dedicated model has its own cost. Not only would you need a group of star data scientists, but you would also need distributed servers to train your model, and your pockets may not be that deep.
If you face any of these issues, your best bet might be rule-based NLP, and its accuracy is not as bad as you might think. In this post, we will build a simple lexicon-based sentiment classifier without much tuning, and we will achieve acceptable accuracy that could be improved even further.
Before starting, though, let’s cover some basics:
Lexicon sounds like a fancy technical term, but it means a dictionary, usually in a particular domain. In other words:
A lexicon is the vocabulary of a person, language, or branch of knowledge.
In a rule-based NLP study for sentiment analysis, we need a lexicon that serves as a reference manual to measure the sentiment of a chunk of text (e.g., word, phrase, sentence, paragraph, full text). Lexicon-based sentiment analysis can be as simple as positive-labeled words minus negative-labeled words to see if a text has a positive sentiment. It can also be very complex with negation rules, distance calculations, added-variance, and several additional rules. One of the main differences between rule-based NLP and statistical NLP is that in rule-based NLP, the researcher is completely free to add any rule they deem useful. Therefore, in rule-based NLP, what we usually see is that highly trained experts develop theory-based rules in a particular domain and apply them to a particular problem in this particular domain.
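In its simplest form, the "positive words minus negative words" idea described above can be sketched in a few lines. The tiny word lists below are purely illustrative placeholders, not a real lexicon:

```python
# Minimal lexicon-based scorer: positive word count minus negative word count.
# These word sets are illustrative only; a real lexicon has thousands of entries.
POSITIVE = {"good", "great", "brilliant", "amazing", "enjoyable"}
NEGATIVE = {"bad", "boring", "awful", "terrible", "dull"}

def lexicon_score(text):
    """Return a crude sentiment score: > 0 positive, < 0 negative."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(lexicon_score("a brilliant and amazing film"))  # 2
print(lexicon_score("a boring and terrible plot"))    # -2
```

A production lexicon approach would layer negation handling, intensifiers, and punctuation rules on top of this core counting idea, which is exactly what VADER does.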
One of the most popular rule-based sentiment analysis models is VADER. VADER, or Valence Aware Dictionary and sEntiment Reasoner, is a lexicon and rule-based sentiment analysis tool specifically attuned to sentiments expressed in social media.
VADER is like the GPT-3 of Rule-Based NLP Models.
Since it is tuned for social media content, it performs best on the kind of text you find on social media. However, it still offers acceptable F1 scores on other test sets and performs comparably to complex statistical models such as Support Vector Machines.
Note that there are several alternative lexicons you can use for your project, such as Harvard's General Inquirer, Loughran-McDonald, and Hu & Liu. In this tutorial, we will adopt VADER's lexicon along with its methodology.
Now that you have a basic understanding of rule-based NLP models, we can proceed with our tutorial. This tutorial will approach a classic sentiment analysis problem from a rule-based NLP perspective: A Lexicon-based sentiment analysis on the IMDB Reviews Dataset.
Let’s start:
The IMDB Reviews Dataset is a large movie review dataset collected and prepared by Andrew L. Maas from the popular movie rating service IMDb. It is used for binary sentiment classification: whether a review is positive or negative. It contains 25,000 movie reviews for training and 25,000 for testing, and all 50,000 reviews are labeled data that may be used for supervised deep learning. In addition, there are 50,000 unlabeled reviews that we will not use here. In this case study, we will only use the training dataset.
Loading and Processing the Dataset
We will start by loading the IMDB dataset using Keras's datasets API. Keras provides the reviews in an encoded (integer-index) form, but luckily we can also load the word index to decode them back to the original text. The following lines load the encoded reviews along with the word index, and then create a reverse index for decoding:
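A sketch of this loading step, assuming TensorFlow's bundled Keras (the first call downloads the dataset to the local Keras cache):

```python
from tensorflow.keras.datasets import imdb

# Encoded reviews: each review is a list of integer word indices.
(x_train, y_train), (x_test, y_test) = imdb.load_data()

# word -> integer index, as provided by Keras
word_index = imdb.get_word_index()

# integer index -> word, for turning encoded reviews back into text
reverse_index = {value: key for key, value in word_index.items()}
```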
Before decoding the entire dataset, let’s see the operation with an example:
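One way to decode a single review, written here as a self-contained sketch. The `- 3` offset matches Keras's default `index_from=3`, which reserves indices 0-2 for padding, start-of-sequence, and unknown tokens (an assumption based on `imdb.load_data`'s defaults):

```python
from tensorflow.keras.datasets import imdb

(x_train, y_train), _ = imdb.load_data()
word_index = imdb.get_word_index()
reverse_index = {value: key for key, value in word_index.items()}

# Skip the leading start-of-sequence token, then shift each index by 3
# to line up with word_index; unknown indices become "?".
decoded = " ".join(reverse_index.get(i - 3, "?") for i in x_train[0][1:])
print(decoded[:80])
```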
Output: this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert redford's is an amazing actor and now the same being director norman's father came...
As you can see, we can decode our encoded reviews using the reverse index. And if we can decode one review, decoding all of them only takes a for loop. With the code below, we create a nested list in which we place each sentiment label alongside the decoded review text. We also need error handling because of an encoding quirk in the dataset (apparently the Keras team encoded one of the words incorrectly :/), which the code below handles as well:
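A self-contained sketch of that loop. Where the text above mentions explicit error handling, this version absorbs the same quirk by using `dict.get` with a placeholder token, so no mis-encoded index can raise a `KeyError`:

```python
from tensorflow.keras.datasets import imdb

(x_train, y_train), _ = imdb.load_data()
word_index = imdb.get_word_index()
reverse_index = {value: key for key, value in word_index.items()}

# Build a nested list of [sentiment_label, decoded_review_text] pairs.
review_list = []
for label, encoded in zip(y_train, x_train):
    # .get with a placeholder quietly handles any index missing
    # from the word index, including the dataset's mis-encoded token.
    text = " ".join(reverse_index.get(i - 3, "?") for i in encoded[1:])
    review_list.append([label, text])

print(len(review_list))  # 25000
```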
Finally, we will create a pandas DataFrame from the nested list we created above:
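The DataFrame construction itself is a one-liner. The short hand-written list below just stands in for the full 25,000-entry nested list built above, and the column names are illustrative:

```python
import pandas as pd

# Stand-in for the full [label, review] nested list built in the loop above.
review_list = [
    [1, "this film was just brilliant casting location scenery story"],
    [0, "a dull plot with awful pacing"],
]

df = pd.DataFrame(review_list, columns=["label", "review"])
print(df.head())
```

From here, the `review` column can be fed to the lexicon-based scorer and the `label` column used to measure its accuracy.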