Finding a good routine for data preprocessing is the most crucial step in sentiment analysis. For me, a few questions led the way: Which text elements contain information? Which do not? Do some similar text elements have similar sentiments that can be unified?
Noise reduction and unification of text elements
We begin by reducing noise, replacing and removing text elements of the tweets using regular expressions. Since numbers per se do not carry sentiment, we remove all numbers that are not part of an alphanumeric string. One challenge is to remove only actual numbers and not parts of emoticons, which sometimes also include numbers. This is where regular expressions come in handy.
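A pattern along these lines does the trick (the exact regex shown here is illustrative, not the article's original code):

```python
import re

def remove_numbers(text):
    # Remove tokens that consist of digits only. The lookarounds require
    # whitespace (or a string boundary) on both sides, so digits inside
    # alphanumeric strings ("mp3") or emoticons ("<3") are preserved.
    return re.sub(r"(?<!\S)\d+(?!\S)", "", text)

print(remove_numbers("ich habe 3 mp3 Dateien <3"))
```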
When people use consecutive periods ("..."), the number of periods varies a lot. We assume that the sentiment is the same whether someone uses e.g. two periods ("..") or four periods ("...."). We therefore unify every sequence of two or more consecutive periods into a sequence of exactly three periods ("...").
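This unification can be done with a single substitution (an illustrative sketch):

```python
import re

def unify_periods(text):
    # Collapse any run of two or more periods into exactly three.
    return re.sub(r"\.{2,}", "...", text)

print(unify_periods("naja.."))
```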
Additionally, we apply some Twitter-specific preprocessing steps. We remove "RT" (for retweet) at the beginning of some tweets. We assume that it is not relevant for the tweet's sentiment whether it was a retweet or not. Also, we remove "#" in front of hashtags to be able to process the actual word of the hashtag.
Another step is unifying @-mentions and hyperlinks by replacing these entities with special tokens that represent their semantic classes. This means that all hyperlinks are treated as if they were the same. The same holds for all mentions of other users.
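Together with the "RT" and hashtag handling from above, this could look as follows (the token names LINKTOKEN and MENTIONTOKEN are placeholders of my choosing):

```python
import re

def replace_twitter_entities(text):
    # Drop a leading "RT" retweet marker.
    text = re.sub(r"^RT\s+", "", text)
    # Replace hyperlinks and @-mentions with tokens for their semantic class.
    text = re.sub(r"https?://\S+", "LINKTOKEN", text)
    text = re.sub(r"@\w+", "MENTIONTOKEN", text)
    # Strip '#' so the hashtag word itself is kept.
    text = text.replace("#", "")
    return text

print(replace_twitter_entities("RT @user: tolles #Wetter https://t.co/abc"))
```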
Splitting the tweets
To break each tweet into parts, we need some kind of tokenizer, a routine to split our tweets. Fortunately, the NLTK library provides a tokenizer built explicitly for tweets. Its advantage is that it recognizes Twitter-specific text elements: for example, the individual punctuation marks of emoticons are not split apart. This is very important for us since emoticons carry sentiment. Hashtags are recognized as well.
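A quick look at NLTK's TweetTokenizer in action:

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize("Tolles Wetter heute :-) #Sommer")
# The emoticon ":-)" and the hashtag "#Sommer" survive as single tokens.
print(tokens)
```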
Since many tweets do not use upper and lower case consistently, we convert all words to lower case to unify the spelling.
Bringing words to their stem
Another tool that we need is a stemmer. Stemmers try to reduce each word to its word stem. They work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in inflected words. The goal is to match not only the exact expression but also the other possible forms of a word. For example, we want the two words "great" and "greater" to be treated as one word, assuming they carry the same sentiment. Since stemming is a highly language-specific task, we use a German-specific stemmer. I compared the performance of two popular German stemmers available in NLTK, GermanStemmer and Cistem, of which the latter provided the better results.
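Using NLTK's Cistem looks like this (the example words are mine, mirroring the "great"/"greater" example above):

```python
from nltk.stem import Cistem

stemmer = Cistem()
# The base form and the comparative are reduced to the same stem.
stem_base = stemmer.stem("großartig")
stem_comp = stemmer.stem("großartiger")
print(stem_base, stem_comp)
```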
Removing meaningless words
Not all words in a sentence are needed to get its meaning or sentiment. Stopwords are words that do not add much meaning to a sentence; they can safely be ignored without losing the sentence's sentiment. The most common stopwords in English include: the, is, at, which, and on. Since stopwords differ between languages, we make use of the German stopwords provided by NLTK.
Make it accessible for machine learning classifiers
The final preprocessing step for the tweets is to convert the text elements into numbers, making them accessible to machine learning classifiers. To achieve this, we use scikit-learn's TfidfVectorizer. This is equivalent to using CountVectorizer followed by TfidfTransformer: the former converts a collection of text documents to a matrix of token counts, while the latter transforms a count matrix to a normalized TF-IDF representation. Now we are almost done.
Prepare the target variables
One preprocessing step applies not to the tweets but to the labels: we replace the sentiment labels with integers following the pattern positive: 2, neutral: 1, and negative: 0.
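As a sketch:

```python
# Map the sentiment labels to integers: positive -> 2, neutral -> 1, negative -> 0.
label_map = {"positive": 2, "neutral": 1, "negative": 0}

labels = ["neutral", "positive", "negative"]
y = [label_map[label] for label in labels]
print(y)  # [1, 2, 0]
```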
Lumping everything together
Now we're done and ready to train our machine learning model. This is a summary of the preprocessing steps:
- Convert all strings to lower case
- Replace hyperlinks and mentions with a token representing each class of text element
- Unify consecutive periods
- Remove numbers that are not part of an alphanumeric string
- Remove "RT" for retweets and "#" from hashtags
- Tokenization with TweetTokenizer
- Remove stopwords
- Stemming of words using (German specific) CiStem
- Using TfidfVectorizer for vocabulary and vector representation of tweets
- Replace the sentiment labels with integers
This is our preprocessing routine from a single tweet to a list of text elements:
And this baby will transform our lists of text elements to a matrix of TF-IDF features:
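A sketch using scikit-learn's TfidfVectorizer; since our tweets are already tokenized, a pass-through analyzer is one way to feed them in (an assumption, not necessarily the original setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Our tweets are already lists of tokens, so we hand TfidfVectorizer a
# pass-through analyzer instead of letting it tokenize raw strings.
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)

token_lists = [["tolles", "wetter"], ["schlechtes", "wetter", "..."]]
X = vectorizer.fit_transform(token_lists)
print(X.shape)  # one row per tweet, one column per vocabulary entry
```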
Now we’re set and can train our model. I used sklearn’s SVM with standard parameters:
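A sketch with a tiny made-up corpus (the real model was trained on labeled German tweets, which are not reproduced here):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Tiny illustrative corpus with labels 2 = positive, 1 = neutral, 0 = negative.
texts = ["tolles wetter", "schlechtes wetter", "wetter heute",
         "tolles spiel", "schlechtes spiel", "spiel heute"]
y = [2, 0, 1, 2, 0, 1]

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

clf = SVC()  # standard parameters
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(predictions)
```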
It was trained on a fixed split of the corpus (10%) and then tested on the remaining texts.
How to measure success
For evaluating our model, we use the F1 score (also F-score or F-measure), a measure of a test's accuracy that is based on the precision and the recall of the test. The precision is the number of correctly identified positive results divided by the number of all results identified as positive, i.e. including the false positives. The recall is the number of correctly identified positive results divided by the number of all samples that should have been identified as positive. Having calculated these two metrics, the F1 score is the harmonic mean of the two: F1 = 2 · (precision · recall) / (precision + recall). The best value of F1 is 1, which means perfect precision and recall. Since we have three sentiment labels (positive, neutral, negative), we can calculate an F1 score for each of the three sentiments. Common in the field of sentiment analysis is the F1 score macro-averaged from the F1 score for positive sentiments and the F1 score for negative sentiments. Additionally, we will use the actual accuracy (correct predictions / all predictions) and a confusion matrix.
How good are our predictions?
As can be observed in the table, our model is best at predicting neutral sentiments, second-best at predicting positive sentiments, and worst at predicting negative sentiments. These results match the distribution of the training data: tweets with neutral sentiment were most represented, followed by positive and negative tweets. This suggests that a more balanced set, or at least more data especially for negative tweets, would have benefited the performance. The total accuracy came out at 77%, which means that for 77 out of 100 tweets the sentiment was correctly determined.
To get a better insight into the model, we take a look at the confusion matrix. In a confusion matrix (following scikit-learn's convention), each row of the matrix represents the instances of an actual class while each column represents the instances of a predicted class. The name stems from the fact that it makes it easy to see whether the model commonly mislabels one class as another.
We see that many negative tweets got classified as neutral tweets. The prediction of neutral tweets, however, works pretty well.
This code can be used to get the F1 score and the confusion matrix:
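A sketch with illustrative true and predicted labels (0 = negative, 1 = neutral, 2 = positive), not the actual evaluation data:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Illustrative labels; 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [1, 0, 1, 1, 1, 2, 2, 1]

# Macro-average over the negative and positive classes only,
# as is common in sentiment analysis.
f1_pos_neg = f1_score(y_true, y_pred, labels=[0, 2], average="macro")
accuracy = accuracy_score(y_true, y_pred)
# In scikit-learn's convention, rows are actual classes, columns predicted.
cm = confusion_matrix(y_true, y_pred)
print(f1_pos_neg, accuracy)
print(cm)
```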
In this article, you learned how to tackle sentiment analysis of German tweets by preprocessing the data and using a machine learning classifier. This can easily be transferred to other languages, provided you have a good amount of labeled tweets. The model we created and trained makes good predictions even though the preprocessing of the tweets is not too complicated. If you want to improve the preprocessing further, you could, for example, introduce a spell checker. If you want to tackle sentiment analysis with deep learning, transformer models are a promising approach worth looking at, especially pretrained transformer models like BERT or its German adaptations.