ROUGE is actually a set of metrics, rather than just one. We will cover the main ones that are most likely to be used, starting with ROUGE-N.
ROUGE-N
ROUGE-N measures the number of matching ‘n-grams’ between our model-generated text and a ‘reference’.
An n-gram is simply a grouping of tokens/words. A unigram (1-gram) consists of a single word. A bigram (2-gram) consists of two consecutive words, for example “the fox”.
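To make this concrete, here is a minimal sketch of n-gram extraction in Python, assuming simple whitespace tokenization (real ROUGE implementations often add more careful tokenization and optional stemming); the ngrams helper is just an illustrative name:

```python
def ngrams(text, n):
    # Split on whitespace and slide a window of size n over the tokens.
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the quick brown fox", 1))  # [('the',), ('quick',), ('brown',), ('fox',)]
print(ngrams("the quick brown fox", 2))  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```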
The reference is a human-made best-case output, so for automated summarization it would be a human-written summary of our input text. For machine translation, it would be a professional translation of our input text.
With ROUGE-N, the N represents the n-gram that we are using. For ROUGE-1 we would be measuring the match-rate of unigrams between our model output and reference.
ROUGE-2 and ROUGE-3 would use bigrams and trigrams respectively.
Once we have decided which N to use, we then choose whether we’d like to calculate the ROUGE recall, precision, or F1 score.
Recall
The recall counts the number of overlapping n-grams found in both the model output and the reference, then divides this number by the total number of n-grams in the reference. It looks like this:

recall = (number of overlapping n-grams) / (total number of n-grams in the reference)
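As an illustrative sketch (reusing the ngrams helper above, with two made-up sentences), ROUGE-1 recall could be computed like this:

```python
from collections import Counter

def rouge_n_recall(model_output, reference, n=1):
    model_counts = Counter(ngrams(model_output, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip the overlap so a repeated n-gram in the output cannot be counted
    # more times than it appears in the reference.
    overlap = sum(min(count, model_counts[gram]) for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "the quick brown fox jumps over the lazy dog"
model_output = "the fast brown fox jumped over the lazy dog"
print(rouge_n_recall(model_output, reference, n=1))  # 7 of the 9 reference unigrams match, so ≈ 0.78
```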
This is great for ensuring our model captures all of the information contained in the reference, but it is not so good at stopping the model from simply pushing out a huge number of words to game the recall score.
Precision
To avoid this we use the precision metric, which is calculated in almost exactly the same way, except that rather than dividing by the reference n-gram count, we divide by the model output n-gram count:

precision = (number of overlapping n-grams) / (total number of n-grams in the model output)
So if we apply this to our previous example, we get a precision score of just 43%.
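Sketching this with the same made-up sentences as before, a longer, padded output keeps most of its recall but is punished on precision:

```python
def rouge_n_precision(model_output, reference, n=1):
    model_counts = Counter(ngrams(model_output, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in model_counts.items())
    # Divide by the model output's n-gram count rather than the reference's.
    return overlap / sum(model_counts.values())

padded_output = "the fast brown fox jumped over the lazy dog and then did many other entirely unrelated things"
print(rouge_n_recall(padded_output, reference, n=1))     # recall stays high (≈ 0.78)
print(rouge_n_precision(padded_output, reference, n=1))  # precision drops (≈ 0.41)
```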
F1-Score
Now that we have both the recall and precision values, we can use them to calculate our ROUGE F1 score, which is simply their harmonic mean:

F1 = 2 * (precision * recall) / (precision + recall)
Let’s apply that again to our previous example.
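As a sketch, sticking with the made-up sentences and helper functions from above:

```python
def rouge_n_f1(model_output, reference, n=1):
    recall = rouge_n_recall(model_output, reference, n)
    precision = rouge_n_precision(model_output, reference, n)
    if recall + precision == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1(model_output, reference, n=1))   # ≈ 0.78 for the close paraphrase
print(rouge_n_f1(padded_output, reference, n=1))  # ≈ 0.54, the padded output is penalized
```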
That gives us a more reliable measure of model performance, one that rewards the model for capturing as much of the reference as possible (recall) while penalizing it for outputting irrelevant words (precision).
ROUGE-L
ROUGE-L measures the longest common subsequence (LCS) between our model output and the reference. All this means is that we count the longest sequence of tokens that appears in both, in the same order, though not necessarily contiguously.
The idea here is that a longer shared sequence indicates greater similarity between the two texts. We can apply our recall and precision calculations just like before, but this time we replace the n-gram match count with the length of the LCS:
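Here is a rough sketch using the textbook dynamic-programming LCS; note that the official ROUGE-L uses a weighted F-measure, whereas plain F1 is used here for simplicity:

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over two token lists.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(model_output, reference):
    model_tokens = model_output.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(model_tokens, ref_tokens)
    recall = lcs / len(ref_tokens)
    precision = lcs / len(model_tokens)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1

print(rouge_l("the fast brown fox jumped over the lazy dog",
              "the quick brown fox jumps over the lazy dog"))  # LCS of 7 tokens, all three scores ≈ 0.78
```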
ROUGE-S
The final ROUGE metric we will look at is ROUGE-S, the skip-bigram co-occurrence metric.
Now, this metric seems to be much less popular than the ROUGE-N and ROUGE-L metrics covered already, but it is worth being aware of what it does.
Using the skip-gram metric allows us to match words from the reference text that appear in the same order in the model output, even if they are separated by one or more other words.
So, if we took the bigram “the fox”, our original ROUGE-2 metric would only match it if that exact sequence were found in the model output. If the model instead outputs “the brown fox”, no match would be found.
ROUGE-S allows us to add a degree of leniency to our n-gram matching. For our bigram example, “the fox” would now match “the brown fox” when using a skip-bigram measure:
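As a rough sketch, skip-bigrams can be generated as every in-order pair of words, optionally capped at a maximum gap, and then scored just as before (the function names here are illustrative, not an official API):

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(text, max_skip=None):
    # Every ordered pair of words that preserves sentence order,
    # optionally limited to a maximum number of skipped words between them.
    tokens = text.lower().split()
    pairs = [(tokens[i], tokens[j])
             for i, j in combinations(range(len(tokens)), 2)
             if max_skip is None or j - i - 1 <= max_skip]
    return Counter(pairs)

def rouge_s_recall(model_output, reference, max_skip=None):
    model_pairs = skip_bigrams(model_output, max_skip)
    ref_pairs = skip_bigrams(reference, max_skip)
    overlap = sum(min(count, model_pairs[pair]) for pair, count in ref_pairs.items())
    return overlap / sum(ref_pairs.values())

# ("the", "fox") is a skip-bigram of "the brown fox", so it now matches the reference bigram.
print(skip_bigrams("the brown fox"))
```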
After calculating our recall and precision, we can calculate the F1 score too just as we did before.
Cons
ROUGE is a great evaluation metric but it comes with some drawbacks. In particular, ROUGE does not account for different words that have the same meaning, as it measures syntactic matches rather than semantics.
So, if we had two sequences with the same meaning, but which used different words to express that meaning, they could be assigned a low ROUGE score.
This can be offset slightly by using several references and taking the average score, but this will not solve the problem entirely.
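A minimal sketch of that idea, averaging one of the score functions from earlier over several references (some setups take the maximum over references instead):

```python
def multi_reference_rouge(model_output, references, score_fn=rouge_n_recall):
    # Score the output against each reference and average the results.
    return sum(score_fn(model_output, ref) for ref in references) / len(references)
```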
Nonetheless, it’s a good metric for assessing both machine translation and automatic summarization tasks and is very popular for both.