ROUGE is actually a set of metrics, rather than just one. We will cover the main ones that are most likely to be used, starting with ROUGE-N.
ROUGE-N
ROUGE-N measures the number of matching ‘n-grams’ between our model-generated text and a ‘reference’.
An n-gram is simply a grouping of tokens/words. A unigram (1-gram) consists of a single word. A bigram (2-gram) consists of two consecutive words, for example “the fox”.
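To make this concrete, here is a minimal sketch of n-gram extraction in Python, assuming simple whitespace tokenization (real ROUGE implementations often add more careful tokenization and optional stemming); the ngrams helper is just an illustrative name:

```python
def ngrams(text, n):
    # Split on whitespace and slide a window of size n over the tokens.
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("the quick brown fox", 1))  # [('the',), ('quick',), ('brown',), ('fox',)]
print(ngrams("the quick brown fox", 2))  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```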
The reference is a human-made best-case output, so for automated summarization it would be a human-written summary of our input text. For machine translation, it would be a professional translation of our input text.
With ROUGE-N, the N represents the n-gram that we are using. For ROUGE-1 we would be measuring the match-rate of unigrams between our model output and reference.
ROUGE-2 and ROUGE-3 would use bigrams and trigrams respectively.
Once we have decided which N to use, we then choose whether we’d like to calculate the ROUGE recall, precision, or F1 score.
Recall
The recall counts the number of overlapping n-grams found in both the model output and the reference, then divides this number by the total number of n-grams in the reference. It looks like this:

recall = (number of overlapping n-grams) / (total number of n-grams in the reference)
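As an illustrative sketch (reusing the ngrams helper above, with two made-up sentences), ROUGE-1 recall could be computed like this:

```python
from collections import Counter

def rouge_n_recall(model_output, reference, n=1):
    model_counts = Counter(ngrams(model_output, n))
    ref_counts = Counter(ngrams(reference, n))
    # Clip the overlap so a repeated n-gram in the output cannot be counted
    # more times than it appears in the reference.
    overlap = sum(min(count, model_counts[gram]) for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "the quick brown fox jumps over the lazy dog"
model_output = "the fast brown fox jumped over the lazy dog"
print(rouge_n_recall(model_output, reference, n=1))  # 7 of the 9 reference unigrams match, so ≈ 0.78
```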
This is great for ensuring our model captures all of the information contained in the reference, but it is not so good at stopping the model from simply pushing out a huge number of words to game the recall score.
Precision
To avoid this we use the precision metric, which is calculated in almost exactly the same way, except that rather than dividing by the reference n-gram count, we divide by the model output n-gram count:

precision = (number of overlapping n-grams) / (total number of n-grams in the model output)
So if we apply this to our previous example, we get a precision score of just 43%.
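Sketching this with the same made-up sentences as before, a longer, padded output keeps most of its recall but is punished on precision:

```python
def rouge_n_precision(model_output, reference, n=1):
    model_counts = Counter(ngrams(model_output, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in model_counts.items())
    # Divide by the model output's n-gram count rather than the reference's.
    return overlap / sum(model_counts.values())

padded_output = "the fast brown fox jumped over the lazy dog and then did many other entirely unrelated things"
print(rouge_n_recall(padded_output, reference, n=1))     # recall stays high (≈ 0.78)
print(rouge_n_precision(padded_output, reference, n=1))  # precision drops (≈ 0.41)
```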
F1-Score
Now that we have both the recall and precision values, we can use them to calculate our ROUGE F1 score, which is simply their harmonic mean:

F1 = 2 * (precision * recall) / (precision + recall)
Let’s apply that again to our previous example.
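As a sketch, sticking with the made-up sentences and helper functions from above:

```python
def rouge_n_f1(model_output, reference, n=1):
    recall = rouge_n_recall(model_output, reference, n)
    precision = rouge_n_precision(model_output, reference, n)
    if recall + precision == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

print(rouge_n_f1(model_output, reference, n=1))   # ≈ 0.78 for the close paraphrase
print(rouge_n_f1(padded_output, reference, n=1))  # ≈ 0.54, the padded output is penalized
```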
That gives us a more reliable measure of model performance, one that rewards the model for capturing as much of the reference as possible (recall) while penalizing it for outputting irrelevant words (precision).
ROUGE-L
ROUGE-L measures the longest common subsequence (LCS) between our model output and the reference. All this means is that we count the longest sequence of tokens that appears in both, in the same order, though not necessarily contiguously.
The idea here is that a longer shared sequence indicates greater similarity between the two texts. We can apply our recall and precision calculations just like before, but this time we replace the n-gram match count with the length of the LCS:
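Here is a rough sketch using the textbook dynamic-programming LCS; note that the official ROUGE-L uses a weighted F-measure, whereas plain F1 is used here for simplicity:

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over two token lists.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(model_output, reference):
    model_tokens = model_output.lower().split()
    ref_tokens = reference.lower().split()
    lcs = lcs_length(model_tokens, ref_tokens)
    recall = lcs / len(ref_tokens)
    precision = lcs / len(model_tokens)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return recall, precision, f1

print(rouge_l("the fast brown fox jumped over the lazy dog",
              "the quick brown fox jumps over the lazy dog"))  # LCS of 7 tokens, all three scores ≈ 0.78
```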
ROUGE-S
The final ROUGE metric we will look at is ROUGE-S, the skip-bigram co-occurrence metric.
Now, this metric seems to be much less popular than the ROUGE-N and ROUGE-L metrics covered already, but it is worth being aware of what it does.
Using the skip-gram metric allows us to match words from the reference text that appear in the same order in the model output, even if they are separated by one or more other words.
So, if we took the bigram “the fox”, our original ROUGE-2 metric would only match it if that exact sequence were found in the model output. If the model instead outputs “the brown fox”, no match would be found.
ROUGE-S allows us to add a degree of leniency to our n-gram matching. For our bigram example, “the fox” would now match “the brown fox” when using a skip-bigram measure:
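As a rough sketch, skip-bigrams can be generated as every in-order pair of words, optionally capped at a maximum gap, and then scored just as before (the function names here are illustrative, not an official API):

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(text, max_skip=None):
    # Every ordered pair of words that preserves sentence order,
    # optionally limited to a maximum number of skipped words between them.
    tokens = text.lower().split()
    pairs = [(tokens[i], tokens[j])
             for i, j in combinations(range(len(tokens)), 2)
             if max_skip is None or j - i - 1 <= max_skip]
    return Counter(pairs)

def rouge_s_recall(model_output, reference, max_skip=None):
    model_pairs = skip_bigrams(model_output, max_skip)
    ref_pairs = skip_bigrams(reference, max_skip)
    overlap = sum(min(count, model_pairs[pair]) for pair, count in ref_pairs.items())
    return overlap / sum(ref_pairs.values())

# ("the", "fox") is a skip-bigram of "the brown fox", so it now matches the reference bigram.
print(skip_bigrams("the brown fox"))
```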
After calculating our recall and precision, we can calculate the F1 score too just as we did before.
Cons
ROUGE is a great evaluation metric but it comes with some drawbacks. In particular, ROUGE does not account for different words that have the same meaning, as it measures syntactic matches rather than semantics.
So, if we had two sequences with the same meaning, but which used different words to express that meaning, they could be assigned a low ROUGE score.
This can be offset slightly by using several references and taking the average score, but this will not solve the problem entirely.
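A minimal sketch of that idea, averaging one of the score functions from earlier over several references (some setups take the maximum over references instead):

```python
def multi_reference_rouge(model_output, references, score_fn=rouge_n_recall):
    # Score the output against each reference and average the results.
    return sum(score_fn(model_output, ref) for ref in references) / len(references)
```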
Nonetheless, it’s a good metric for assessing both machine translation and automatic summarization tasks and is very popular for both.