NLP Metrics Made Simple: The BLEU score

Let’s look at the calculation more formally. For each word w in the candidate, we count how many times it appears in the candidate (the number of tokens for this word). Let’s call this number D(w). In our example:

D(but)=1
D(love)=3
D(other)=1
D(friend)=1
D(for)=1
D(yourself)=1
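
The counts D(w) can be reproduced in a few lines of Python. This is only a minimal sketch: it assumes lowercase, whitespace-separated tokens, and the variable names are illustrative.

```python
from collections import Counter

# Candidate sentence from the running example, split on whitespace.
candidate = "but love other love friend for love yourself".split()

# D maps each distinct word to the number of times it occurs in the candidate.
D = Counter(candidate)

for word, count in D.items():
    print(f"D({word})={count}")
```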

For each word, we also define R(w) to be the largest number of times the word appears in any of the references. We calculate this by looking at how many times w appears in each reference, and taking the maximum value. In our example:

R(but)=1
R(love)=2 [appears twice in R3]
R(other)=0
R(friend)=0
R(for)=2 [appears twice in R2]
R(yourself)=1
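
The "maximum over references" step could be sketched as follows. Note that the three reference sentences below are invented stand-ins, not the references used in this post; they were chosen only so that the resulting maxima agree with the R(w) values listed above. Whitespace tokenization is likewise an assumption.

```python
from collections import Counter

# Invented stand-in references (NOT the article's R1-R3), chosen so that the
# resulting maxima agree with the R(w) values listed above.
references = [
    "but love yourself",
    "for better or for worse",
    "love is love",
]
candidate = "but love other love friend for love yourself".split()

# Count the words of each reference separately.
reference_counts = [Counter(ref.split()) for ref in references]

# R(w) is the largest count of w in any single reference.
# A Counter returns 0 for words it has never seen, so uncovered words get 0.
R = {w: max(counts[w] for counts in reference_counts) for w in set(candidate)}

for word in sorted(R):
    print(f"R({word})={R[word]}")
```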

Our very naïve and basic BLEU score (we’ll call it BLEU*) is the fraction of candidate tokens that are covered by the references, that is, the number of Covered candidate tokens divided by the Total number of candidate tokens:

BLEU*=Covered/Total

Let’s start with the denominator Total, which is super simple to compute. It’s the number of tokens in the candidate. A fancy way to write it is as a sum over the distinct words W1, W2, … of the candidate:

Total = D(W1)+D(W2)+…

In our case

Total=D(but)+D(love)+D(other)+D(friend)+D(for)+D(yourself)
=1+3+1+1+1+1
=8

Now for the numerator Covered, which is the total number of covered tokens. For each word w, the number of tokens is D(w), but the coverage is limited by R(w). So if D(w)≤R(w), all D(w) tokens are covered. Otherwise, only R(w) tokens are covered. The number of covered tokens for word w can therefore be written as MIN(D(w), R(w)), where MIN is the smaller of the two values.

Let’s see how this works out for our example:

MIN(D(but), R(but))=MIN(1, 1)=1
MIN(D(love), R(love))=MIN(3, 2)=2
MIN(D(other), R(other))=MIN(1, 0)=0
MIN(D(friend), R(friend))=MIN(1, 0)=0
MIN(D(for), R(for))=MIN(1, 2)=1
MIN(D(yourself), R(yourself))=MIN(1, 1)=1

The total coverage is the sum of the above values:

Covered=1+2+0+0+1+1=5

We can finally calculate our BLEU* score for our candidate:

BLEU*(but love other love friend for love yourself)
=Covered/Total
=5/8
=0.625
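
The arithmetic above can be double-checked with a short script. This sketch simply hard-codes the D(w) and R(w) values from the example as dictionaries; the variable names are illustrative.

```python
# D(w) and R(w) exactly as listed in the worked example.
D = {"but": 1, "love": 3, "other": 1, "friend": 1, "for": 1, "yourself": 1}
R = {"but": 1, "love": 2, "other": 0, "friend": 0, "for": 2, "yourself": 1}

# Total: number of tokens in the candidate.
total = sum(D.values())

# Covered: each word contributes at most R(w) of its D(w) tokens.
covered = sum(min(D[w], R[w]) for w in D)

bleu_star = covered / total
print(covered, total, bleu_star)  # 5 8 0.625
```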

The naïve BLEU* I described above is not used in practice because it has many issues that render it highly inaccurate. I’ve introduced it to give the idea behind the “true” BLEU score. Here are some of the problems with the naïve BLEU*.

First, very short translations (i.e., candidates with very few tokens) can do absurdly well even though they are likely to be horrible translations. Imagine the candidate is simply the one-word “love” or the two-word “but for”. Both candidates get a perfect BLEU* score of 1 because every one of their tokens is covered by the references.
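
To see this concretely, here is a small sketch that scores both short candidates against the R(w) values from the worked example. The helper name `bleu_star_from_counts` is hypothetical, and the sketch assumes a non-empty, whitespace-tokenized candidate.

```python
from collections import Counter

# R(w) from the worked example; words missing from the table count as 0.
R = {"but": 1, "love": 2, "other": 0, "friend": 0, "for": 2, "yourself": 1}

def bleu_star_from_counts(candidate, R):
    """Naive BLEU* for a non-empty candidate, given the max reference counts R."""
    D = Counter(candidate.split())
    covered = sum(min(count, R.get(word, 0)) for word, count in D.items())
    return covered / sum(D.values())

print(bleu_star_from_counts("love", R))     # 1.0
print(bleu_star_from_counts("but for", R))  # 1.0
```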

In addition, the BLEU* metric is completely oblivious to token order. The BLEU* score for “but love other love friend for love yourself” is exactly the same as the score for “other love love love for friend yourself but”. In languages where word order matters (English and many others), this doesn’t really make sense.
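
The reason is that BLEU* only ever looks at the counts D(w), and counting discards word order. A quick check, assuming whitespace tokenization:

```python
from collections import Counter

original = "but love other love friend for love yourself"
shuffled = "other love love love for friend yourself but"

# Both sentences yield the same word counts D(w), so every MIN(D(w), R(w))
# term, and therefore the BLEU* score, is identical for the two.
same_counts = Counter(original.split()) == Counter(shuffled.split())
print(same_counts)  # True
```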

Lastly, we only calculated the BLEU* score for a single sentence. To measure the performance of our MT model, it makes sense not to rely on a single instance, but to check the performance on many sentences, and combine the scores for a more comprehensive and accurate evaluation of the model.

To rectify these issues, the true BLEU score incorporates several corrections to my naïve BLEU* calculations.

Finally, I wrote some simple Python code that computes the BLEU* score. Note that this is for educational purposes only: do not use it for industrial or academic purposes! The code includes two versions of the function, which produce identical scores. The first (BLUE_star) is longer but follows the above procedure step by step and is thus easier to understand. The second (BLUE_star_compact) uses (or perhaps even abuses) Python’s list comprehensions and is hence more compact.

How BLEU* is calculated — for educational purposes only
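
For readers who want something runnable, here is a minimal sketch of what a step-by-step version and a compact version could look like. It is not the author's code: the function names `bleu_star` and `bleu_star_compact`, the whitespace tokenization, and the choice to return 0.0 for an empty candidate are all assumptions made for illustration.

```python
from collections import Counter

def bleu_star(candidate, references):
    """Step-by-step naive BLEU*: covered candidate tokens / total candidate tokens."""
    cand_tokens = candidate.split()
    D = Counter(cand_tokens)                            # D(w): counts in the candidate
    ref_counts = [Counter(ref.split()) for ref in references]

    covered = 0
    for word, d in D.items():
        # R(w): largest count of the word in any single reference (0 if no references).
        r = max((counts[word] for counts in ref_counts), default=0)
        covered += min(d, r)                            # at most R(w) tokens are covered

    total = len(cand_tokens)
    return covered / total if total else 0.0            # assumption: empty candidate -> 0.0

def bleu_star_compact(candidate, references):
    """Same computation, squeezed into comprehensions."""
    D = Counter(candidate.split())
    refs = [Counter(ref.split()) for ref in references]
    covered = sum(min(d, max((c[w] for c in refs), default=0)) for w, d in D.items())
    return covered / max(sum(D.values()), 1)

# Invented stand-in references (NOT the article's), consistent with the R(w) values above.
references = ["but love yourself", "for better or for worse", "love is love"]
candidate = "but love other love friend for love yourself"
print(bleu_star(candidate, references))          # 0.625
print(bleu_star_compact(candidate, references))  # 0.625
```

Both functions count each reference separately and take the per-word maximum, so adding more references can only raise the score.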

I hope you liked this basic BLEU tutorial. Feel free to leave feedback below. Thank you!
