The easiest way to build an intuition for cross-entropy is to follow this chain of ideas:
Surprisal ➡️ Entropy ➡️ Cross-Entropy ➡️ Cross-Entropy Loss Function
Surprisal
Simply put, surprisal measures how surprised you are to see an outcome. If we walk into a forest and note the first animal we encounter, a panda 🐼 carries a very high surprisal value, since we are very unlikely to see one, while a bird 🐦 carries very low surprisal. See below:
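If you want to compute it yourself, surprisal is just the negative log of the outcome's probability, -log(p). A tiny sketch with made-up probabilities for the forest example:

```python
import numpy as np

def surprisal(p):
    """Surprisal (in bits) of an outcome with probability p."""
    return -np.log2(p)

# Made-up probabilities for the forest example
print(surprisal(0.001))  # panda: ~9.97 bits -> very surprising
print(surprisal(0.5))    # bird: 1.0 bit -> not surprising at all
```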
Entropy
Surprisal measures how much a single outcome surprises us. For a random event with multiple possible outcomes, we can measure how surprised we expect to be overall by multiplying each outcome's surprisal by its probability and adding the results together. The result, as you may have guessed, is entropy. See below:
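In code, entropy is just the probability-weighted average of the surprisals. A minimal sketch with an invented distribution over forest animals:

```python
import numpy as np

def entropy(probs):
    """Expected surprisal (in bits): sum over outcomes of p * -log2(p)."""
    probs = np.asarray(probs)
    return np.sum(probs * -np.log2(probs))

# Invented distribution over forest animals: bird, deer, rabbit, panda
print(entropy([0.7, 0.25, 0.049, 0.001]))  # ~1.08 bits
```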
For a deeper explanation of the Information Entropy concept, look at this video:
Cross-Entropy
Entropy is easy to calculate when we DO know the probability of each outcome. But what if we don't know the true probabilities and only have a guess (i.e., predicted probabilities)? The way to measure how surprised we expect to be is cross-entropy, which multiplies the true probability of each outcome by the surprisal of our predicted probability, then adds the results together. Take an extreme example: we have a loaded coin that comes up heads 99% of the time, but we believe it is loaded to come up tails 99% of the time. The cross-entropy will be quite large, because we will be very 'surprised' most of the time (we expect tails on almost every toss, yet we keep getting heads). One thing to note here: even if our predicted probabilities are only slightly off, the cross-entropy is still always bigger than the entropy, and the two are equal only when the prediction matches the true distribution exactly.
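To make the loaded-coin example concrete, here is a small sketch comparing the entropy of the true distribution with the cross-entropy under our wrong guess (log base 2, so the values are in bits):

```python
import numpy as np

def cross_entropy(p_true, p_pred):
    """Average surprisal when outcomes follow p_true but we believe p_pred."""
    p_true, p_pred = np.asarray(p_true), np.asarray(p_pred)
    return np.sum(p_true * -np.log2(p_pred))

p_true = [0.99, 0.01]  # the real loaded coin: 99% heads, 1% tails
p_pred = [0.01, 0.99]  # what we wrongly believe: 99% tails

print(cross_entropy(p_true, p_true))  # entropy of the true coin: ~0.08 bits
print(cross_entropy(p_true, p_pred))  # cross-entropy: ~6.58 bits, much bigger
```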
Note that the surprisal for each label is computed from the predicted probability, not the ground truth. After all, how surprised you are depends on how likely YOU think the outcome is. This answer on Quora gives a more detailed introduction to cross-entropy.
Cross-Entropy Loss
Once we understand what cross-entropy is, it's easy to wrap our heads around the cross-entropy loss. The loss function calculates the cross-entropy between the probability vector our model predicts and the ground truth (the target variable, usually a one-hot encoded vector).
Here the predicted probability vector usually comes out of a softmax activation function, and the target (ground truth) vector is a one-hot encoded vector. See the figure below:
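As a rough sketch of how the pieces fit together (the logits and class count here are made-up, and the natural log is used, as most deep learning frameworks do):

```python
import numpy as np

def softmax(logits):
    """Turn raw model scores into a probability vector."""
    exps = np.exp(logits - np.max(logits))  # shift for numerical stability
    return exps / exps.sum()

def cross_entropy_loss(target, predicted):
    """Cross-entropy between target and predicted distributions (natural log)."""
    return -np.sum(target * np.log(predicted + 1e-12))

logits = np.array([2.0, 1.0, 0.1])  # made-up raw model outputs for 3 classes
target = np.array([1.0, 0.0, 0.0])  # one-hot ground truth: class 0 is correct

predicted = softmax(logits)
print(predicted)                              # ~[0.66, 0.24, 0.10]
print(cross_entropy_loss(target, predicted))  # ~0.42
```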
The only difference compared to plain cross-entropy is that, for a typical multi-class classification dataset, the target is one-hot encoded, meaning only one label is true. So the cross-entropy loss only cares about the surprisal of that particular label. This is not always the case, though. Techniques like Mixup and Label Smoothing produce target probability vectors that are not one-hot encoded, yet cross-entropy still works perfectly well in these cases, as the sketch below shows.
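For example, with a label-smoothed target (a made-up smoothing factor of 0.1 over 3 classes), the exact same loss function from the sketch above applies unchanged:

```python
import numpy as np

# Reuses softmax and cross_entropy_loss from the sketch above
smooth = 0.1
num_classes = 3
one_hot = np.array([1.0, 0.0, 0.0])

# Label smoothing: take a little probability mass from the true class
# and spread it uniformly over all classes
smoothed_target = one_hot * (1 - smooth) + smooth / num_classes

print(smoothed_target)  # [0.933..., 0.033..., 0.033...]
print(cross_entropy_loss(smoothed_target, softmax(np.array([2.0, 1.0, 0.1]))))  # ~0.51
```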