Before we address the cost function, I would like to outline the binary cross-entropy loss briefly.
To begin with, let us say we have a distribution P and a distribution Q. Our ML model has come up with Q in an effort to approximate P. Then, we can say that the entropy of P is

H(P) = -\sum_i P(x_i) \log P(x_i)
summed over each datapoint x_i in P. Now, if we try to map Q onto P (to see how well Q models P), we can find the cross-entropy as

H(P, Q) = -\sum_i P(x_i) \log Q(x_i)
Taking the difference of the cross-entropy and the entropy, we can find the Kullback-Leibler divergence (or just KL divergence), which is a measure of the dissimilarity between two distributions:

D_{KL}(P \| Q) = H(P, Q) - H(P) = \sum_i P(x_i) \log \frac{P(x_i)}{Q(x_i)}
This means the closer Q gets to P, the lower the KL Divergence will be.
And the whole point of our machine learning model is to come up with a good Q distribution that maps P well, and hence minimizes this dissimilarity!
If we somehow miraculously map Q onto P perfectly, then our cross-entropy H(P, Q) will be equal to the entropy H(P), and the KL divergence will therefore be zero. In practice, though, Q will almost always be a little imperfect in mapping P, so H(P, Q) will be greater than H(P) and the KL divergence will take some positive value.
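To make these three quantities concrete, here is a quick NumPy sketch; the distributions P and Q below are made up purely for illustration:

```python
import numpy as np

# Two made-up discrete distributions over 4 outcomes (illustrative only).
P = np.array([0.1, 0.4, 0.3, 0.2])   # the "true" distribution
Q = np.array([0.2, 0.3, 0.3, 0.2])   # our model's approximation

entropy_p = -np.sum(P * np.log(P))          # H(P)
cross_entropy = -np.sum(P * np.log(Q))      # H(P, Q)
kl_divergence = cross_entropy - entropy_p   # D_KL(P || Q)

print(f"H(P)       = {entropy_p:.4f}")
print(f"H(P, Q)    = {cross_entropy:.4f}")
print(f"D_KL(P||Q) = {kl_divergence:.4f}")  # >= 0, and 0 only when Q == P
```

Nudge Q closer to P and you will watch the KL divergence shrink toward zero, which is exactly the behavior described above.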
If we take this formula for each point in the distribution (as we do when we find the loss of a model), we get the following formula for binary cross-entropy:

BCE = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]

where y_i is the true label and \hat{y}_i is the predicted probability for datapoint i.
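As a sanity check, here is a minimal NumPy implementation of that formula; the labels and predictions are invented for illustration, and the eps clip is a standard guard against taking log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE over all datapoints; eps guards against log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])          # ground-truth labels
y_pred = np.array([0.9, 0.2, 0.7, 0.6])  # model's predicted probabilities
print(binary_cross_entropy(y_true, y_pred))  # low loss for good predictions
```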
All said and done, BCE is best suited for classification tasks. Therefore, after customizing the loss function further to our purpose (with the addition of the minimax optimization objectives), the researchers arrived at the following:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
Let us understand each part of this equation.
1. The first term, \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)], represents the log of the output of the discriminator when fed with real samples from our training distribution x.
2. The second term, \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], represents the logarithm of the complement of the probability output by the discriminator when fed with fake samples generated from random noise z.
Here, since we now have the log probabilities instead of just the raw probabilities, we can tweak our optimization objectives just a skosh.
When training the discriminator, we wish to maximize log(D(x)) as well as log(1 - D(G(z))).
When training the generator, we wish to minimize log(1 - D(G(z))).
PS: Try the math in your head; the optimization taking place is still the same!
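To see how these objectives are typically wired up in practice, here is a sketch using PyTorch's built-in BCE loss. The tiny networks, layer sizes, and data below are placeholders rather than anything from the original paper. One common detail: most implementations swap the generator's minimization of log(1 - D(G(z))) for the equivalent "non-saturating" objective of maximizing log(D(G(z))), which gives stronger gradients early in training; the sketch follows that convention.

```python
import torch
import torch.nn as nn

# Placeholder networks; real GANs would use proper architectures.
D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

bce = nn.BCELoss()
real = torch.randn(8, 2)   # stand-in for a batch of real samples x
z = torch.randn(8, 4)      # random noise z fed to the generator
ones, zeros = torch.ones(8, 1), torch.zeros(8, 1)

# Discriminator step: maximize log(D(x)) + log(1 - D(G(z))),
# which is the same as minimizing BCE(D(x), 1) + BCE(D(G(z)), 0).
d_loss = bce(D(real), ones) + bce(D(G(z).detach()), zeros)

# Generator step: instead of minimizing log(1 - D(G(z))) directly,
# maximize log(D(G(z))), i.e. minimize BCE(D(G(z)), 1).
g_loss = bce(D(G(z)), ones)
```

Try the math here too: minimizing BCE against a target of 1 or 0 recovers exactly the log terms in the value function above.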
Now that we’re armed with what exactly our loss function is measuring and how we’re planning on optimizing that, we have a pretty firm overall idea of how this system works. Let us check out some of the incredible stuff that GANs have been used to build so far!