Generative Adversarial Networks (GAN): An Intuitive Introduction

Counterfeit items sold on the streets in South Korea [Photo by Adli Wahid on Unsplash]

Have you watched the movie Catch Me If You Can, starring Leonardo DiCaprio? It is based on the real-life story of conman Frank Abagnale, who forged payroll cheques worth millions of dollars. He became so good at it that the FBI eventually turned to him to help catch other forgers.

A GAN is very much like an adversarial game between a conman and the FBI. On one side we have a generator that tries to “forge” things (Van Gogh’s paintings, Shakespeare’s plays, Beethoven’s piano compositions) or imitate the playing style of the late pianist Glenn Gould. On the other side we have a discriminator that tries to identify whether an item is forged or genuine.

Initially, the generator (conman) produces crude imitations. The discriminator (FBI) identifies these fakes and catches the conman easily. But after countless trips in and out of jail, each time refining his designs, the conman learns to produce more and more realistic-looking imitations. At some point, his imitations become indistinguishable from the genuine articles.

Most of us studying machine learning are probably more familiar with discriminative models: binary classifiers, multi-class classifiers, multi-label classifiers and so on. A classifier captures the conditional probability P(c|x): the probability of a class label c given the observation x. Generative models, on the other hand, capture the probability P(x) of the observation, or the joint probability P(x,c) of the observation and its label.
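To make the difference concrete, here is a minimal sketch with a made-up joint distribution over a two-valued observation x and a class label c. The table values and the function names are assumptions chosen purely for illustration. A generative model would describe the whole table (or its marginal P(x)), while a classifier only needs the conditional P(c|x) derived from it.

```python
# Toy joint distribution P(x, c). The numbers are invented for illustration
# and sum to 1.
joint = {
    ("short", "cat"): 0.30,
    ("tall", "cat"): 0.10,
    ("short", "dog"): 0.15,
    ("tall", "dog"): 0.45,
}


def marginal_x(x):
    """P(x): sum the joint probability over all class labels."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def conditional_c_given_x(c, x):
    """P(c | x) = P(x, c) / P(x), the quantity a classifier models."""
    return joint[(x, c)] / marginal_x(x)


print(marginal_x("tall"))                    # about 0.55
print(conditional_c_given_x("dog", "tall"))  # about 0.82
```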

Generative models draw samples from a latent space to synthesize new data. Latent space is a multi-dimensional space in which each point corresponds to an observation (i.e. a piece of synthesized data). As a simple analogy, it can be thought of as a numerical representation of domain characteristics. For example, the representation for facial images might encode something like:

two eyes, one nose, one mouth
eyes are next to each other, located above the nose
mouth is located below the nose, and determines whether the face is smiling, neutral or frowning
facial features could represent male or female
there could be accessories like glasses, masks, make up and so on …

Each of these characteristics is represented by a dimension (variable) in the space. Such a representation allows very interesting latent space vector arithmetic. In this paper, the authors perform latent vector arithmetic on facial images as shown below. The vectors represent the image characteristics of man, woman and glasses, and can be manipulated arithmetically with convincing output.

Latent vector arithmetic, where the resulting vector is input to a generator to generate the output image. [Image taken from this paper]
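The arithmetic itself is ordinary vector addition and subtraction. The sketch below assumes, purely for illustration, that each concept is a sum of independent attribute directions in a small 8-dimensional latent space; real latent spaces are learnt and only approximately behave this way. All names (`man`, `woman`, `glasses`, `DIM`) are hypothetical.

```python
import random

rng = random.Random(0)
DIM = 8  # hypothetical latent dimensionality, chosen only for illustration


def random_direction():
    """A made-up attribute direction in latent space."""
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


# Assumption: each concept is the sum of independent attribute directions.
man, woman, glasses = random_direction(), random_direction(), random_direction()

z_man_with_glasses = add(man, glasses)
z_man = man
z_woman = woman

# "man with glasses" - "man" + "woman" lands on "woman with glasses".
z_result = add(sub(z_man_with_glasses, z_man), z_woman)
z_expected = add(woman, glasses)

max_error = max(abs(a - b) for a, b in zip(z_result, z_expected))
print(max_error)  # only floating-point rounding error remains
```

In a trained GAN, `z_result` would then be fed to the generator to produce the output image.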

Training a generative model to synthesize data in a specific domain is akin to learning the latent space representation of the domain. Subsequently, the model generates synthetic data, each one slightly different, by sampling from the unlimited number of points in the learnt space.

In a more formal definition, a generative model 𝐺 parameterized by 𝜃 takes a random noise vector z as input and generates an output sample 𝐺(z; 𝜃) with probability distribution p_generated. A good generative model is able to generate new, plausible samples which are indistinguishable from the real input samples. In other words, p_generated ≈ p_input.

In this paper, the authors present an interesting example to illustrate why generative models are worth studying. Generative models, and GANs in particular, enable machine learning to work with multi-modal outputs. For example, take a regression model trained to minimize the mean square error between the actual and predicted output. Such a model can only produce a single predicted output for each input sample. However, for many tasks, a single input may correspond to many different correct answers, each of which is acceptable. The figure below illustrates a computer rendering of the predicted frame of a video sequence.

The output of a next-frame prediction task from a video sequence, showing the outputs from a model trained with only mean square error loss (MSE) and with an additional GAN loss (Adversarial). [Image taken from this paper]

In this task, a model was trained to predict the next frame in a video sequence. The image on the left is the ground truth, or target output in the usual sense of supervised training. The image in the center shows the predicted output of a model trained with mean square error (MSE). The model is forced to output a single answer of what the next frame looks like. Because there are many possible futures, corresponding to slightly different positions of the head, the single answer that the model chooses corresponds to an average over many slightly different images. This causes the ears to practically vanish and the eyes to become blurry.
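The averaging effect can be reproduced with a one-pixel toy problem. In this sketch (the data and names are invented for illustration), the same input is followed by a dark pixel (0.0) half of the time and a bright pixel (1.0) the other half. The single prediction that minimizes MSE is the mean, which matches neither possible future.

```python
# Two equally likely "futures" for one pixel, given the same input.
targets = [0.0, 1.0] * 50


def mse(prediction):
    """Mean square error of a single constant prediction."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Search over candidate predictions 0.00, 0.01, ..., 1.00.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=mse)

print(best, mse(best))     # 0.5 0.25
print(mse(0.0), mse(1.0))  # 0.5 0.5
```

Predicting either sharp outcome costs twice as much MSE as predicting the blurry average, so a model trained with MSE alone is pushed toward the blur.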

Using an additional GAN loss, the model is able to understand that there are many possible outputs. The generative model, having learnt the distribution of the output images, is able to sample from one of the possible outputs. The result is the image on the right which is sharp and recognizable as a realistic, detailed image.

A GAN is a deep learning framework for training generative models via an adversarial process. The basic framework is illustrated in the image below, and consists of:

a generator network G which takes a random noise vector z to generate synthetic data G(z), and
a discriminator network D which takes real data x and synthetic data G(z) and detects whether a particular data sample is real or fake.

GAN framework with generator and discriminator networks. [Image by author]

Both networks are trained in such a way that the generator eventually synthesizes data samples which the discriminator is unable to distinguish from real data samples. GAN has been successfully used in computer vision applications, especially for image synthesis and image-to-image translation. Many architecture variations have been proposed over the years. However, the framework is known to be unstable and hard to train. It suffers from vanishing gradient and mode collapse problems during network training. The solution to such problems is still an active research area.

The training is formulated as a two-player minimax game. Here, the generator 𝐺 tries to minimize the gap between the synthetic and real samples so as to fool the discriminator, while the discriminator 𝐷 tries to maximize its ability to tell real and synthetic samples apart.
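In the original GAN paper (Goodfellow et al., 2014), this game is written as a value function V(D, G). The form below uses this article's p_input for the real-data distribution; p_z, the distribution from which the noise vector z is drawn, is a symbol introduced here for the formula.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{input}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

Here D(x) is the probability the discriminator assigns to x being real. The discriminator raises V by pushing D(x) toward 1 on real data and D(G(z)) toward 0 on synthetic data, while the generator lowers V by pushing D(G(z)) toward 1.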

Intuitively, the generator can be thought of as analogous to a team of counterfeiters trying to produce fake cheques and use them without detection. The discriminator is analogous to the FBI, trying to detect the counterfeit cheques. The adversarial training process is analogous to a competition in which both teams iteratively improve their methods until the counterfeits are indistinguishable from the genuine articles.

The two networks are trained in alternating steps. The discriminator network is trained for one or more epochs while the generator network is held constant, and vice versa. The steps are repeated, ideally until the discriminator reaches about 50% accuracy. At this point, the generated images are indistinguishable from the real images and the discriminator is merely guessing.

4.1 Discriminator training phase

The discriminator classifies both real data and fake data from the generator. Hence, it is simply a binary classifier which aims to maximize the probability of correctly classifying real and synthesized input. The objective can be defined by cross-entropy loss.

Discriminator training. [Image by author]
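A minimal sketch of that cross-entropy objective, written in plain Python. It assumes the discriminator outputs the probability that a sample is real, with label 1 for real data and label 0 for generator output; the function names and the example probabilities are illustrative, not taken from any particular library.

```python
import math

EPS = 1e-12  # small constant to avoid log(0)


def bce(prob_real, label):
    """Binary cross-entropy for one sample; label is 1 (real) or 0 (fake)."""
    return -(label * math.log(prob_real + EPS)
             + (1 - label) * math.log(1.0 - prob_real + EPS))


def discriminator_loss(d_real, d_fake):
    """Average BCE over real samples (label 1) and fake samples (label 0)."""
    losses = [bce(p, 1) for p in d_real] + [bce(p, 0) for p in d_fake]
    return sum(losses) / len(losses)


# A discriminator that separates real from fake well has a low loss ...
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# ... while one that outputs 0.5 everywhere has a loss of log(2).
guessing = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(confident, guessing)
```

During this phase only the discriminator weights would be updated to reduce this loss; the generator is held fixed.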

4.2 Generator training phase

Training for the generator network is slightly more complicated. It involves both the generator and discriminator networks. An iteration of generator training involves the following steps:

Sample random noise.
Produce generator output.
Pass the output to discriminator.
Calculate the binary classification loss.
Back-propagate through both discriminator and generator network.
Update only the generator weights.

Generator training. [Image by author]
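The six steps listed above can be sketched with a deliberately tiny one-dimensional model. Everything here is an assumption made for illustration: the generator is G(z) = a·z + b, the discriminator is D(x) = sigmoid(w·x + c), gradients are derived by hand, and the generator loss is the commonly used form −log D(G(z)), i.e. binary cross-entropy with the fake samples labelled as real.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def generator_step(gen, disc, rng, batch_size=64, lr=0.05):
    """One generator update; the discriminator parameters are only read."""
    a, b = gen["a"], gen["b"]
    w, c = disc["w"], disc["c"]
    grad_a = grad_b = loss = 0.0
    for _ in range(batch_size):
        z = rng.gauss(0.0, 1.0)           # 1. sample random noise
        x_fake = a * z + b                # 2. produce generator output
        d_out = sigmoid(w * x_fake + c)   # 3. pass the output to the discriminator
        loss += -math.log(d_out + 1e-12)  # 4. classification loss, target "real"
        # 5. back-propagate through the discriminator, then the generator
        d_logit = d_out - 1.0             # d(loss) / d(discriminator logit)
        d_x = d_logit * w                 # gradient at the discriminator input
        grad_a += d_x * z                 # d(x_fake) / da = z
        grad_b += d_x                     # d(x_fake) / db = 1
    # 6. update only the generator weights
    gen["a"] = a - lr * grad_a / batch_size
    gen["b"] = b - lr * grad_b / batch_size
    return loss / batch_size


rng = random.Random(0)
gen = {"a": 1.0, "b": 0.0}
disc = {"w": 1.0, "c": -2.0}  # this toy discriminator rates larger x as more "real"
for _ in range(50):
    generator_step(gen, disc, rng)
print(gen, disc)
```

In this toy setup the generator shifts its outputs toward larger values, where the fixed discriminator is more easily fooled, while `disc` is left untouched.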

Many variations of GAN have been proposed to perform various tasks. Here we will look at several basic variations. This section provides a brief overview with links to the original papers for more detailed reading. It is somewhat more technical and requires a basic understanding of neural network architectures.

5.1 Deep Convolutional GAN (DCGAN)

The generator and discriminator in the original GAN were constructed with fully-connected networks. However, GANs are widely applied in image synthesis, where convolutional neural networks (CNN) are the natural choice. Hence, a family of network architectures called DCGAN (deep convolutional GAN) was proposed. It allows training a pair of deep convolutional generator and discriminator networks.

Typically, convolutional networks use fixed spatial operations, such as max or average pooling for downsampling, to resize the image as it moves through the network. DCGAN uses strided and fractionally-strided convolutions instead, which are learnable. This allows the spatial downsampling and upsampling operators to be learned during training. These operators handle the change in sampling rates and locations, a key requirement in mapping from image space to a possibly lower-dimensional latent space, and from image space to a discriminator.

In the paper, the authors present a deep convolutional generator structure designed for their experimental needs. The structure is shown in the figure below. The first layer of the generator takes a 100-dimensional uniformly distributed noise vector 𝑧 as input to a fully-connected layer. The result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack. A series of four fractionally-strided convolutions then converts this high-level representation into a 64 × 64 pixel image. For the discriminator, the last convolution layer is flattened and then fed into a single sigmoid output.

Deep convolutional generator structure for DCGAN. [Image taken from this paper]
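To see how four fractionally-strided convolutions can reach 64 × 64 pixels, it helps to track the spatial size layer by layer. The sketch below uses the standard output-size formulas for strided and transposed convolutions. It assumes the stack starts from a 4 × 4 feature map and that every layer uses kernel size 4, stride 2 and padding 1; these hyperparameters are a common choice in DCGAN implementations, not values confirmed by the text above.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a strided convolution (learned downsampling)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding):
    """Output size of a fractionally-strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel


KERNEL, STRIDE, PADDING = 4, 2, 1  # assumed values, for illustration only

# Generator: each fractionally-strided convolution doubles the resolution.
generator_sizes = [4]
for _ in range(4):
    generator_sizes.append(
        conv_transpose_out(generator_sizes[-1], KERNEL, STRIDE, PADDING))

# Discriminator: each strided convolution halves the resolution.
discriminator_sizes = [64]
for _ in range(4):
    discriminator_sizes.append(
        conv_out(discriminator_sizes[-1], KERNEL, STRIDE, PADDING))

print(generator_sizes)      # [4, 8, 16, 32, 64]
print(discriminator_sizes)  # [64, 32, 16, 8, 4]
```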

5.2 Conditional GAN (CGAN)

In the original GAN, there is no control over the generated images since they depend only on random noise as input. A conditional version of GAN was proposed shortly after the introduction of GAN. The generation process can be directed by conditioning the model on additional information. Typically, the conditional vector y is concatenated with the noise vector z and the resulting vector is input to the original GAN.

The conditional vector is arbitrary. For example, in the CGAN paper, the authors use one-hot vectors of class labels to condition the GAN to generate MNIST digits. They also demonstrate automatic tagging of images, conditioned on a feature vector of the image extracted by the convolution layers of a pre-trained AlexNet.

Structure of CGAN, in which the generator and discriminator are conditioned by additional vector y. [Image taken from this paper]
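The concatenation step is simple enough to show directly. This sketch builds a generator input for a requested MNIST digit by appending a one-hot class vector y to a noise vector z. The noise dimension of 100, the uniform noise range and the helper names are assumptions chosen for illustration.

```python
import random

NUM_CLASSES = 10  # MNIST digits 0-9
NOISE_DIM = 100   # assumed noise size, for illustration only


def one_hot(label, num_classes=NUM_CLASSES):
    """Conditional vector y: 1.0 at the class index, 0.0 elsewhere."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def generator_input(label, rng):
    """Concatenate the noise vector z with the conditional vector y."""
    z = [rng.uniform(-1.0, 1.0) for _ in range(NOISE_DIM)]
    return z + one_hot(label)


rng = random.Random(0)
g_in = generator_input(7, rng)
print(len(g_in))              # 110
print(g_in[-NUM_CLASSES:])    # one-hot encoding of the digit 7
```

The discriminator is conditioned in the same way, receiving y alongside the real or generated image.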

5.3 Auxiliary Classifier GAN (ACGAN)

ACGAN is an extension of the conditional GAN that gives the discriminator an extra task: predicting the class label of the given image on top of the real/fake classification. Forcing a model to perform additional tasks (multitask learning) is known to improve performance on the original task. It has the effect of stabilizing the training process and allowing the generation of large, high-quality images whilst learning a representation in the latent space that is independent of the class label. The structure of ACGAN is shown below. The authors demonstrated that this method can help to generate sharper images at higher resolution.

Structure of ACGAN in image synthesis, where the discriminator performs additional task of predicting the class label of image. [Figure taken from this paper]
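One way to picture the two tasks is as two loss terms added together. The sketch below is a simplified per-sample version, assuming the discriminator returns both a probability that the image is real and a probability distribution over classes. The function name and the example numbers are hypothetical.

```python
import math

EPS = 1e-12  # small constant to avoid log(0)


def acgan_discriminator_loss(p_real, class_probs, true_class, is_real):
    """Source (real/fake) loss plus auxiliary class-label loss for one image."""
    p_source = p_real if is_real else 1.0 - p_real
    source_loss = -math.log(p_source + EPS)               # real vs fake
    class_loss = -math.log(class_probs[true_class] + EPS)  # which class
    return source_loss + class_loss


# A real image of class 1, judged real with probability 0.9 in both cases.
good = acgan_discriminator_loss(0.9, [0.05, 0.90, 0.05], true_class=1, is_real=True)
bad_class = acgan_discriminator_loss(0.9, [0.90, 0.05, 0.05], true_class=1, is_real=True)

print(good, bad_class)
```

Even when the real/fake decision is correct, a wrong class prediction is penalized, which is what pushes the discriminator to learn class-relevant features.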

5.4 Bidirectional GAN (BiGAN)

A GAN can transform a noise vector z from a simple latent distribution into synthetic data samples 𝐺(z) with a fairly complex data distribution. It was demonstrated that the latent space of such generators captures semantic variation in the data distribution. Intuitively, models trained to predict these semantic latent representations given data may serve as useful feature representations for auxiliary problems where semantics are relevant. However, GANs lack the ability to map a data sample x into a latent feature z.

BiGAN was proposed as a means of learning this inverse mapping. It was demonstrated that the resulting learned feature representation is useful for auxiliary supervised discrimination tasks, competitive with contemporary approaches to unsupervised and self-supervised feature learning.

The structure of BiGAN is shown below. In addition to the generator 𝐺 from the standard GAN framework, BiGAN includes an encoder 𝐸 which maps data x to latent representations z. The BiGAN discriminator 𝐷 discriminates not only in data space (x versus 𝐺(z)), but jointly in data and latent space (tuples (x, 𝐸(x)) versus (𝐺(z), z)), where the latent component is either an encoder output 𝐸(x) or a generator input z.

Structure of BiGAN. [Image taken from this paper]
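The two kinds of tuples seen by the BiGAN discriminator can be illustrated with a one-dimensional toy. In this sketch the generator and encoder are hand-written functions rather than trained networks, and the encoder is chosen to be the exact inverse of the generator, which is the situation the BiGAN objective is designed to encourage. All functions and numbers are assumptions for illustration.

```python
import random


def G(z):
    """Toy generator: maps latent z to data space."""
    return 2.0 * z + 1.0


def E(x):
    """Toy encoder: maps data x back to latent space (here, the inverse of G)."""
    return (x - 1.0) / 2.0


rng = random.Random(0)
real_x = [rng.gauss(1.0, 2.0) for _ in range(5)]   # stand-in for real data
noise_z = [rng.gauss(0.0, 1.0) for _ in range(5)]  # latent samples

encoder_pairs = [(x, E(x)) for x in real_x]      # tuples (x, E(x))
generator_pairs = [(G(z), z) for z in noise_z]   # tuples (G(z), z)


def consistent(pair):
    """True when the data part equals the generator applied to the latent part."""
    x, z = pair
    return abs(G(z) - x) < 1e-9


print(all(consistent(p) for p in encoder_pairs + generator_pairs))  # True
```

Because both kinds of tuples obey the same relationship between x and z, a discriminator looking at the joint pair has nothing to separate them with; any mismatch between 𝐸 and 𝐺 would give it a signal to exploit.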

GAN has been the backbone of many real world applications, especially in image synthesis. It is also a highly active research area. Some interesting applications are discussed here.

6.1 Image super resolution

In this application, a GAN is used to reconstruct a higher-resolution image from a low-resolution input. The first such framework was introduced in 2017, and was capable of inferring realistic natural images for a 4x upscaling operation.

The quality of image upscaling (4x) comparing bicubic interpolation, super-resolution ResNet optimized with MSE, and super-resolution with GAN optimized with perceptual loss. [Image taken from this paper]

In this paper, a GAN training methodology was introduced in the context of generating plausible facial images. The key idea is to grow the generator and discriminator progressively: training starts at a low resolution, and new layers are added as training progresses so that the model captures increasingly fine details. The generated images were impressive and very realistic. Since then, image quality and resolution have kept improving.

Recent models: Progressive Face Super Resolution (2019), super-resolution for facial images from surveillance camera (2021)

6.2 Image inpainting

Image inpainting refers to the technique of restoring and reconstructing images based on background information. The generated images are expected to look very natural and to be difficult to distinguish from the ground truth. High-quality image inpainting requires not only that the semantics of the generated content be reasonable, but also that the texture of the generated image be clear and realistic enough. For example, ExGAN is able to modify a facial image with closed eyes into one with open eyes while preserving the subject’s identity.

Result from eye opening ExGAN model. [Image taken from this paper]

Recent models: Deepfillv2 (2019), EdgeConnect (2019)

6.3 Image-to-image translation

The original GAN was proposed to generate images from noise. Image-to-image GANs generate diverse images from images. The goal of image translation is to learn the mapping from the source image domain to the target image domain, changing the style or some other properties of the source domain to those of the target domain while keeping the image content unchanged.

In 2017, the pix2pix software was released in association with work on multi-purpose image-to-image translation with conditional GANs. It was able to generate, for example, realistic images from sketches.

Results from pix2pix, enabling realistic image generation and transformation. [Image taken from this paper]

Recent models: pix2pixHD-Aug (2020), toon2real (2021)