
GOAL: Finding the true training distribution!
In this post, we'll discuss latent-based models for image generation. The objective is to generate natural, realistic, high-quality images using a latent-based generative model.
This blog post is meant to give a high-level overview of the different latent models used for image generation. I recommend reading the original papers for more detail.
Latent-based models
Models that learn to represent high-dimensional data in a low-dimensional latent space (Z) using a bottleneck architecture are called latent models. Models like the autoencoder and the variational autoencoder are the pioneers of latent-based modeling.
Autoencoder
It consists of an encoder and a decoder network. The encoder takes an input X and learns to represent it in a lower-dimensional space Z; Z is then passed to the decoder to reconstruct an approximation of X. Z is a vector: in the image below, the MNIST input is transformed into a point in two-dimensional space, from which the decoder network recovers the MNIST image. The network is optimized using mean squared error, called the reconstruction loss.
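The encoder/decoder structure above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the post's exact model: the layer sizes and the 2-dimensional bottleneck are assumptions chosen to match the MNIST example.

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch for flattened 28x28 MNIST images.
# Layer sizes are illustrative assumptions, not from the original post.
class Autoencoder(nn.Module):
    def __init__(self, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(784, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),          # bottleneck: the latent vector Z
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 784), nn.Sigmoid(),   # reconstruct pixel values in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = Autoencoder()
x = torch.rand(16, 784)                          # dummy batch standing in for MNIST
x_hat, z = model(x)
recon_loss = nn.functional.mse_loss(x_hat, x)    # the reconstruction loss
```

Training then just minimizes `recon_loss` with any optimizer; the 2-D `z` is what gets plotted as a point per image.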
Variational Autoencoder
A VAE has a similar architecture to an autoencoder; the only difference is that the encoder's output is mapped to a probability distribution (a normal). Where an autoencoder's encoder generates a vector Z directly, the VAE's encoder outputs a mean and a standard deviation (std). We build a normal distribution from this mean and std, sample a vector Z from it, and pass the sample to the decoder network. Two loss functions are optimized in a VAE: the reconstruction loss, and the KL divergence loss, which minimizes the difference between two probability distributions. For the KL divergence, we compare the encoder's distribution with the standard normal distribution (mean = 0, std = 1).
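The VAE-specific pieces can be sketched as below: the encoder head outputs a mean and (log-)variance, Z is sampled via the standard reparameterization trick, and the closed-form KL term against N(0, 1) is added to the loss. The module names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the parts that distinguish a VAE from a plain autoencoder.
class VAEHead(nn.Module):
    def __init__(self, hidden=128, latent_dim=2):
        super().__init__()
        self.fc_mu = nn.Linear(hidden, latent_dim)       # predicts the mean
        self.fc_logvar = nn.Linear(hidden, latent_dim)   # predicts log(variance)

    def forward(self, h):
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)             # reparameterization trick
        return z, mu, logvar

def kl_divergence(mu, logvar):
    # Closed-form KL( N(mu, std^2) || N(0, 1) ), summed over latent dims,
    # averaged over the batch.
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()

head = VAEHead()
h = torch.randn(16, 128)           # pretend encoder features for a batch of 16
z, mu, logvar = head(h)
loss_kl = kl_divergence(mu, logvar)
```

The total VAE loss is then `reconstruction_loss + loss_kl` (often with a weight on the KL term); sampling `z` through `randn_like` keeps the whole path differentiable.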
VAEs are trained to mimic the training distribution, so any bias in the training set causes the VAE to generate similarly biased images.
NVAE: A Deep Hierarchical Variational Autoencoder
NVAE uses a deep hierarchical VAE architecture for image generation, with depthwise separable convolutions and batch normalization. It uses residual parameterization, meaning a residual network (ResNet) for finding the distribution parameters, and spectral regularization to stabilize training. The concepts used in NVAE are not new, but a lot of creative thinking and careful engineering went into building this network.
On residual parameterization: recall that the key idea behind ResNet's skip connections was to overcome the degradation problem and avoid vanishing gradients. A VAE suffers from instability as the hierarchy of the network grows large, which is why the residual network comes into play.
Note that regular convolutions are used in the encoder network. The objective function remains the same as in a VAE.
Like other latent models, NVAE has a bidirectional encoder and a generative model that plays the role of the decoder. The encoder consists of hierarchical groups of ResNet blocks, producing a sequence of latent samples {z_1, z_2, ..., z_l} that are tied together by feature combination (concatenation). The encoder's goal is to produce an approximate distribution of the Z's given X, i.e. q(z|x). Weights are shared bottom-up to infer the latent variables Z.
In the decoder, the top-down model takes a sample from each group of Z's, combines it with the deterministic feature maps learned by the ResNet block, and passes the result to the next group of Z. Weights are shared across each hierarchy group.
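The top-down pass described above can be sketched as a chain of groups, each sampling its latent and feeding a combined feature map to the next group. This is a toy illustration under my own assumptions (linear layers instead of ResNet blocks, made-up dimensions), not NVAE's actual architecture.

```python
import torch
import torch.nn as nn

# Toy sketch of a hierarchical top-down generative pass: each group samples a
# latent z_i, combines it with the incoming deterministic features, and passes
# the result down to the next group. All modules and sizes are illustrative.
class TopDownGroup(nn.Module):
    def __init__(self, feat_dim=64, latent_dim=16):
        super().__init__()
        self.prior = nn.Linear(feat_dim, 2 * latent_dim)        # predicts mu, logvar
        self.combine = nn.Linear(feat_dim + latent_dim, feat_dim)

    def forward(self, h):
        mu, logvar = self.prior(h).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # sample z_i
        return self.combine(torch.cat([h, z], dim=1))            # features for next group

groups = nn.ModuleList([TopDownGroup() for _ in range(3)])  # 3 hierarchy levels
h = torch.zeros(8, 64)      # in the real model this start state is learned
for g in groups:
    h = g(h)                # each group refines the features with its own z_i
```

After the last group, `h` would be decoded into an image; during training, each group's prior is matched against the corresponding encoder posterior with a per-group KL term.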
The success of NVAE rests on the careful engineering of its neural network architecture. NVAE was the first non-autoregressive model to compete with autoregressive models (the state of the art at the time) for high-quality image generation.
I will share the complete engineering details on my website.