*GOAL: Finding the true training distribution!*

In this post, we'll discuss latent-based models for image generation. The objective is to generate *natural, realistic and high-quality* images using a latent-based generative model.

This blog post is meant to give a high-level explanation of the different latent models used for image generation. I recommend reading the papers for more details.

**Latent based models**

Models which learn to represent high-dimensional data in a low-dimensional space, i.e. a latent space (Z), using a **bottleneck architecture** are called *latent models*. Models like the **autoencoder and variational autoencoder** are the pioneers of latent-based models.

**Autoencoder**

It consists of an **encoder** and a **decoder** network. The encoder network takes an input X and learns to represent the data in a lower-dimensional space Z; Z is then passed to the decoder to reconstruct an approximation of the input X. Z is a vector, and in the image below the MNIST input is transformed into a point in two-dimensional space, which allows the decoder network to recover the MNIST image. The network is optimized using **mean squared error**, called the **reconstruction loss**.
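The bottleneck idea can be sketched with plain NumPy (a minimal, linear, untrained sketch; the dimensions and random weights are illustrative assumptions, not the actual architecture from the figure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: a 784-d flattened MNIST image squeezed to a 2-d latent.
input_dim, latent_dim = 784, 2

# Randomly initialized linear encoder/decoder weights (a real autoencoder
# would stack non-linear layers and train them with gradient descent).
W_enc = rng.normal(0, 0.01, (latent_dim, input_dim))
W_dec = rng.normal(0, 0.01, (input_dim, latent_dim))

def encode(x):
    # Bottleneck: project the input down to the latent vector Z.
    return W_enc @ x

def decode(z):
    # Reconstruct an approximation of the input from Z.
    return W_dec @ z

x = rng.random(input_dim)      # stand-in for a flattened MNIST image
x_hat = decode(encode(x))

# Reconstruction loss: mean squared error between input and reconstruction.
reconstruction_loss = np.mean((x - x_hat) ** 2)
print(reconstruction_loss)
```

Training simply pushes `reconstruction_loss` down, which forces the 2-d Z to keep whatever information is needed to redraw the image.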

**Variational Autoencoder**

The VAE has a similar architecture to the autoencoder; the only difference is how the encoder output is mapped to a probability distribution (normal). While the encoder in an autoencoder generates a vector Z directly, the VAE encoder outputs a **mean and standard deviation (std)**. We create a **normal distribution** from this mean and std and use it to generate a **sample vector Z**, which is passed to the decoder network. Two loss functions are optimized in a VAE: the first is the **reconstruction loss** and the second is the **KL divergence loss**, which **minimizes the difference between two probability distributions**. For the KL divergence, we compare the encoder's distribution with a standard normal distribution with mean = 0 and std = 1. VAEs are trained to mimic the training distribution, so any bias in the training set makes the VAE generate similarly biased images.

*The estimated mean and standard deviation are optimized w.r.t. the standard normal distribution with mean = 0 and std = 1.*
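Concretely, the sampling step and the KL term can be sketched as follows (a NumPy sketch with made-up encoder outputs; in a real VAE, `mu` and `log_var` would be predicted by the encoder network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the encoder predicted these for a 2-d latent space.
mu = np.array([0.5, -0.3])        # mean
log_var = np.array([-0.2, 0.1])   # log of the variance
std = np.exp(0.5 * log_var)       # std = exp(log_var / 2)

# Reparameterization trick: sample Z = mu + std * eps with eps ~ N(0, 1),
# so gradients can flow back through mu and std.
eps = rng.standard_normal(mu.shape)
z = mu + std * eps                # this sample is passed to the decoder

# Closed-form KL divergence between N(mu, std^2) and the standard
# normal N(0, 1), summed over the latent dimensions.
kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))

# Total VAE loss = reconstruction loss + kl (reconstruction term omitted here).
print(z, kl)
```

The KL term is what pulls the encoder's distribution toward mean = 0 and std = 1, as in the caption above.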

**NVAE: A Deep Hierarchical Variational Autoencoder**

It utilizes a **deep hierarchical VAE** architecture for image generation with **depthwise separable convolutions**. It uses a **ResNet**-style **residual parameterization** for finding the distribution parameters, and **batch normalization** and **spectral regularization** for training stabilization. The concepts used in NVAE are not new, but a lot of creative thinking and careful engineering has been put into building this network. On residual parameterization: recall that ResNet's key idea of the **skip connection** was to overcome the **degradation problem** and avoid the vanishing-gradient issue. A VAE suffers from instability as the hierarchy of the network becomes large, which is why the residual network comes into play. Note that a regular convolution is performed in the encoder network. The objective function remains the same as in the VAE.
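As I understand the paper, the residual parameterization lets the encoder predict the approximate posterior of each latent group as a small correction on top of the prior, rather than predicting it from scratch. A simplified one-dimensional NumPy sketch (the variable names and values are mine, chosen for illustration):

```python
import numpy as np

# Prior parameters for one latent group, produced by the top-down model.
mu_p, sigma_p = 0.2, 1.5

# The encoder predicts only a relative shift and scale (the "residual"),
# not the absolute posterior parameters.
delta_mu, delta_sigma = -0.1, 0.8

# Posterior = prior adjusted by the residual correction. When the residuals
# are (0, 1), the posterior collapses back onto the prior, which keeps
# deep hierarchies numerically stable.
mu_q = mu_p + delta_mu
sigma_q = sigma_p * delta_sigma
print(mu_q, sigma_q)
```

Keeping the posterior close to the prior by default is what tames the KL term as the number of hierarchy levels grows.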

Similar to other latent models, NVAE has a **bidirectional encoder and a generative model which plays the role of the decoder**. The encoder consists of a *hierarchical group of ResNet blocks*, generating a sequence of latent samples **{z_1, z_2, ..., z_L}**, which are tied together using feature combination (concatenation). The encoder's goal is to generate an approximate distribution of the Z's from the X's, i.e. **q(z|x)**.

*The weights are shared bottom-up* to infer the latent variables Z's. In the decoder, **the top-down model takes a sample from each group of Z's, combines it with the deterministic feature maps learned from the ResNet blocks, and passes the result to the next group of Z's.** The weights are shared across each hierarchy group.
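The top-down pass described above can be sketched roughly as follows (a toy NumPy sketch with hypothetical shapes; the real model uses ResNet blocks, learned priors per group, and concatenation rather than this simple addition):

```python
import numpy as np

rng = np.random.default_rng(0)
feature_dim, num_groups = 8, 3

# Deterministic features carried along the top-down path.
h = np.zeros(feature_dim)

samples = []
for level in range(num_groups):
    # Sample this group's latent (here from a fixed standard normal;
    # NVAE learns the prior of each group from the features h).
    z = rng.standard_normal(feature_dim)
    samples.append(z)
    # Combine the sample with the deterministic feature map and pass
    # the result on to the next group (stand-in for a ResNet block).
    h = h + z

# 'samples' now holds {z_1, ..., z_L}; the final h would be decoded to an image.
print(len(samples), h.shape)
```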

The success of NVAE is based on the engineering aspects of its neural network architecture. NVAE is the first non-autoregressive model that competes with autoregressive models (the current SOTA) for high-quality image generation.

I will share the complete engineering details on my website.

**Reference**