Oftentimes in machine learning, we encounter scenarios that require us to estimate the probability distribution of the data samples we have. These estimated densities can be used for a variety of purposes, such as generating new data, anomaly/novelty detection, estimating the probability of a given data point, and conditional sampling from the distribution.
Probabilistic Generative Models (PGMs) are models that try to estimate the distribution/density of given data, and Normalizing Flows are a class of PGMs built on invertible transformations. These models generally allow efficient sampling and density evaluation from the distribution they model and, most importantly, they are straightforward to train.
In this first part of a two-part article, we will explore and understand the concepts and mathematical aspects of normalizing flows. In part 2, we will look at an application of a normalizing flow-based model that can be used for speech synthesis.
Before moving on to normalizing flows, let’s take a quick look at the change of variables formula, since it forms their mathematical basis.
The change of variables formula describes how to evaluate the probability density of a random variable that is obtained by applying a deterministic, invertible transformation to another random variable. Let Z and X be random variables related by an invertible transformation/mapping f such that:
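X = f(Z), \quad Z = f^{-1}(X)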
X is the random variable obtained by applying the transformation f to Z (whose probability density is known a priori). Now, by the change of variables formula, the density of X is given by:
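p_X(x) = p_Z\big(f^{-1}(x)\big) \left| \frac{\partial f^{-1}(x)}{\partial x} \right|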
For the multivariate case, the partial derivative in the above equation is replaced by the determinant of the Jacobian of the inverse transformation:
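p_X(\mathbf{x}) = p_Z\big(f^{-1}(\mathbf{x})\big) \left| \det \frac{\partial f^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right|

As a quick sanity check, here is a minimal sketch (not part of the original article) that verifies the formula numerically for an invertible affine map applied to a standard Gaussian; the matrix A, the shift b, and the evaluation point are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Invertible affine transformation f(z) = A z + b applied to Z ~ N(0, I).
A = np.array([[2.0, 0.5],
              [0.0, 1.5]])
b = np.array([1.0, -1.0])

x = np.array([0.3, 0.7])          # a point at which to evaluate p_X
z = np.linalg.solve(A, x - b)     # f^{-1}(x)

# Change of variables: p_X(x) = p_Z(f^{-1}(x)) * |det J_{f^{-1}}(x)|.
# For an affine map, J_{f^{-1}} = A^{-1}, so |det J_{f^{-1}}| = 1 / |det A|.
p_x_flow = multivariate_normal(mean=np.zeros(2)).pdf(z) / abs(np.linalg.det(A))

# The same density obtained directly: X = A Z + b is Gaussian with mean b
# and covariance A A^T, so the two values should agree.
p_x_direct = multivariate_normal(mean=b, cov=A @ A.T).pdf(x)
print(p_x_flow, p_x_direct)
```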
A normalizing flow transforms one distribution into another by applying a sequence of invertible transformation functions. This is illustrated in the figure below:
Here, the random variable Z passes through a sequence of K invertible transformations, resulting in another distribution (call the resulting random variable X). Hence, applying the change of variables formula to each transformation in turn, the likelihood of X (written in log form) is given by the equation below:
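Writing z_0 for a sample from the base density p_Z, z_i = f_i(z_{i-1}) for i = 1, \dots, K, and x = z_K for the final output:

\log p_X(x) = \log p_Z(z_0) - \sum_{i=1}^{K} \log \left| \det \frac{\partial f_i(z_{i-1})}{\partial z_{i-1}} \right|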
Now, why is it called a normalizing flow? Here’s the interpretation:
Normalizing: this means that after applying an invertible transformation, the change of variables formula gives us a properly normalized (valid) density.
Flow: this refers to the fact that the distribution ‘flows’ from one form to another through the sequence of transformations; since invertible transformations compose, we can chain simple transformations to build an overall complex invertible transformation.
Normalizing flow-based models, unlike variational autoencoders (which only provide a lower bound on the likelihood), allow exact and tractable likelihood evaluation, a property they share with autoregressive models.
Now comes the important question: How is this concept useful?
When we learn a sequence of parameterized invertible transformations that map our data samples to a known distribution (such as a standard Gaussian) by maximizing the likelihood given by the change of variables formula, we can generate new data by applying the inverse transformations.
This is possible because all our transformations are invertible! If we sample points from the Gaussian distribution and pass them through the inverse transformations, we obtain points that look as if they were sampled from the distribution of our original data.
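To make this concrete, below is a minimal sketch (not from the original article) of this train-then-invert recipe in PyTorch: a small stack of RealNVP-style affine coupling layers is fit to toy 2-D data by maximizing the change-of-variables log-likelihood, and new points are then generated by pushing Gaussian samples through the inverse transformations. All class names, layer counts, and hyperparameters are illustrative choices.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style coupling layer: keeps one half of the input fixed and applies
    an affine transform to the other half, conditioned on the fixed half."""
    def __init__(self, dim=2, hidden=64, flip=False):
        super().__init__()
        self.flip = flip
        half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * half),      # predicts log-scale s and shift t
        )

    def forward(self, x):
        # Normalizing direction (data -> latent); also returns log|det Jacobian|.
        x1, x2 = x.chunk(2, dim=-1)
        if self.flip:
            x1, x2 = x2, x1
        s, t = self.net(x1).chunk(2, dim=-1)
        s = torch.tanh(s)                     # keep the scales well-behaved
        y2 = x2 * torch.exp(s) + t
        y = torch.cat([y2, x1] if self.flip else [x1, y2], dim=-1)
        return y, s.sum(dim=-1)

    def inverse(self, y):
        # Generative direction (latent -> data); exact because the layer is invertible.
        y1, y2 = y.chunk(2, dim=-1)
        if self.flip:
            y1, y2 = y2, y1
        s, t = self.net(y1).chunk(2, dim=-1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)
        return torch.cat([x2, y1] if self.flip else [y1, x2], dim=-1)

class Flow(nn.Module):
    """A composition of K coupling layers with a standard Gaussian base density."""
    def __init__(self, dim=2, n_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            [AffineCoupling(dim=dim, flip=(i % 2 == 1)) for i in range(n_layers)]
        )
        self.base = torch.distributions.MultivariateNormal(torch.zeros(dim), torch.eye(dim))

    def log_prob(self, x):
        # Change of variables: log p_X(x) = log p_Z(z) + sum_i log|det J_i|.
        log_det_total = x.new_zeros(x.shape[0])
        z = x
        for layer in self.layers:
            z, log_det = layer(z)
            log_det_total = log_det_total + log_det
        return self.base.log_prob(z) + log_det_total

    @torch.no_grad()
    def sample(self, n):
        # Sample the base Gaussian, then apply the inverses in reverse order.
        z = self.base.sample((n,))
        for layer in reversed(self.layers):
            z = layer.inverse(z)
        return z

# Toy 2-D "dataset": a mixture of two Gaussian blobs (a stand-in for real data).
def make_data(n=4096):
    comp = torch.randint(0, 2, (n, 1)).float()
    means = torch.tensor([[-2.0, 0.0], [2.0, 0.0]])
    return comp * means[1] + (1 - comp) * means[0] + 0.3 * torch.randn(n, 2)

flow, data = Flow(), make_data()
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
for step in range(2000):
    batch = data[torch.randint(0, data.shape[0], (256,))]
    loss = -flow.log_prob(batch).mean()       # maximize the likelihood
    opt.zero_grad(); loss.backward(); opt.step()

new_samples = flow.sample(1000)               # new data via the inverse transformations
```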
In Part 2 of this article, we shall look at an example and explain how a normalizing flow-based model can be used to synthesize speech by exploiting the inverse transformation.
Stay tuned, Cheers!