If you work in science, chances are you have encountered a density that you can only evaluate up to a constant factor. To sample from such a distribution, well-studied methods exist, such as Markov chain Monte Carlo or rejection sampling. You can also use importance sampling to estimate properties of the target distribution, such as its expectation.
In this post we will use normalizing flows (which I described in a previous post) to fit the target density. Normalizing flows are particularly powerful because, once trained, they allow both sampling from the learned density and evaluating the density of new data points.
In particular, we will implement the paper Variational Inference with Normalizing Flows in about 100 lines of code.
We will focus on the section of the paper where the authors fit unnormalized densities. For that, we will use planar flows, defined as

$$f(z) = z + u \, h(w^\top z + b),$$

where h is a smooth element-wise non-linearity (tanh in the paper), and for which the determinant of the Jacobian can be computed in O(d) (where d is the dimension of the target density):

$$\left| \det \frac{\partial f}{\partial z} \right| = \left| 1 + u^\top \psi(z) \right|, \qquad \psi(z) = h'(w^\top z + b)\, w.$$
The variable z is some noise with the same dimension as the target density, and whose base density p(z) should be easy to evaluate. The function f therefore defines a bijective mapping between the noise and the target data. Once the tunable parameters u, w and b have been learned, the transformation allows sampling new data points from the learned density as y = f(z), z ~ p(z), as well as evaluating the density of the sampled points with the change of variable theorem:

$$p(y) = p(z) \left| \det \frac{\partial f}{\partial z} \right|^{-1}.$$
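As a minimal sketch of the pieces above, here is a NumPy implementation of the planar transformation together with the log-determinant of its Jacobian (the function name and shapes are my own choices, not from the paper):

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Planar flow f(z) = z + u * tanh(w.z + b) and log |det df/dz|.

    Shapes: z is (n, d), u and w are (d,), b is a scalar.
    """
    lin = z @ w + b                               # (n,)
    f = z + np.outer(np.tanh(lin), u)             # (n, d)
    # det df/dz = 1 + u . psi(z), with psi(z) = h'(w.z + b) * w
    psi = (1.0 - np.tanh(lin) ** 2)[:, None] * w  # (n, d)
    log_det = np.log(np.abs(1.0 + psi @ u))       # (n,)
    return f, log_det
```

Note that the log-determinant costs O(d) per sample: it only involves dot products, never a d-by-d matrix.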
The parameters u, w and b are trained by minimizing the reverse KL divergence between the density of the normalizing flow p(y) and the target density p*(y):

$$\mathrm{KL}\big(p \,\|\, p^*\big) = \mathbb{E}_{z \sim p(z)}\left[ \log p(z) - \log \left| \det \frac{\partial f}{\partial z} \right| - \log p^*(f(z)) \right] + \text{const},$$

where the constant absorbs the unknown normalizing factor of p*, so it does not affect the gradients.
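The objective above can be estimated by Monte Carlo from base samples. Here is a hedged sketch of that estimator (the function name and argument layout are my own; in the actual implementation you would compute these quantities with an autodiff framework to get gradients):

```python
import numpy as np

def reverse_kl_objective(log_p_z, log_det, log_p_target_unnorm):
    """Monte Carlo estimate of KL(p || p*) up to the unknown log-normalizer.

    log_p_z:             log density of the base samples under p(z), shape (n,)
    log_det:             log |det df/dz| at each base sample, shape (n,)
    log_p_target_unnorm: unnormalized log target evaluated at f(z), shape (n,)
    """
    # Change of variables: log p(f(z)) = log p(z) - log |det df/dz|
    log_p_y = log_p_z - log_det
    return np.mean(log_p_y - log_p_target_unnorm)
```

As a sanity check: if the flow is the identity (log_det = 0) and the unnormalized target is exactly the base density without its normalizing constant, the estimator returns minus the log-normalizer, i.e. the KL divergence itself is zero.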
To sum up, we can efficiently fit a normalizing flow, which by definition produces a normalized density, by minimizing the KL divergence with an unnormalized target density. After training, we can efficiently sample new data points from the learned density.
There is one important point to mention before starting the implementation. If we go back to the definition of f, the transformation is not always bijective. The tunable parameters u need to be constrained to ensure invertibility. Fortunately, there is an efficient way to do that, and it is explained in the paper.
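Concretely, for h = tanh the flow is invertible whenever wᵀu ≥ −1, and the paper's appendix gives a reparameterization that enforces this. A minimal sketch, following my reading of that construction:

```python
import numpy as np

def constrained_u(u, w):
    """Map an unconstrained u to u_hat with w . u_hat >= -1.

    Sketch of the construction in the paper's appendix:
    u_hat = u + (m(w.u) - w.u) * w / ||w||^2, with m(x) = -1 + log(1 + e^x),
    i.e. m is softplus shifted down by one, so m(x) > -1 for every x.
    """
    wu = w @ u
    m = -1.0 + np.log1p(np.exp(wu))
    return u + (m - wu) * w / (w @ w)
```

During training one optimizes the unconstrained u and applies `constrained_u` before every forward pass, so the flow stays bijective throughout.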
Finally, one pitfall of planar flows is that there is no analytical way to invert f. This means that if we observe a new data point y, we cannot efficiently evaluate its density using the change of variable theorem:

$$p(y) = p\big(f^{-1}(y)\big) \left| \det \frac{\partial f^{-1}}{\partial y} \right|.$$
Unlike planar flows, more recent normalizing flow architectures allow computing both f and its inverse efficiently, which is a powerful asset: once trained, you can both sample from the learned density and evaluate the density of any newly observed data point.