TL;DR: Researchers combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically-guided, high-quality image synthesis.
If the title and subtitle sound like another language to you, this article was made for you!
Image-GPT
You’ve probably heard of iGPT, or Image-GPT, recently published by OpenAI and covered on my channel. It is a state-of-the-art generative transformer model for images. OpenAI applied the transformer architecture to a pixel representation of images to perform image synthesis. In short, they feed the transformer half the pixels of an image as input and have it generate the other half. As you can see here, it is extremely powerful.
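To make that concrete, here is a toy sketch of iGPT-style pixel-by-pixel completion. The `complete_image` function and the `model` interface are hypothetical stand-ins for illustration, not OpenAI's code:

```python
import torch

def complete_image(model, top_half, total_len=32 * 32):
    """Greedy pixel-by-pixel image completion, iGPT-style (toy sketch).

    `model` is a hypothetical GPT-like network that maps a sequence of
    pixel tokens to next-token logits; `top_half` holds the flattened,
    already-known pixels as integers in [0, 255].
    """
    seq = top_half.clone()                      # 1D tensor of pixel tokens
    while seq.numel() < total_len:
        logits = model(seq.unsqueeze(0))        # (1, len, 256) next-pixel scores
        next_pixel = logits[0, -1].argmax()     # most likely next intensity
        seq = torch.cat([seq, next_pixel.view(1)])
    return seq.view(32, 32)                     # reshape back into an image
```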
However, as you know, images and videos now come in high-resolution 4K. And do you know how many pixels there are in a single 4K image?
About 8.3 million: a 3840 x 2160 frame, or roughly 25 million values once you count the three color channels. That is an extremely long sequence compared with a simple phrase or paragraph in natural language processing applications. Because transformers are designed to learn long-range interactions on sequential data, which in this case means attending over all the pixels sequentially, their approach is excessively demanding in computation and does not scale beyond 192 x 192 image resolutions.
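To see why this explodes, here is a quick back-of-the-envelope calculation: full self-attention compares every token with every other token, so its cost grows with the square of the sequence length.

```python
# One token per pixel: sequence length and pairwise attention comparisons.
def attention_pairs(height, width):
    n = height * width
    return n, n * n

print(attention_pairs(32, 32))       # (1024, 1048576) - an iGPT-scale image
print(attention_pairs(3840, 2160))   # (8294400, ~6.9e13) - a single 4K frame
```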
So transformers cannot be used with images since no one wants to generate a super low definition image, right?
Well, not really.
Researchers from Heidelberg University in Germany recently published a new paper combining the efficiency of convolutional approaches with the expressivity of transformers to produce semantically-guided synthesis of high-quality images. In short, they use a convolutional neural network to obtain a context-rich representation of the image, and then train a transformer on this representation instead of the actual image to synthesize a new image from it, allowing much higher resolutions than iGPT while preserving the quality of the resulting image. But we will come back to that in a minute with a better explanation.
If you are not familiar with CNNs or transformers, I strongly recommend watching the videos I made explaining them to get a better understanding of this approach.
This paper is called “Taming Transformers for High-Resolution Image Synthesis” and, as I said, it enables transformers to synthesize high-resolution images from semantic images, just like you can see here. The only information needed is an approximate semantic segmentation indicating what kind of environment you would like at each position in the image, and it outputs a complete high-definition image, filling the segments with realistic mountains, grass, sky, sunsets, etc.
Now, the question is, why are these researchers and OpenAI using a transformer instead of our typical GAN architectures for image synthesis?
Well, the advantages of using transformers for image generation are clear:
1. They continue to show state-of-the-art results on a wide variety of tasks and are extremely promising.
2. They contain none of the inductive bias found in CNNs, where two-dimensional filters cause the network to prioritize local interactions. This inductive bias is what makes CNNs so efficient, but it may be too restrictive to make the network “expressive”, or “original” (see the short sketch right after this list).
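To make this locality bias concrete, here is a minimal PyTorch illustration (my own example, not from the paper) contrasting a convolution's local receptive field with attention's global one:

```python
import torch
import torch.nn as nn

# A 3x3 convolution mixes each pixel only with its immediate neighbours,
# while self-attention lets every position interact with every other one.
x = torch.randn(1, 1, 8, 8)                        # tiny single-channel image
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
y = conv(x)            # y[..., i, j] depends only on the 3x3 patch around (i, j)

tokens = x.flatten(2).transpose(1, 2)              # (1, 64, 1) pixel tokens
attn = nn.MultiheadAttention(embed_dim=1, num_heads=1, batch_first=True)
z, _ = attn(tokens, tokens, tokens)                # every output sees all 64 inputs
```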
Now that we know that transformers are more “expressive” and very powerful, the only thing left is to find a way to make them more efficient. Indeed, their approach manages to combine the efficiency that CNNs gain from their inductive bias with the expressivity of transformers.
As I said, the convolutional network, which they call VQGAN, is composed of a classic encoder-decoder and an adversarial training procedure using a discriminator. It is used to generate an efficient and rich representation of the images in the form of a codebook. As the name suggests, it is a GAN architecture whose generator is trained to produce high-resolution images. If you are not familiar with how GANs work, you can watch the video I made explaining them.
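Here is a toy sketch of the vector-quantization step at the heart of VQGAN: each latent vector produced by the encoder is snapped to its nearest codebook entry. The sizes are illustrative assumptions, not the paper's configuration:

```python
import torch

codebook = torch.randn(1024, 256)     # 1024 learned entries, 256 dimensions each
z = torch.randn(16 * 16, 256)         # encoder output for a 16x16 latent grid

dists = torch.cdist(z, codebook)      # distance to every codebook entry
indices = dists.argmin(dim=1)         # one discrete index per latent position
z_q = codebook[indices]               # quantized latents, fed to the decoder
```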
Once this first training is done, they keep the encoder and the learned codebook: the encoder turns the input image into a grid of codebook entries, and this representation, rather than the raw pixels, is what the transformer receives as input. The transformer thus works on a composition of perceptually rich image constituents. Of course, this codebook representation is extremely compressed so that it can be read sequentially by the transformer, and the decoder is what later turns such a representation back into an actual image.
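To give a sense of this compression, here is a tiny sketch (again with assumed, illustrative sizes) of how the 2D grid of indices becomes the transformer's 1D input sequence:

```python
import torch

# Say the VQGAN encoder mapped a 256x256 image to a 16x16 grid of codebook
# indices. Read row by row, this gives the transformer a 256-token sequence
# instead of 65,536 raw pixels.
grid = torch.randint(0, 1024, (16, 16))   # stand-in for the encoder's output
seq = grid.flatten()                      # shape (256,): the transformer's input
```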
Then, using these index sequences as a training dataset for the transformer, it learns to predict the distribution of possible next indices, just like a regular autoregressive model. Meaning that it learns to predict each value in the sequence from all the values that precede it. This is how CNNs and GANs are combined with transformers to perform high-resolution image synthesis.
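In code, this autoregressive objective looks roughly like the following. The transformer's interface here is an assumption for illustration, not the authors' implementation:

```python
import torch.nn.functional as F

def train_step(transformer, seq, optimizer):
    """One next-index prediction step on a (batch, length) tensor of indices."""
    inputs, targets = seq[:, :-1], seq[:, 1:]    # predict token t+1 from 0..t
    logits = transformer(inputs)                 # (batch, length-1, codebook_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```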
Here, you can find the demo version of their code that you can try right now on Google Colab without having to set up anything. They already made the setup, and you just have to run a few lines. They download the code from GitHub and install the required dependencies automatically. Then, the notebook loads the model and imports a pre-trained version of it. Finally, you can use their segmented image as a test, or upload your own segmented image, and run a few more lines to encode the segmentation. Remember, this encoding step is necessary to build the codebook representation of your segmentation that the transformer conditions on. And you can finally produce this superb high-quality image!
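For a rough idea, the first cells do something along these lines (the exact commands and package list may differ from the authors' notebook, so treat this as a sketch):

```python
# Run inside a Colab notebook cell.
!git clone https://github.com/CompVis/taming-transformers
%cd taming-transformers
!pip install omegaconf pytorch-lightning einops   # dependencies the repo relies on
```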
Watch more results!
Conclusion
Of course, this was just an overview of this new paper. I strongly recommend reading it for a better technical understanding. Also, as I mentioned earlier, their code is available on GitHub with pre-trained models, so you can try it yourself and even improve it! All the links are in the references below.
If you like my work and want to stay up to date with AI technologies, you should definitely follow me on my social media channels.