They leverage transformers’ attention mechanism inside the powerful StyleGAN2 architecture to make it even more powerful!
Watch the video:
Last week we looked at DALL-E, OpenAI’s most recent paper.
It uses an architecture similar to GPT-3, involving transformers, to generate an image from text. This is a super interesting and complex task called text-to-image translation. As you can see in the video below, the results were surprisingly good compared to previous state-of-the-art techniques. This is mainly due to the use of transformers and a large amount of data.
This week we will look at a very similar task called visual generative modelling, where the goal is to generate a complete scene in high resolution, such as a road or a room, rather than a single face or a specific object. This is different from DALL-E since we are not generating the scene from text but from a model trained on a specific style of scenes, which is a bedroom in this case.
Rather, it is just like StyleGAN, which can generate unique, non-existent human faces after being trained on datasets of real faces.
The difference is that StyleGAN uses this GAN architecture in a traditional generative-and-discriminative way with convolutional neural networks. A classic GAN architecture has a generator trained to generate the image and a discriminator used to measure the quality of the generated images by guessing whether each one is a real image coming from the dataset or a fake image produced by the first network. Both networks are typically composed of convolutional neural networks.
The generator looks like this: it mainly downsamples the image using convolutions to encode it, and then upsamples the image again using convolutions to generate a new “version” of the image with the same style based on that encoding, which is why it is called StyleGAN. The discriminator then takes this generated image, or an image from your dataset, and tries to figure out whether it is real or generated, the latter being called “fake”.
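To make this classic setup a bit more concrete, here is a minimal, hypothetical PyTorch sketch of such a convolutional generator (encoder-decoder style, as described above) and discriminator. It is not the actual StyleGAN2 code, only an illustration of the two roles; the class names and layer sizes are my own.

```python
import torch.nn as nn

class ConvGenerator(nn.Module):
    """Toy encoder-decoder generator: downsample to an encoding, then upsample back."""
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        # Encoder: convolutions that downsample the image into a compact encoding.
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden * 2, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: transposed convolutions that upsample back to a full image.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(hidden * 2, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(hidden, channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

class ConvDiscriminator(nn.Module):
    """Toy discriminator: convolutions ending in a single real/fake score."""
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, hidden * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(hidden * 2, 1),
        )

    def forward(self, x):
        # Higher score means "looks real", lower means "looks fake".
        return self.net(x)
```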
Instead, they leverage transformers’ attention mechanism inside the powerful StyleGAN2 architecture to make it even more powerful. Attention is an essential feature of this network, allowing it to draw global dependencies between input and output, in this case between the input at the current step of the architecture and the latent code encoded earlier, as we will see in a minute.
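As a quick refresher on what “attention” computes here, this is a minimal sketch of scaled dot-product attention, the core operation of transformers. The function and variable names are mine, not the paper’s.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(queries, keys, values):
    """Each query attends to every key, so information can flow globally.

    queries: (batch, n_q, d), keys/values: (batch, n_kv, d)
    """
    d = queries.size(-1)
    # Similarity between every query and every key, scaled for numerical stability.
    scores = queries @ keys.transpose(-2, -1) / d ** 0.5
    weights = F.softmax(scores, dim=-1)  # How strongly each query looks at each key.
    return weights @ values              # Weighted mix of the values for each query.
```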
Before diving into it, if you are not familiar with transformers or attention, I suggest you watch the video I made about transformers.
For more details and a better understanding of attention, you should have a look at the ‘Attention Is All You Need’ video from a fellow YouTuber and inspiration of mine, Yannic Kilcher, covering this amazing paper.
Alright, so we know that they used transformers and GANs together to generate better and more realistic scenes, which explains the name of this paper, GANsformer. But why and how did they do that exactly?
As for the why, they did that to generate complex and realistic scenes like this one automatically. This could be a powerful application for many industries like movies or video games, requiring far less time and effort than having an artist create them on a computer, or even build them in real life to take a picture of. Also, imagine how useful it could be for designers when coupled with text-to-image translation: generating many different scenes from a single text input and the press of a random button!
They used the state-of-the-art StyleGAN architecture because GANs are powerful generators of the overall image. Because GANs work using convolutional neural networks, they by nature use local information about the pixels, progressively merging it to end up with the general information regarding the image, and for the same reason they miss out on the long-range interactions between far-away pixels. This makes GANs powerful generators for the overall style of the image, but much less powerful regarding the quality of the small details in the generated image, since they are unable to control the style of localized regions within the generated image itself.
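To see why convolutions alone struggle with those long-range interactions, here is a tiny back-of-the-envelope illustration (my own, not from the paper): the receptive field of a stack of stride-1 convolutions grows only linearly with depth, so distant pixels never directly interact.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of a stack of stride-1 convolutions: grows only linearly."""
    rf = 1
    for _ in range(num_layers):
        rf += kernel_size - 1
    return rf

# Even ten stacked 3x3 convolutions only "see" a 21x21 pixel neighbourhood,
# so pixels far apart in the image cannot directly influence each other.
print(receptive_field(10))  # 21
```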
This is why they had the idea to combine transformers and GANs in an architecture they called the “Bipartite Transformer”.
As GPT-3 and many other papers have already proved, transformers are powerful for long-range interactions, drawing dependencies between distant elements and understanding the context of text or images.
We can say that they simply added attention layers, the building block of transformer networks, in between the convolutional layers of both the generator and the discriminator. Thus, rather than controlling all features globally, as the convolution-based StyleGAN does by nature, they use this attention to propagate information from the local pixels to the global high-level representation and vice versa. Like other transformers applied to images, this attention layer takes the pixels’ positions and the StyleGAN2 latent spaces W and Z. The latent space W is an encoding of the input into an intermediate latent space done at the beginning of the network, denoted here as A, while the encoding Z is simply the resulting features of the input at the current step in the network.
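Here is a rough sketch of what inserting such an attention layer between convolutions could look like: the feature-map pixels at the current step attend to a small set of global latent vectors. This is my own simplification for illustration, not the authors’ implementation, and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Simplified bipartite-style attention: feature-map pixels attend to latent vectors.

    Meant to sit between two convolutional layers of a generator, so that local pixel
    features can exchange information with a global latent representation.
    """
    def __init__(self, feat_dim, latent_dim, attn_dim=64):
        super().__init__()
        self.to_q = nn.Linear(feat_dim, attn_dim)    # queries from the pixel features
        self.to_k = nn.Linear(latent_dim, attn_dim)  # keys from the latent components
        self.to_v = nn.Linear(latent_dim, feat_dim)  # values projected back to feature size

    def forward(self, features, latents):
        # features: (batch, channels, height, width), latents: (batch, k, latent_dim)
        b, c, h, w = features.shape
        pixels = features.flatten(2).transpose(1, 2)  # (batch, h*w, channels)
        q, k, v = self.to_q(pixels), self.to_k(latents), self.to_v(latents)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        updated = pixels + attn @ v                   # propagate global info into each pixel
        return updated.transpose(1, 2).reshape(b, c, h, w)
```

In the actual bipartite transformer, information also flows in the other direction, from the pixels back to the latents, which is what the “and vice versa” above refers to.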
This makes the generation much more “expressive” over the whole image, especially in generating images depicting multi-object scenes.
Of course, this was just an overview of this new paper by Facebook AI Research and Stanford University. I strongly recommend reading the paper to have a better understanding of this approach. It is the first link in the references below. The code is also available and will be linked in the references as well.