OpenAI has successfully trained a network that can generate images from text captions. It is very similar to GPT-3 and Image GPT, and it produces amazing results.
DALL-E is a new neural network developed by OpenAI based on GPT-3.
In fact, it’s a smaller version of GPT-3, using 12 billion parameters instead of 175 billion. But it has been specifically trained to generate images from text descriptions, using a dataset of text-image pairs rather than the very broad dataset GPT-3 was trained on. It can create images from natural-language captions, just like GPT-3 creates websites and stories.
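To make that a bit more concrete, here is a minimal sketch of how such a text-image pair can be turned into a single training sequence for a GPT-like model. The sizes (256 text tokens, a 32x32 grid of 1,024 image tokens, an 8,192-entry image codebook) follow those reported for DALL-E, but the helper function and the token ids are made up for illustration; this is not OpenAI’s code.

```python
# A DALL-E-style model turns the caption and the image into discrete
# tokens and concatenates them into ONE sequence, which a GPT-like
# transformer then models with plain next-token prediction.

TEXT_VOCAB = 16384       # BPE vocabulary size for captions (reported value)
IMAGE_VOCAB = 8192       # discrete VAE codebook size (reported value)
MAX_TEXT_TOKENS = 256    # captions are capped at 256 BPE tokens
IMAGE_TOKENS = 32 * 32   # the dVAE compresses a 256x256 image to a 32x32 grid

def build_sequence(text_tokens, image_tokens):
    """Concatenate caption and image tokens into one stream.
    Image ids are shifted past the text vocabulary so both modalities
    can share a single token space (an assumption of this sketch)."""
    assert len(text_tokens) <= MAX_TEXT_TOKENS
    assert len(image_tokens) == IMAGE_TOKENS
    padded_text = text_tokens + [0] * (MAX_TEXT_TOKENS - len(text_tokens))
    shifted_image = [TEXT_VOCAB + t for t in image_tokens]
    return padded_text + shifted_image   # 256 + 1024 = 1280 tokens total

# Made-up ids, just to show the shape of things:
caption_ids = [17, 942, 3001]                               # a caption after BPE
image_ids = [i % IMAGE_VOCAB for i in range(IMAGE_TOKENS)]  # stand-in dVAE codes
print(len(build_sequence(caption_ids, image_ids)))          # -> 1280
```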
It’s a continuation of Image GPT and GPT-3, both of which I covered in previous videos if you haven’t watched them yet.
DALL-E is very similar to GPT-3 in that it is also a transformer language model. It receives the text and the image as a single stream of tokens and generates its output one token at a time, which lets it produce images in many forms. It can edit attributes of specific objects in images, as you can see here, or even control multiple objects and their attributes at the same time.

This is a very complicated task, since the network has to understand the relationships between the objects and create an image based on that understanding. Just take this example, where we feed the network “an emoji of a baby penguin wearing a blue hat, red gloves, green shirt, and yellow pants”. All of these components need to be understood: the objects, the colors, and even the positions of the objects. Meaning that the gloves need to be both red and on the penguin’s hands, and the same goes for the rest. The results are very impressive considering the complexity of the task.
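Because it really is just a language model, the image comes out one token at a time: the caption tokens go in first, and the model then samples the image tokens autoregressively before the discrete VAE decodes them back to pixels. Below is a hedged sketch of that decoding loop; `model` is a stand-in callable, not a real API, and the temperature value is an arbitrary example.

```python
import random

def sample_image_tokens(model, text_tokens, n_image_tokens=1024, temperature=0.85):
    """Hypothetical autoregressive decoding loop for a DALL-E-style model.
    `model` is assumed to map the sequence so far to a probability
    distribution over the next image token; it is a stand-in, not a real API."""
    seq = list(text_tokens)
    image_tokens = []
    for _ in range(n_image_tokens):
        probs = model(seq)
        # Temperature sampling on probabilities: p ** (1/T), renormalized,
        # is equivalent to softmax(logits / T). It keeps diversity between samples.
        weights = [p ** (1.0 / temperature) for p in probs]
        token = random.choices(range(len(weights)), weights=weights)[0]
        seq.append(token)
        image_tokens.append(token)
    return image_tokens  # a dVAE decoder would turn these back into pixels

# Toy stand-in model (uniform over a 16-entry codebook), just to run the loop:
toy_model = lambda seq: [1 / 16] * 16
print(sample_image_tokens(toy_model, [1, 2, 3], n_image_tokens=4))
```

One practical detail worth knowing: the samples OpenAI published were selected by generating many candidates per caption and reranking them with CLIP, a separate model that scores how well an image matches a text.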
Here’s another, simpler example, where we fed “a small red block sitting on a large green block” to the network. Now it only needs to know that there are two blocks, their colors, and that one is smaller than the other. This seems very simple to us, but it requires a really high level of understanding to achieve. It is still not perfect, as you can see, but we are getting pretty close!
DALL-E is also able to change the viewpoint of a scene. For example, here we sent “an extreme close-up view of an eagle on a mountain” and these are the results.
Here, we simply swapped the eagle for a fox, and this is what gets generated.
Of course, a simple caption can produce an infinite number of plausible images; nobody knows exactly what you have in mind when you think of a “painting of a fox sitting in a field at sunrise”. There are many variables, such as the fox itself, its colors, where it is looking, and its pose, and we are not even talking about the background or the style of the painting. Fortunately, since it is very similar to GPT-3, we can add details to the input text and generate something much closer to what we expected, as you can see here with different styles of paintings.
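In code terms, this kind of prompt refinement is nothing more than string manipulation on the caption. Here is a toy illustration; the suffixes are arbitrary examples, and the generation call itself is omitted since DALL-E has no public API:

```python
# Appending details to the caption narrows down the space of plausible images.
base = "a painting of a fox sitting in a field at sunrise"
style_suffixes = [
    "",                                # leave the style up to the model
    " in the style of claude monet",
    " as a pencil sketch",
    " in the style of pixel art",
]
for prompt in (base + s for s in style_suffixes):
    print(prompt)  # each variant would be fed to the model as-is
```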
It can also combine objects that are unrelated to each other, like creating a realistic avocado chair, or generate original, never-before-seen illustrations, like a new emoji.