It used to be super challenging for a computer to automatically generate images based on some text descriptions. Some examples are
- A capybara sitting in a field at sunrise.
- An armchair in the shape of an avocado.
- The exact same teapot on the top with ‘gpt’ written on it on the bottom.
Last week, OpenAI shared a deep neural network, DALL·E, which takes a text description like these as input and automatically generates images that match it. In this post, we will take a look at the model and some of the awesome images it generates.
DALL·E is able to handle a wide variety of prompts: attributes of objects, compositions of multiple objects, and even prompts that require geographical knowledge or visual reasoning.
Attributes of objects
DALL·E understands the attributes of an object, such as color, shape, and texture. For example, if we give the model the text input
a pentagonal green clock, the model generates the following figure. The model understands what a clock is, as well as the shape and color requested.
The following figure is generated for the text
a cube made of porcupine
Composition of different objects
DALL·E is able to generate images containing more than one object, and it even understands the spatial relations between them. The image generated for the text
A small red block sitting on top of a large green block is shown below.
Geographical knowledge
DALL·E has also picked up geographical knowledge. It understands terms with specific real-world referents, for instance Chinese food and San Francisco's Golden Gate Bridge. Generated images are shown below.
Visual reasoning
What makes the model even cooler is that it can reason while generating an image. One example is
The exact same teapot on the top with gpt written on it on the bottom. The model needs to understand that it should generate two teapots, that the two teapots should be identical and arranged vertically (one on top and one on the bottom, not left/right), and that the bottom teapot should have gpt written on it. The following image is generated.
The model
DALL·E is a decoder-only transformer. Its input is a sequence of 1,280 tokens: 256 for the text and 1,024 for the image. (Because the image is represented by only 1,024 tokens, fine detail can be lost; if you look closely at the Golden Gate Bridge image, it is a bit blurry.) The training objective is the likelihood of generating those tokens, one after another. The blog post doesn't share many details of the model, so I will write a follow-up when OpenAI publishes a paper.
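To make the token setup concrete, here is a minimal sketch of the sequence layout and the autoregressive likelihood objective. Only the 256/1,024 split comes from the blog post; the 32×32 image grid and the toy probabilities are my own assumptions for illustration, not details OpenAI has confirmed.

```python
import math

# Token layout, per the numbers in OpenAI's post:
TEXT_TOKENS = 256    # the text description, as discrete tokens
IMAGE_TOKENS = 1024  # the image, assumed here to be a 32x32 grid of discrete codes
SEQ_LEN = TEXT_TOKENS + IMAGE_TOKENS  # the 1,280-token input sequence

def autoregressive_nll(probs):
    """Negative log-likelihood of a token sequence, where probs[t] is the
    probability the model assigns to the token actually observed at
    position t given all earlier tokens (the chain rule of probability).
    Training minimizes this, i.e. maximizes the sequence likelihood."""
    return -sum(math.log(p) for p in probs)

# Toy example: a (hypothetical) model that assigns probability 0.5
# to every observed token. The loss is then SEQ_LEN * log(2).
toy_loss = autoregressive_nll([0.5] * SEQ_LEN)
```

Generation works the same way in reverse: the model is fed the 256 text tokens and then samples the 1,024 image tokens one at a time, each conditioned on everything before it.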