

On January 5, 2021, OpenAI released DALL-E, a powerful neural network named after the artist Salvador Dalí and Pixar’s WALL-E.
With yet another AI, OpenAI surpassed the previous state of the art in generative models. If you have not been following what OpenAI is doing, this article will unravel one of their most advanced AIs. You will also discover how this kind of AI may enter your everyday life in the near future.
If you are not familiar with the latest advancements in AI, know that many companies have been trying for years to achieve what DALL-E is capable of. For most of that time, the technique OpenAI used to create DALL-E had not even been invented yet: an architecture called the transformer had to come along for DALL-E to become a reality. And if you are not comfortable with artificial intelligence terminology, do not worry. You do not have to understand everything in full detail just yet; things will become clearer as you keep reading, and I will also post many more tutorials and projects on machine learning and artificial intelligence that make these topics understandable one article at a time.
DALL-E is a state-of-the-art neural network created by OpenAI. It was released on January 5, 2021 along with another neural network called CLIP. If you have never interacted with an advanced AI, OpenAI lets you experiment with DALL-E on their website. You do not need to know how to write machine learning code just yet; you can interact with DALL-E through a simple user interface. If you are just starting out, or on the verge of starting out, with machine learning, you can see what cutting-edge AI is capable of and interact with it yourself.
With your input, DALL-E can generate realistic images of unrealistic scenarios. This ability comes from the most important recent technique in Natural Language Processing, or NLP for short: the transformer. The transformer first appeared in 2017 in the famous paper “Attention Is All You Need”, which took the world of NLP by storm. It introduced a neural network architecture built solely on an attention mechanism, replacing previous layer types in NLP such as RNNs and LSTMs. If you have been keeping up with trends in AI for some time, you may remember that RNNs and LSTMs used to dominate the NLP world.
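To give a feel for what that attention mechanism actually computes, here is a minimal sketch of scaled dot-product attention in PyTorch, the library DALL-E itself is built with. This is an illustrative simplification of the building block described in the paper, not OpenAI’s implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Minimal scaled dot-product attention, the core of the transformer.

    query, key, value: tensors of shape (batch, seq_len, d_model).
    Every output position is a weighted average of the value vectors,
    where the weights say how strongly that position "attends" to each
    other position in the sequence.
    """
    d_model = query.size(-1)
    # Pairwise similarity scores, scaled to keep the softmax well-behaved.
    scores = query @ key.transpose(-2, -1) / d_model ** 0.5
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ value

# Toy self-attention over a sequence of 5 tokens with 16-dim embeddings.
x = torch.randn(1, 5, 16)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 5, 16])
```

Stacking many layers of this (plus feed-forward layers) is, in essence, what lets transformers like GPT-3 and DALL-E model long sequences without RNNs or LSTMs.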
DALL-E is not the first transformer model to perform exceptionally well. You may have heard of the famous GPT-2, GPT-3, and Image GPT. If you haven’t, what you need to know right now is that GPT-2 was able to perform virtually any natural language task, and GPT-3 was essentially a supercharged version of it. GPT-3 could generate text so realistic that OpenAI did not release full access to the public. They also did not release the full neural network architecture and weights, as they had usually done before. Instead, they allowed individuals and companies to interact with the network through a private beta API, which lets some companies integrate GPT-3 into their products.
Currently, the GPT-3 API that OpenAI provides is used millions of times per day. GPT-3 essentially completes a given text by continuing its pattern. This means that if you ask it to summarize a text, it will return a shorter version of that text, and if you ask it to write code, it will return code that does what you asked for. One of the best parts is that you can interact with the API using natural language: if you want it to simplify a legal text, you can literally paste in the text and ask it to summarize it in plain English.
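To make this concrete, here is roughly what such a request looks like with OpenAI’s Python client. Treat it as a sketch: the engine name and parameters are plausible defaults from the beta-era openai package, the legal_clause string is a placeholder, and the exact interface depends on the access and library version you have.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; requires private beta access

legal_clause = "The party of the first part shall indemnify..."  # placeholder text

# GPT-3 continues the pattern: the prompt ends where the summary should begin.
response = openai.Completion.create(
    engine="davinci",
    prompt=f"Summarize the following clause in plain English:\n\n{legal_clause}\n\nSummary:",
    max_tokens=100,
    temperature=0.3,
)
print(response.choices[0].text.strip())
```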
Even though DALL-E uses an architecture very similar to GPT-3, it is a much smaller model. Given text, DALL-E generates realistic images, and this ability comes from its training on text–image pairs. However, contrary to the usual approach, DALL-E has never seen an image directly. The default approach in machine learning is to feed an image’s pixel values to the neural network after a normalization step. With DALL-E, the images are not fed into the network as raw pixel values; they are first converted into “tokens” and then fed into the network. In other words, the network never sees the actual images, only a compressed representation of them. Tokenization is already the default approach for working with text data, so in this case both the text and the image of each pair are tokenized and then fed into the neural network.
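To picture what “tokens instead of pixels” means, here is a purely conceptual sketch. The tokenizer and image encoder below are stand-ins that only mimic the shapes involved; OpenAI’s real system uses a BPE text tokenizer and a separately trained discrete VAE that compresses each image into a 32×32 grid of codebook indices.

```python
import torch

# Conceptual stand-ins -- not OpenAI's actual tokenizer or image encoder.
TEXT_VOCAB = 16384   # size of a BPE-style text vocabulary
IMAGE_VOCAB = 8192   # size of a discrete-VAE image codebook

def tokenize_text(caption, max_len=256):
    """Pretend BPE: map each word to an integer id (illustrative only)."""
    ids = [hash(w) % TEXT_VOCAB for w in caption.lower().split()]
    return torch.tensor(ids[:max_len])

def tokenize_image(image):
    """Pretend discrete VAE: a real one would compress the image into a
    32x32 grid of codebook indices. Here we just fake that shape."""
    return torch.randint(0, IMAGE_VOCAB, (32 * 32,))

caption_tokens = tokenize_text("an armchair in the shape of an avocado")
image_tokens = tokenize_image(None)  # placeholder image

# Text ids and image ids are offset into one shared vocabulary and
# concatenated into a single sequence for the transformer to model.
sequence = torch.cat([caption_tokens, image_tokens + TEXT_VOCAB])
print(sequence.shape)  # torch.Size([1032]): 8 text tokens + 1024 image tokens
```

The important idea is the last line: text tokens and image tokens end up in one long sequence, and the transformer simply learns to predict the next token, whether it happens to be part of the caption or part of the picture.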
DALL-E does one thing really well: generating images from text input. It can create images from scratch, as well as complete a given partial image. Although generating images from text is its primary ability, that ability can be steered in different ways to achieve different objectives, which we will explore later in this article. DALL-E is not perfect, but it is far more advanced than its predecessors.
DALL-E can pick up many attributes and properties from a given text and use them to generate realistic images. Some of the images it generates still need improvement, but others were outright impossible for previous text-to-image neural networks.
DALL-E not only generates 2D images of varying complexity, but the images it generates also reflect the complexity of a 3D scene. For example, if you ask for a cat gazing at its own reflection, a close-up view of an animal, or the same animal in daytime and at night, it can generate all of that. To a certain degree, DALL-E has a perception of time of day and a geometric understanding of shapes and textures; as a result, it renders accurate reflections and shadows of objects, handles multiple objects with different properties, and much more.
DALL-E is built with PyTorch. OpenAI used to work with both TensorFlow and PyTorch until they standardized on PyTorch as their deep learning library for research, as they noted in a blog post on January 30, 2020. Since then, the neural networks released by OpenAI have been created with PyTorch, including the famous GPT-2, GPT-3, Image GPT, and now DALL-E.
The short answer is yes, you can. You can head over to OpenAI’s website and start experimenting with DALL-E. Note that you will not be able to enter text prompts from scratch, but you will be able to configure parts of pre-selected prompts and see the results for yourself. You will find many example sentences and the corresponding images generated by DALL-E. To start experimenting with DALL-E, you can simply google for it, or for your convenience, you can click on the link down below:
No, not by humans anyway. Remember when I said OpenAI released two neural networks simultaneously? This is where the second network steps in. When you enter a text prompt, DALL-E generates 512 candidate images, which are then ranked by the second neural network, CLIP, and the top 32 are displayed for you. CLIP essentially connects images with natural language text and ranks the generated images by how well they match the given description. CLIP learned how to do this from image–text data gathered from the internet.
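To make this ranking step concrete, here is a minimal sketch of how candidate images could be scored against a prompt with OpenAI’s open-sourced CLIP model (the clip package from their GitHub repository). The prompt and candidate file names are placeholders, and this is an illustration of the idea, not OpenAI’s actual selection pipeline.

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompt = "an armchair in the shape of an avocado"
candidate_files = ["candidate_0.png", "candidate_1.png", "candidate_2.png"]  # placeholders

# Encode the prompt once, and every candidate image.
text = clip.tokenize([prompt]).to(device)
images = torch.cat(
    [preprocess(Image.open(f)).unsqueeze(0) for f in candidate_files]
).to(device)

with torch.no_grad():
    text_features = model.encode_text(text)
    image_features = model.encode_image(images)

# Cosine similarity between the prompt and each candidate image.
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
scores = (image_features @ text_features.T).squeeze(1)

# Keep the best-matching candidates (the DALL-E demo keeps the top 32 of 512).
top = scores.argsort(descending=True)[:32]
print([candidate_files[i] for i in top.tolist()])
```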
By default, DALL-E generates 2D images. However, we can also give it a partial starting image to complete in accordance with the text prompt. For example, we can ask for a photograph of a 3D object and give it a partial view of that object to complete. As we rotate the starting image, DALL-E completes the rest of the image so that it still looks realistic. With the right methods, then, we can procedurally generate views of an object from a small starting point. So no, DALL-E cannot directly generate 3D objects, but because it captures the complexity of a 3D scene, we can extract a sequence of 2D images from different angles and turn them into a 3D object with post-processing. You can see an example of this on their website under the prompt “a photograph of a bust of homer”.
DALL-E can be used to generate 2D and 3D game characters and scenes. For example, you can describe an arbitrary game character, and once you find one you like, you can ask for that same character from different angles, holding different tools, or wearing different clothes. DALL-E can also be used to create 2D cartoons and comics, with both existing and newly created characters and scenes: the key frames of a character’s actions can be created by DALL-E, and the transitions between frames can be filled in with animation software. Other things you can do with DALL-E include on-demand stock images for content creators and on-demand home design. The latter can be especially useful in real estate, for example for planning home staging before open houses and for better client–property matching: on one hand, clients can design their dream homes, and on the other hand, existing homes can be matched to those designs for similarity using different metrics.
DALL-E can also be combined with other neural networks to perform more complex tasks, such as creating new designs from your speech when paired with a speech-to-text AI, or generating 3D objects and scenes for product design when paired with an AI that predicts 3D shapes from 2D images. Because DALL-E is a generative AI, trying completely different text inputs keeps revealing new things it can do.
Those are the basics you need to know, but I highly recommend heading over to OpenAI’s website and interacting with DALL-E yourself. The results will surprise you to varying degrees, as this is currently the most advanced AI for generating images from text input.
I hope you got some value out of this article. If you have any questions, leave them as a comment and I will get back to you as fast as I can. If your question requires a more extensive explanation, I can explain that in another article.