Open AI CLIP: learning visual concepts from natural language supervision

A transformed-based neural network that uses Contrastive Language–Image Pre-training to classify images

DALL-E seems to have gotten most of the attention this week, but I think CLIP may end up being even more consequential. We’ve been experimenting with it this week and the results seem almost too good to be true; it was even able to classify species of mushrooms in photos from my camera roll fairly well.

By Brad Dwyer on Facebook

A few days ago OpenAI released 2 impressive models CLIP and DALL-E. While DALL-E is able to generate text from images, CLIP classifies a very wide range of images by turning image classification into a text similarity problem. The issue with current image classification networks is that they are trained on a fixed number of categories, CLIP doesn’t work this way, it learns directly from the raw text about images, and thus it isn’t limited by labels and supervision. This is quite impressive, CLIP can classify images with state of the art accuracy without any dataset-specific training.

The main selling point for CLIP

OpenAI is trying to move away from conventional supervised learning methods. For instance, ImageNet (the largest images dataset) is only able to classify images that belong to the classes that it was trained on. It doesn’t make sense to keep adding a new class to the dataset and re-train the network long-term.

The ImageNet dataset, one of the largest efforts in this space, required over 25,000 workers to annotate 14 million images for 22,000 object categories. In contrast, CLIP learns from text–image pairs that are already publicly available on the internet. Reducing the need for expensive large labeled datasets has been extensively studied by prior work.

Source: OpenAI

Just imagine how much it costs to employ 25,000 workers!

The main selling point for CLIP is zero-shot image classification, this means that you can take a piece of text and an image and you can send these through the network and get a prediction of how likely they are to be similar.

This means that you can do classification without doing any prior training on your data set for your custom use case and this is really impressive because before this was the way that pretty much all classification networks were built as you would have a custom data set which would represent the sort of things you want to classify and then you would have images that match up with those and you have to send those through a training procedure and ultimately get your network out at the end while clip lets you circumvent.

Quick review: Contrastive learning

Contrastive learning is an approach to formulate the task of finding similar and dissimilar things for an ML model. Using this approach, one can train a machine learning model to classify between similar and dissimilar images.

Source: AnalyticsVidyha

To understand the power of this model, you have to understand what contrastive learning is. Contrastive learning has seen a boom of interest in self-supervised learning techniques especially in computer vision with papers like Simclr and Moco.

Photo by Max Baskakov on Unsplash

You can think of Contrastive learning as a matching problem. If you were to match the picture of a cat to another similar one, you can do it easily. First, recognize the first cat, then find an image of another cat. So, you can contrast between similar and dissimilar things.

How does it do that?

I think that one of the main reasons why this model outperforms other state-of-the-art models is that it uses a mixture of NLP and Computer vision techniques.

Contrastive pre-training

Pre-training methods are becoming more and more popular over the last few years and have revolutionized NLP.

The model starts off with contrastive pre-training where image text pairs are matched with the similarity from a batch of images. This is done using an image encoder and a text encoder. Contrastive pre-training attempts to learn noise invariant sequence representations which encourage consistency between the learned representations and the original sequence.

They got the inspiration from VirTex which is a pretraining approach using semantically dense captions to learn visual representations. This approach has been shown to surpass other supervised approaches such as classic high-end ImageNet networks.

2. Zero-shot prediction (as explained above)

This is pretty cool, if you want to try it for yourself, I recommend checking out this awesome blog post:

CLIP is awesome and revolutionary, but…

Every great model has its limitations. Although CLIP outperforms state-of-the-art models, it does have a few downsides.

The first one is that it doesn’t perform too well on systematic tasks such as counting the number of objects in images
Week generalization ability on images not covered in its pre-training dataset.
Sensitive to wording and phrasing

Final thoughts and takeaway

The purpose of this article is not to overhype CLIP as this is usually done with a lot of the brand new ML models. However, it’s always great to see innovation and new ideas. I hope you got the sense that the developers of CLIP were trying to move away from traditional ML techniques to more novel ones. I think that the first move to a more novel approach is always the more difficult one and I am sure we will be looking at better approaches built on CLIP in the future. If you want to find out more about the technical details of CLIP, I suggest having a look at their paper here