
Learning Transferable Visual Models From Natural Language Supervision
https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf
OpenAI proposes CLIP, which pairs an image encoder with a text encoder to perform tasks such as image classification in a GPT-3-like zero-shot fashion. Because CLIP is supervised with natural language rather than a fixed label set, it adapts to new tasks more flexibly than models pre-trained on ImageNet; for example, it can assign categories to illustrations. Beyond image classification, it also handles tasks such as action recognition, OCR, and fine-grained object classification. However, it performs poorly at tasks like counting objects or estimating distances (GPT-3 showed a similar weakness, failing to grasp physical relationships such as cheese melting).
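
As a concrete illustration of the zero-shot setup, here is a minimal sketch using the Hugging Face transformers implementation of CLIP: candidate class names are written as natural-language prompts, embedded by the text encoder, and scored against the image embedding. The checkpoint name is the public "openai/clip-vit-base-patch32" release; the image path and label prompts are placeholder assumptions, not anything prescribed by the paper.

```python
# Zero-shot image classification sketch with CLIP
# (assumes `transformers`, `torch`, and `Pillow` are installed).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder: any local image
# Classes are expressed as natural-language prompts, which is what lets
# CLIP score categories (e.g. illustrations) without task-specific training.
labels = ["a photo of a dog", "a photo of a cat", "an illustration of a dragon"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
# logits_per_image holds image-text similarity scores; softmax over the
# candidate labels turns them into per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0]):
    print(f"{label}: {p.item():.3f}")
```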