Highlights

CLIP (Contrastive Language-Image Pre-Training) is a SOTA image/text joint representation learning method trained from scratch on a dataset of 400 million (image, text) pairs downloaded from the internet. CLIP is SOTA on several datasets, in particular when used as a zero-shot learning method. Surprisingly, without having seen a single ImageNet image during training, it outperforms on ImageNet a fully-trained ResNet-50!

Contrastive training

The main idea of the method is fairly easy to understand. As shown in figure 1, the system has a text and an image encoder which are both fed with a batch of N elements : N images for the image encoder and N text descriptions for the text encoder. For each batch, the i-th image fits the i-th description. Following the algorithm of figure 3, the image and text feature vectors of each data are projected to a joint latent space and scaled with a pairwise cosine similarity function. This results into an \(N\times N\) set of (image,text) embedding pairs. Of these pairs, \(N\) are positive pairs and \(N\times N-N\) are negative pairings. The resulting cross entropy loss is a positive-vs-negative contrastive loss.

Test time

At test time, one image is fed to the image encoder while N text sentences are fed to the text encoder. Notice that all sentences is a text prompt such as “A photo of a” which includes the class name (plane, car, dog, etc). The authors advocate that engineering the right prompt is a key to success. For example, in order to disambiguate the word boxer (dog breed vs the sport) the prompt could be “photo of a boxer, an animal” in the case of an animal dataset. At the end, the image is classified according to the sentence whose pairwise cosine similarity is the largest.

Results

The paper contains numerous results obtained with different image encoders (ResNet and Vision Transformer) and a Transformer network as text encoder. From these results, I kept the following three:

In Table 1, we get the accuracy of CLIP on 3 datasets compared to Visual N-Grams, a well-known zero-short learning method. Notice the 76,2% accuracy on ImageNet, on par with ResNet trained on that dataset!
Figure 5 shows that zero-shot CLIP is very competitive on 27 datasets. Here it is compared with a linear classifier fitted on ResNet-50 features.
Figure 11, CLIP feature space is better than that of the best ImageNet pretrained model.