This paper is in review for ICLR 2021: https://openreview.net/forum?id=YicbFdNTTy

Highlights

The transformer architecture, known for its performance on NLP tasks, is applied to image classification
The proposed architecture, “ViT” (Vision Transformer) is shown to perform as well or better than CNNs for image classification on large scale datasets
The usefulness/superiority of the proposed transformer over CNNs only appears when the number of images in the dataset reaches about 100 million
The proposed architecture brings a reduction of FLOPS of a factor of 2 to train, compared to CNNs, for a given performance
The authors have used more than 25,000 TPUv3 * days over their experiments

Methods

Architecture

Images are split into patches (16x16 yields the best results)
The patches are flattened, and become the tokens (or “words”)
The flattened patches are projected using a MLP
Each flattened patch is concatenated with a vector which represents the position of the patch. A positional embedding is learned, which maps a one-dimensional patch index to a vector representation.
Note that this allows the very first layer to attend to any part of the image, in comparison with CNNs for which the receptive field develops over many layers.
The first token is a dummy, and is only there because the feature vector of the image will appear at that position at the last layer of the network.

Experiment design

The models are pre-trained on either ImageNet (1k classes, 1.3M images), ImageNet-21k (21k classes, 14M images) or JFT (18k classes, 303M images).
The models are then fined-tuned on one of the datasets listed in the “Benchmarking datasets” section below.
Sometimes, the model is not fine tuned, but is evaluated in a few-shot regime. This is not well described in the paper: “Few-shot accuracies are obtained by solving a regularized linear regression problem that maps the (frozen) representation of a subset of training images to \(\{−1,1\}^K\) target vectors. Though we mainly focus on fine-tuning performance, we sometimes use linear few-shot accuracies for fast on-the-fly evaluation where fine-tuning would be too costly.”
They also have tried self-supervised pre-training, in which the model predicts masked patches. This is given little importance, and the results are “only” promising, so I will not write about this further.

Benchmarking datasets

The experiments have been run on a number of datasets of image classification:

ImageNet
ImageNet ReaL (“Reassessed Labels”, from Beyer et al. 2020) Code
CIFAR10/100
Oxford Pets, Oxford Flowers
VTAB (Zhai et al., 2019b) (“VTAB evaluates low-data transfer using 1 000 examples to diverse tasks. The tasks are divided into three groups: Natural– tasks like the above, Pets, CIFAR, etc. Specialized– medical and satellite imagery, and Structured– tasks that require geometric understanding like localization.”) Blog post

Results

As seen in the table below, ViT performs slightly better than a very large ResNet, and does so using significantly less FLOPS (for the fine-tuning phase).

However, the pre-training has to involve a very large number of training samples: when this number exceeds 100 million, ViT starts to shine. Else, the ResNet performs better. See below:

ViT also compares favourably in terms of pre-training FLOPS, as seen below:

The “Hybrid” approach uses CNN feature vectors as tokens; it is not considered very important by the authors.

The figures below serve to inspect the vision transformer architecture:

Conclusions

The Vision Transformer is an architecture that can outperform CNNs given datasets in the 100M-image range. It required less FLOPS to train than the CNNs used in this paper.

Remarks

This video is a good commentary on the paper. It offers this interpretation: the transformer is a general architecture, with less priors than a CNN. It can learn the spatial structure of the problem, and a CNN cannot.