Highlights

  • New self-supervised learning framework, called DINO, that synergizes especially well with vision transformers (ViT);
  • In-depth comparison of the properties that emerge in ViT pretrained with DINO versus convolutional networks (convnets) and ViT trained in a supervised fashion. The most interesting emerging properties are:
    • The self-attention maps of the last block explicitly contain the scene layout and, in particular, object boundaries;
    • The features achieve high top-1 accuracy on ImageNet when used directly with a basic k-nearest-neighbor (k-NN) classifier.

Introduction

The authors hypothesize that one of the main reasons for the success of transformers in NLP is the use of self-supervised pretraining, and that the muted success of ViT so far stems from its reliance on supervised pretraining. According to them:

image level supervision often reduces the rich visual information contained in an image to a single concept selected from a predefined set of a few thousand categories of objects.

While not proposing many brand-new ideas, the authors do thorough investigative work on the engineering required to make self-supervised ViT training work in computer vision, listing the tricks needed to avoid collapse during training.

Methods

The main contribution is a self-supervised training method called DINO, so named because it can be interpreted as a form of knowledge distillation with no labels. Fig. 2 summarizes the main idea behind DINO.

Main engineering components to ensure good training:

  • Multi-crop training: The self-supervised task is formulated as learning to predict the same representation for different crops of an image. Crops are categorized as global views (covering > 50% of the original image) or local views (covering < 50% of the original image). The teacher network receives only the global views, while the student network receives all the crops. The idea is to encourage “local-to-global” correspondence in the extracted features (see the first sketch after this list).
  • Momentum encoder: The teacher network is updated with an exponential moving average (ema) of the student parameters (see the second sketch after this list). In contrast to other teacher-student approaches, the teacher here consistently outperforms the student. The authors theorize that the teacher acts as a sort of model ensemble, similar to Polyak-Ruppert averaging.
  • Centering and sharpening of teacher outputs: To avoid collapse towards trivial features, the authors apply two operations with opposite effects to the teacher’s outputs, which together make the network learn adequately (see the last sketch after this list):
    • Centering: Basically adding a bias term to the output s.t. \(g_{\theta}(x) \leftarrow g_{\theta}(x) + c\), where the centering term \(c\) is computed on batch statistics and updated using an ema. It prevents one dimension from dominating a softmax, but encourages collapse to a uniform distribution.
    • Sharpening: Obtained by using a low value for the temperature \(\tau\) in the softmax: \(P(x)^{(i)} = \frac{\exp(g_{\theta}(x)^{(i)} / \tau)}{\sum_{k=1}^{K} \exp(g_{\theta}(x)^{(k)} / \tau)}\). It has the opposite effect of centering, i.e. it “sharpens” the softmax, at the risk of letting one dimension dominate.
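
As a rough illustration of the multi-crop scheme, the sketch below builds two global views and several local views of an image with torchvision transforms. The crop scales, resolutions, and number of local views are illustrative assumptions, not the paper’s exact settings.

```python
# Illustrative multi-crop view generation (a sketch, not the authors' code).
from torchvision import transforms

# Global views cover a large part of the image; local views cover a small part.
# Scale ranges and resolutions below are assumed for illustration.
global_crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
local_crop = transforms.Compose([
    transforms.RandomResizedCrop(96, scale=(0.05, 0.5)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def multi_crop(img, n_local=6):
    """Return 2 global views (teacher and student) and n_local local views (student only)."""
    global_views = [global_crop(img) for _ in range(2)]
    local_views = [local_crop(img) for _ in range(n_local)]
    return global_views, local_views
```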
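
The momentum-encoder update is simply an exponential moving average over parameters. A minimal sketch, assuming the student and teacher share the same architecture (a fixed momentum value is used here for simplicity):

```python
# Illustrative EMA (momentum) update of the teacher from the student.
import torch

@torch.no_grad()
def update_teacher(student, teacher, momentum=0.996):
    # teacher_param <- m * teacher_param + (1 - m) * student_param
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)
```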
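
Finally, a minimal sketch of how centering and sharpening enter the loss. Here the center is a running mean of teacher outputs, playing the role of the bias term \(c\) above with the sign folded into a subtraction; the temperatures and momentum values are illustrative, not the paper’s exact settings.

```python
# Illustrative DINO-style loss with teacher centering and sharpening (a sketch, not the authors' code).
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    # Sharpening: a low teacher temperature tau_t makes the teacher distribution peaky.
    # Centering: subtracting the running center keeps any single dimension from dominating.
    t = F.softmax((teacher_out.detach() - center) / tau_t, dim=-1)  # no gradient through the teacher
    log_s = F.log_softmax(student_out / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()  # cross-entropy between teacher and student outputs

@torch.no_grad()
def update_center(center, teacher_out, momentum=0.9):
    # EMA over batch statistics: c <- m * c + (1 - m) * mean(teacher outputs)
    return momentum * center + (1.0 - momentum) * teacher_out.mean(dim=0, keepdim=True)
```

In a full training step, the student is updated by gradient descent on this loss, while the teacher parameters and the center are updated with their respective moving averages.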

Data

The authors mainly used DINO to pretrain ViT on ImageNet, though other datasets (used without labels) appear in some experimental setups.

Results

The full paper (and even more so the supplementary material) provides many results on different downstream tasks. The authors also provide a comprehensive ablation study of DINO’s components, as well as an analysis of how and why training fails when the framework lacks some of its essential tricks.

The figures/tables below were selected because they detail results on common image tasks/datasets.
