Highlights

Methods

Split image in non-overlapping patches. Typically 16x16 images. Fixed image size: 224x224.
Take each patch, and flatten and embed it to a high dimensionality vector : we get a (num_patches, vec_size) matrix
Residual Multi-Perceptron Layer (repeated multiple times)
- Cross-patch connectivity block: transpose the data matrix before passing it through a linear layer. Now, each channel is independent, and a patch can be influenced by all patches from the previous layer. The weights are the same for all channels.
- Cross-channel connectivity block: apply a linear layers to the (num_patches, vec_size).
- Non-linearities: Only one GeLU layer after the cross-channel communication.
- Normalization layers: No batch norm, layer norm or anything. Replaced by a learned affine transformation. Used before and after blocks of operations.
- Skip connections
Global average pooling
Final linear layer

They also propose a model for sequence-to-sequence modeling.

The authors reuse many aspects of the methodology of their previous paper(s) about the training of Vision Transformers.

Compare with other classification architectures (CNN, transformers) in supervised learning
- Better measure generalisation by using other ImageNet-like datasets with a “clearly defined” test set
Self-supervised learning (DINO)
Knowledge Distillation (Distill a CNN into a ResMLP)
Seq2Seq
Visualisation of cross-patch connectivity
Sparsity of the weights
Ablation studies

Surprising results for a model consisting mainly of linear layers
Benefits greatly from Knowledge Distillation
Something similar to self-attention can be done simpler than what is seen in Transformers
The use of BatchNorm, and normalization layers that rely on batch statistics in general, is questionned.

They are not alone studying this, they mention 4 concurrent works (footnote p.2)
Maybe it shows that the vast majority of examples these datasets can be solved using crude intuition. It’s only for a small number of harder examples that a more advanced architecture makes a difference. But in the small number of examples where ResMLP fails, it may not be acceptable mistakes.
Adversarial examples? Easier to fool this model?