In a nutshell, this paper shows how to convert a UNet into a so-called TransUNet by using a visual transformer (ViT) network in the encoder. Details of the architecture are in Figure 1. As opposed to other methods which use a pure transformer-based encoder to convert the input image into a latent vector, the authors use a series of convolutions (much like in the original UNet) to convert the input image into a set of lower-resolution feature maps which they then encode with a ViT.


The paper starts by summarizing what a ViT is. Starting with an input data \(\mathbf{x}\in R^{H\times W\times C}\) (which could be the input image or a set of \(C\) feature maps), \(\mathbf{x}\) is sliced into \(N\) non-overlapping patches of size \(P\times P\times C\) where \(N = \frac{NW}{P^2}\). The patches are then flatten into 1D vectors and projected into a D-dimensional embedding space with a weight matrix \(E\) which leads to a set of \(N\) D-dimensional vectors:

The authors mention that by doing so, it allows the system to learn patch embedding positions (hence the use of \(E\) as weight matrix). The resulting \(z_0\) embedding vector is fed to the transformer.

As for the transformer, it consists of \(L\) (\(L=12\) in their experiments) Multihead Self-Attention (MSA) and Multi-Layer Perceptron (MLP) piped one after the other:

where LN is a layer normalization operation. The resulting latent vector (aka hidden feature in Fig.1) is then reshaped and upsampled like a usual UNet.


They tested their method on two datasets : the synapse multi-organ dataset (30 CT scan images) and ACDC. Results reveal state-of-the-art performances. All experiments were conducted on a single RTX2080 GPU.

CODE: The code is available here :