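PixelSNAIL is an autoregressive generative model: it factorizes the joint distribution into a product of conditionals,

\[ p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}) \]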

In this case, \((x_1, ..., x_n)\) are the pixels of an image.
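
As a rough illustration of why the likelihood is tractable, the sketch below scores a sequence of binary pixels by summing log conditional probabilities, one pixel at a time. It is not PixelSNAIL: `toy_conditional` is a hypothetical stand-in for the network (here a smoothed count of the ones seen so far), and the binary pixel values are an assumption made to keep the example small.

```python
import math


def toy_conditional(prefix):
    """Hypothetical stand-in for the network: P(x_i = 1 | previous pixels).

    Assumption for illustration only: Laplace-smoothed frequency of ones so far.
    """
    return (sum(prefix) + 1) / (len(prefix) + 2)


def log_likelihood(pixels):
    """Chain rule: log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})."""
    total = 0.0
    for i, x in enumerate(pixels):
        p_one = toy_conditional(pixels[:i])
        total += math.log(p_one if x == 1 else 1.0 - p_one)
    return total


def sample(n, rng):
    """Generate n pixels one at a time, each conditioned on the ones before it."""
    pixels = []
    for _ in range(n):
        pixels.append(1 if rng.random() < toy_conditional(pixels) else 0)
    return pixels
```

Because every conditional is a normalized distribution, the probabilities of all possible images sum to one without any extra normalization step. Sampling, on the other hand, is sequential: `sample(16, random.Random(0))` has to call the conditional once per pixel.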

Advantages of using an autoregressive generative model:

  • Tractable likelihood and easy training (as opposed to GANs)
  • Better density estimation (log-likelihood) results than latent variable models

Candidate architectures for the conditionals, and where each one falls short:

  • Traditional RNNs struggle to capture very long-range dependencies
  • Causal convolutions (see PixelCNN) have a finite receptive field
  • Self-attention (Transformer, from "Attention Is All You Need") requires keeping access to all previously generated elements
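
To put numbers on the trade-off between the last two options, here is a small sketch for the 1D case. The formula for stacked causal convolutions (one plus the sum of `(kernel_size - 1) * dilation` over layers) is standard; the function names are made up for this example, and treating the image as a flattened 1D sequence is a simplifying assumption.

```python
def causal_conv_receptive_field(kernel_size, dilations):
    """Positions (current one included) visible to one output of a stack of
    1D causal convolutions, one layer per entry in `dilations`."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_attention_receptive_field(position):
    """A causal attention layer at 0-based `position` can read every position
    up to and including itself, so its reach grows with the sequence."""
    return position + 1


# Four plain layers with kernel size 3 see only a short window ...
plain = causal_conv_receptive_field(3, [1, 1, 1, 1])
# ... dilation widens the window, but it is still fixed by the architecture.
dilated = causal_conv_receptive_field(3, [1, 2, 4, 8])
# Attention at the last pixel of a flattened 32x32 image reaches all 1024 positions.
attention = causal_attention_receptive_field(32 * 32 - 1)
```

The convolutional reach is a constant of the architecture, while the attention reach covers the whole prefix, which is also why attention has to keep every previously generated element around.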

The ordering of the pixels is arbitrary. A raster scan (row by row, left to right within each row) is the usual choice:
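
A raster scan can be written down in a few lines. The helper names below are made up for this example, and the image is assumed to be single-channel, so that one pixel corresponds to one index \(i\).

```python
def raster_index(row, col, width):
    """Position of pixel (row, col) in the raster-scan sequence (0-based)."""
    return row * width + col


def raster_position(index, width):
    """Inverse mapping: sequence index back to (row, col)."""
    return divmod(index, width)


def raster_order(height, width):
    """All pixel coordinates in raster-scan order."""
    return [(r, c) for r in range(height) for c in range(width)]


def context_of(row, col, width):
    """Pixels that come before (row, col): all rows above, plus the pixels
    to the left on the same row."""
    return [raster_position(i, width) for i in range(raster_index(row, col, width))]
```

For a 2x3 image, `context_of(1, 1, 3)` is the whole first row followed by the pixel `(1, 0)`.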

For example, causal convolutions (PixelCNN) are designed using a raster scan ordering:
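
One common way to build such a convolution is to multiply the kernel by a binary mask that zeroes every weight at or after the current pixel in raster-scan order. The sketch below builds that mask as nested lists; it assumes an odd kernel size and a single channel (the channel ordering used for RGB images is left out), and the function name is invented for this example.

```python
def causal_mask(kernel_size, include_center):
    """Binary mask for a kernel_size x kernel_size convolution kernel.

    Weights above the center row, and left of the center on the center row,
    are kept. The center weight is kept only if include_center is True
    (False for a first layer, which must not see the pixel it predicts).
    """
    center = kernel_size // 2
    mask = []
    for r in range(kernel_size):
        row = []
        for c in range(kernel_size):
            before = r < center or (r == center and c < center)
            at_center = r == center and c == center
            row.append(1 if before or (at_center and include_center) else 0)
        mask.append(row)
    return mask


first_layer_mask = causal_mask(3, include_center=False)
later_layer_mask = causal_mask(3, include_center=True)
```

For a 3x3 kernel, the first-layer mask keeps the top row and the left neighbour only; later layers can also keep the center, since their input at that position no longer contains the pixel being predicted.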

The idea of PixelSNAIL is to combine a residual block and a self-attention block.
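
The toy sketch below shows the general shape of that combination on a 1D sequence of scalars: a residual causal convolution followed by causal self-attention, with the two outputs added. It is only an illustration under strong assumptions (scalar features, no learned projections, `tanh` as the nonlinearity, summation as the way to merge the two paths) and does not reproduce the blocks used in the paper. All function names are hypothetical.

```python
import math


def causal_conv_residual(xs, weights):
    """Residual causal convolution: out[t] = xs[t] + tanh(sum_k weights[k] * xs[t - k])."""
    out = []
    for t in range(len(xs)):
        acc = 0.0
        for k, w in enumerate(weights):
            if t - k >= 0:
                acc += w * xs[t - k]
        out.append(xs[t] + math.tanh(acc))
    return out


def causal_self_attention(xs):
    """Scalar dot-product attention where position t attends only to positions <= t."""
    out = []
    for t in range(len(xs)):
        scores = [xs[t] * xs[s] for s in range(t + 1)]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(score - top) for score in scores]
        norm = sum(weights)
        out.append(sum(w * xs[s] for s, w in enumerate(weights)) / norm)
    return out


def toy_block(xs, weights):
    """Convolution path (local, high bandwidth) plus attention path (global reach)."""
    h = causal_conv_residual(xs, weights)
    a = causal_self_attention(h)
    return [hi + ai for hi, ai in zip(h, a)]
```

Both paths are causal, so changing a later input never changes an earlier output. Note that this toy lets position t see its own input; a real autoregressive model additionally shifts or masks the input so that the prediction for a pixel depends only on earlier pixels.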

Receptive field for a randomly initialized model (derivative of the predicted yellow pixel w.r.t. the input):
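
The same kind of measurement can be sketched on a toy model: perturb each input in turn and record how much one chosen output moves. The example below uses finite differences on a small stack of 1D causal convolutions as a stand-in for the real network and for automatic differentiation; the model, its weights and the function names are all assumptions made for this illustration.

```python
def causal_conv(xs, weights):
    """Plain 1D causal convolution: out[t] = sum_k weights[k] * xs[t - k]."""
    return [sum(w * xs[t - k] for k, w in enumerate(weights) if t - k >= 0)
            for t in range(len(xs))]


def toy_model(xs, layers):
    """Stack of causal convolutions, one list of weights per layer."""
    h = list(xs)
    for weights in layers:
        h = causal_conv(h, weights)
    return h


def input_gradient(model, length, target, eps=1e-3):
    """Finite-difference estimate of d output[target] / d input[i] for every i."""
    base = [0.1 * (i + 1) for i in range(length)]
    reference = model(base)[target]
    grads = []
    for i in range(length):
        bumped = list(base)
        bumped[i] += eps
        grads.append((model(bumped)[target] - reference) / eps)
    return grads


layers = [[0.5, 0.3, 0.2], [0.5, 0.3, 0.2]]
grads = input_gradient(lambda xs: toy_model(xs, layers), length=10, target=6)
# Inputs with a non-zero derivative form the receptive field of output 6.
field = [i for i, g in enumerate(grads) if abs(g) > 1e-6]
```

With two layers of kernel size 3, only the five inputs ending at the target position have a non-zero derivative; everything earlier is invisible to a convolution-only stack, and everything later is excluded by causality.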


The authors compare their results with other tractable-likelihood methods on CIFAR-10, ImageNet 32x32 and ImageNet 64x64.