Conventional convolutional network architectures that use normalization layers tend to “wash away” the information in input semantic masks. The authors therefore argue that they are sub-optimal at best for semantic image synthesis.

The authors propose a new normalization layer called SPatially-Adaptive DEnormalization (SPADE) to overcome these limitations.


Whereas batch norm normalizes activations to zero mean and unit variance, SPADE then denormalizes the output with a learned scale \(\gamma\) and bias \(\beta\) produced by a two-layer CNN applied to the input segmentation mask.

The activation value after a SPADE layer at site \((n, c, y, x)\) of layer \(i\) is given by

\[
\gamma^i_{c,y,x}(m)\,\frac{h^i_{n,c,y,x} - \mu^i_c}{\sigma^i_c} + \beta^i_{c,y,x}(m)
\]

where \(i\) indexes the layer; \(n, c, y, x\) are the batch element, channel, y and x coordinates; \(m\) is the input segmentation mask; \(h\) is the output of the activation layer; and the per-channel mean \(\mu^i_c\) and standard deviation \(\sigma^i_c\) are computed over the batch and spatial dimensions:

\[
\mu^i_c = \frac{1}{N H^i W^i} \sum_{n,y,x} h^i_{n,c,y,x},
\qquad
\sigma^i_c = \sqrt{\frac{1}{N H^i W^i} \sum_{n,y,x} \left(h^i_{n,c,y,x}\right)^2 - \left(\mu^i_c\right)^2}
\]
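This formula is straightforward to sketch in NumPy. The function name `spade` and all shapes below are illustrative; the \(\gamma\) and \(\beta\) maps are taken as given here, as if already produced from the mask by the modulation network.

```python
import numpy as np

def spade(h, gamma, beta, eps=1e-5):
    """Apply SPADE-style modulation to activations h of shape (N, C, H, W).

    gamma, beta: spatially-varying modulation maps of shape (C, H, W),
    assumed to have been computed from the segmentation mask m.
    """
    # Per-channel batch-norm statistics over batch and spatial dims.
    mu = h.mean(axis=(0, 2, 3), keepdims=True)                    # mu^i_c
    sigma = np.sqrt(h.var(axis=(0, 2, 3), keepdims=True) + eps)   # sigma^i_c
    # Normalize, then denormalize with the mask-dependent scale and bias.
    return gamma * (h - mu) / sigma + beta

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 3, 4, 4))
gamma = rng.normal(size=(3, 4, 4))
beta = rng.normal(size=(3, 4, 4))
out = spade(h, gamma, beta)   # same shape as h: (2, 3, 4, 4)
```

With \(\gamma = 1\) and \(\beta = 0\) this reduces to plain batch normalization, which is a handy sanity check.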

Architecturally speaking, the modulation network is a shared convolutional layer (plus ReLU) applied to the segmentation mask, followed by two parallel convolutions that produce the \(\gamma\) and \(\beta\) maps.
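A rough NumPy sketch of that two-layer network, using 1×1 convolutions for brevity (the paper uses 3×3 kernels; the weights and sizes here are random stand-ins, not learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
C_mask, C, H, W = 5, 3, 4, 4   # label classes, feature channels, spatial size

# One-hot segmentation mask m with shape (C_mask, H, W).
labels = rng.integers(0, C_mask, size=(H, W))
m = np.eye(C_mask)[labels].transpose(2, 0, 1)

def conv1x1(x, w, b):
    # A 1x1 convolution is a per-pixel linear map over channels.
    return np.einsum('oc,chw->ohw', w, x) + b[:, None, None]

# Layer 1: shared conv + ReLU; layer 2: two parallel convs for gamma and beta.
hidden = 8
w0, b0 = rng.normal(size=(hidden, C_mask)), np.zeros(hidden)
wg, bg = rng.normal(size=(C, hidden)), np.zeros(C)
wb, bb = rng.normal(size=(C, hidden)), np.zeros(C)

shared = np.maximum(conv1x1(m, w0, b0), 0)   # shared features, ReLU
gamma = conv1x1(shared, wg, bg)              # gamma(m): shape (C, H, W)
beta = conv1x1(shared, wb, bb)               # beta(m):  shape (C, H, W)
```

With 1×1 kernels, every pixel carrying the same label gets the same modulation; the 3×3 kernels in the paper additionally let \(\gamma\) and \(\beta\) depend on the local label neighborhood.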

The authors argue that it is a generalization of Conditional Batch Norm¹, as SPADE would be the same if you replaced the segmentation mask with the image class label and made the modulation parameters spatially invariant (\(\gamma^i_{c,y_1,x_1} \equiv \gamma^i_{c,y_2,x_2} \equiv \dots\)).
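The reduction is easy to check numerically: if the modulation maps are constant over space, SPADE's output coincides with conditional batch norm's per-channel scale-and-shift. A NumPy sketch with random stand-in parameters:

```python
import numpy as np

def batchnorm(h, eps=1e-5):
    # Normalize over batch and spatial dims, per channel.
    mu = h.mean(axis=(0, 2, 3), keepdims=True)
    sigma = np.sqrt(h.var(axis=(0, 2, 3), keepdims=True) + eps)
    return (h - mu) / sigma

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 3, 4, 4))

# Conditional BN: per-channel scale and bias, no spatial dependence.
gamma_c = rng.normal(size=(3, 1, 1))
beta_c = rng.normal(size=(3, 1, 1))
cond_bn = gamma_c * batchnorm(h) + beta_c

# SPADE with spatially-constant modulation maps: broadcast the same
# per-channel values to every (y, x) site.
gamma_map = np.broadcast_to(gamma_c, (3, 4, 4))
beta_map = np.broadcast_to(beta_c, (3, 4, 4))
spade_out = gamma_map * batchnorm(h) + beta_map

assert np.allclose(cond_bn, spade_out)
```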


The generator architecture is quite simple: a stack of SPADE residual blocks interleaved with upsampling layers. Because the segmentation mask enters through the SPADE layers, no encoder is needed.

The full structure of the networks is available in the paper.


The results are pretty wild.

A number of other metrics, qualitative evaluation results, and ablation and augmentation studies are available in the paper.


1: Conditional Batch Norm:

Page (look at the video!)