## Summary

A “guiding block” is inserted inside a CNN, at the bottleneck (the encoding layer with the smallest spatial resolution). It uses text hints to modulate the feature maps in order to refine the CNN’s prediction.

### Guiding block

In order to modify an activation map $$A \in \mathbb{R}^{H \times W \times C}$$, an RNN reads the hint sentence (as a sequence of word embeddings), and its last hidden state is fed to a dense layer that predicts:

• Channel re-weighting vector $$\gamma^{(s)} \in \mathbb{R}^C$$ and channel bias $$\gamma^{(b)} \in \mathbb{R}^C$$
• Spatial re-weighting vectors $$\alpha \in \mathbb{R}^H$$ and $$\beta \in \mathbb{R}^W$$

Thus, a single element of the modified feature map is: $$A^{\prime}_{h,w,c} = (1 + \alpha_h + \beta_w + \gamma^{(s)}_c) A_{h,w,c} + \gamma^{(b)}_c$$

This way, the dense layer has to predict far fewer values than a full per-element transformation would require. For example, for $$32 \times 32 \times 1024$$ activation maps, instead of $$\approx$$ 1 million values $$( H \times W \times C )$$, only 2112 $$( H + W + 2C )$$ are needed: $$H + W$$ for the spatial vectors and $$2C$$ for the channel scale and bias.
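The modulation above is a single broadcasted multiply-add. A minimal NumPy sketch (the four vectors are random stand-ins for the dense layer's output; shapes follow the $$32 \times 32 \times 1024$$ example):

```python
import numpy as np

# Shapes from the example in the text; any H, W, C work the same way.
H, W, C = 32, 32, 1024
A = np.random.rand(H, W, C).astype(np.float32)

# In the model these four vectors come from a dense layer applied to the
# last RNN state; here they are random placeholders.
alpha = np.random.rand(H).astype(np.float32)    # spatial re-weighting over rows
beta = np.random.rand(W).astype(np.float32)     # spatial re-weighting over columns
gamma_s = np.random.rand(C).astype(np.float32)  # per-channel scale
gamma_b = np.random.rand(C).astype(np.float32)  # per-channel bias

# Broadcast each vector along its own axis: alpha over H, beta over W,
# gamma over C, then apply the multiply-add from the formula.
scale = 1.0 + alpha[:, None, None] + beta[None, :, None] + gamma_s[None, None, :]
A_mod = scale * A + gamma_b[None, None, :]

# Spot-check one element against the formula in the text.
h, w, c = 3, 7, 100
expected = (1 + alpha[h] + beta[w] + gamma_s[c]) * A[h, w, c] + gamma_b[c]
assert np.isclose(A_mod[h, w, c], expected)
```

Because the vectors broadcast over the whole map, the cost of applying the guidance is one element-wise pass over $$A$$, independent of how the vectors were produced.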

### Hint generation algorithm

An automatic hint-generation algorithm was developed to produce hints that are useful to the model, without requiring a human in the loop.

Using the ground truth and the predicted segmentation, a “query” is created based on a few criteria, such as missing semantic classes, noise in the prediction, or wrongly predicted pixels that should be replaced.

To do so, the image is divided into a coarse grid, and each cell is evaluated for missing or mistaken classes. A class is then selected from the possible choices, and a query is generated based on its position in the grid.

## Experiments

Various experiments on PascalVOC (2012) and MSCOCO-Stuff (2014) evaluate where the guiding block should be placed, how many hints are needed, how complex the hints should be, etc.