Contribution

The authors introduce the ‘Squeeze-and-Excitation’ (SE) block, an architectural extension intended to enable a network “to perform feature recalibration through which it can […] selectively emphasise […] and suppress” features.

They show that SE blocks improve performance on several datasets and for several architectures while keeping the added network complexity small (in terms of both the number of parameters and the computational load).

Proposed mechanism

The basic idea is to make the network model non-linear interdependencies between the feature maps of different channels, without any additional supervision. This is achieved by reducing the output feature maps of a transform block of the original network to a vector of channel-wise scalar statistics (e.g. via global average pooling) and predicting one scalar weight per channel from that vector. Each channel is then rescaled by its weight.

SE-block

\(\mathbf{F}_{tr}\): transform of the original network, e.g. convolutional block
\(z_c = \mathbf{F}_{sq}(\mathbf{u}_c) = \frac{1}{H \times W} \sum_{i=1}^H \sum_{j=1}^W \, u_c(i, j)\): squeeze operation
\(\mathbf{s} = \mathbf{F}_{ex}(\mathbf{z}, \mathbf{W}) = \sigma(g(\mathbf{z}, \mathbf{W})) = \sigma(\mathbf{W}_2\delta(\mathbf{W}_1\mathbf{z}))\): excitation operation (\(\delta\): ReLU)
\(\tilde{\mathbf{x}}_c = \mathbf{F}_{scale}(\mathbf{u}_c, s_c) = s_c \cdot \mathbf{u}_c\): recalibration operation (i.e. rescaling)
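
The three operations above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: feature maps are nested lists, the function names (`squeeze`, `excite`, `scale`, `se_block`) are made up for this example, and the weight matrices `w1` and `w2` are passed in by hand instead of being learned. Biases are omitted, as in the formula for \(\mathbf{F}_{ex}\).

```python
import math

def squeeze(u):
    """F_sq: global average pooling. u is a list of C channels, each an H x W nested list."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]

def matvec(w, v):
    """Multiply matrix w (list of rows) with vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]

def excite(z, w1, w2):
    """F_ex: s = sigmoid(W2 * relu(W1 * z))."""
    hidden = [max(0.0, h) for h in matvec(w1, z)]                     # delta: ReLU
    return [1.0 / (1.0 + math.exp(-g)) for g in matvec(w2, hidden)]  # sigma: sigmoid

def scale(u, s):
    """F_scale: multiply every value of channel c by its scalar weight s_c."""
    return [[[s_c * val for val in row] for row in ch] for ch, s_c in zip(u, s)]

def se_block(u, w1, w2):
    """Apply squeeze, excitation and rescaling to the transform output u."""
    z = squeeze(u)
    s = excite(z, w1, w2)
    return scale(u, s), s
```

Calling `se_block(u, w1, w2)` on a two-channel input with a `1 x 2` matrix `w1` and a `2 x 1` matrix `w2` returns the rescaled feature maps together with the per-channel weights `s`, each of which lies strictly between 0 and 1 because of the sigmoid.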

Examples for extension of existing architectures

Experiments

Extension of existing architectures

SE blocks led to improvements for all investigated base-networks on the ImageNet 2012 dataset.
The computational overhead is small.
The improvement is consistent across different network depths.
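
To get a feeling for why the parameter overhead stays small, one can count the weights that a single SE block adds. The sketch below assumes that \(\mathbf{W}_1\) maps the \(C\) channel statistics to \(C/r\) hidden units and \(\mathbf{W}_2\) maps them back to \(C\) weights; the bottleneck shape, the reduction ratio `reduction` and its default of 16 are assumptions of this example and are not stated in these notes. The function name `se_extra_params` is hypothetical.

```python
def se_extra_params(channels, reduction=16):
    """Weights added by one SE block with a C -> C/r -> C bottleneck (biases omitted)."""
    hidden = channels // reduction          # size of the bottleneck layer
    return channels * hidden + hidden * channels

# Added weights for a few channel widths under the assumed reduction ratio.
for c in (256, 512, 1024, 2048):
    print(c, se_extra_params(c))
```

Under these assumptions the count grows as \(2C^2/r\), so the blocks attached to the widest layers account for most of the extra parameters.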

Different data sets

Similar improvements were also shown for other datasets and tasks:
- Scene Classification: Places365
- Object Detection: COCO

Analysis of reweighting step

Caption: Colored curves represent the average activations for different classes (computed over 50 samples for each class) plotted over channel index.

In ‘early’ layers, the activations of the excitation step (i.e. the rescaling weights) are nearly identical across classes.
In ‘later’ layers (panels e and f of the figure), the activations are saturated.
The reweighting among channels seems to be most significant in ‘intermediate’ layers.
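
The kind of analysis described in the caption can be reproduced roughly as follows: average the excitation vector \(\mathbf{s}\) over the samples of each class, then compare the resulting per-class curves channel by channel. This is only a sketch of that idea. The helper names `class_mean_excitations` and `class_spread` are hypothetical, and the per-channel range used as a measure of class dependence is an assumption of this example, not a quantity reported by the authors.

```python
def class_mean_excitations(samples):
    """samples: list of (label, s) pairs, where s is a list of C excitation weights."""
    sums, counts = {}, {}
    for label, s in samples:
        acc = sums.setdefault(label, [0.0] * len(s))
        for c, val in enumerate(s):
            acc[c] += val
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def class_spread(means):
    """Per-channel range (max - min) of the class means; values near 0 mean class-agnostic."""
    curves = list(means.values())
    return [max(col) - min(col) for col in zip(*curves)]
```

A channel whose spread is close to zero is rescaled in the same way for every class, which is the behaviour reported for the early layers; larger spreads indicate class-specific reweighting.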