Squeeze-and-Excitation Networks (2017 ImageNet winner)
Code
Caffe implementation available at https://github.com/hujie-frank/SENet
Contribution
The authors introduce an architectural extension called the ‘Squeeze-and-Excitation’ (SE) block, which should enable a network “to perform feature recalibration through which it can […] selectively emphasise […] and suppress” features.
They show how such SE blocks improve performance on several datasets for several architectures while maintaining a reasonable network complexity (in terms of the number of parameters as well as computational load).
Proposed mechanism
The basic idea is to let the network model nonlinear interdependencies between the features of different channels, without any additional supervision. This is achieved by reducing each output feature map of a transform block of the original network to a global statistic (e.g. via global average pooling) and predicting a scalar weight per channel from the resulting vector of channel-wise statistics.
SE block
 \(\mathbf{F}_{tr}\): transform of the original network, e.g. convolutional block
 \(z_c = \mathbf{F}_{sq}(\mathbf{u}_c) = \frac{1}{H \times W} \sum_{i=1}^H \sum_{j=1}^W \, u_c(i, j)\): squeeze operation
 \(\mathbf{s} = \mathbf{F}_{ex}(\mathbf{z}, \mathbf{W}) = \sigma(g(\mathbf{z}, \mathbf{W})) = \sigma(\mathbf{W}_2\delta(\mathbf{W}_1\mathbf{z}))\): excitation operation (\(\delta\): ReLU)
 \(\tilde{\mathbf{x}}_c = \mathbf{F}_{scale}(\mathbf{u}_c, s_c) = s_c \cdot \mathbf{u}_c\): recalibration operation (i.e. rescaling)
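The following is a minimal sketch of the SE block in PyTorch, assuming the paper's default reduction ratio \(r = 16\); the reference implementation is in Caffe, so the module structure and names here are illustrative only.

```python
# Minimal PyTorch sketch of an SE block (the reference code is Caffe,
# so module layout and the reduction ratio r = 16 are assumptions here).
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # excitation: two FC layers with a bottleneck of size C/r
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction),   # W_1
            nn.ReLU(inplace=True),                        # delta
            nn.Linear(channels // reduction, channels),   # W_2
            nn.Sigmoid(),                                  # sigma
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: output of the transform F_tr with shape (N, C, H, W)
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))           # squeeze: global average pooling -> (N, C)
        s = self.excitation(z)           # excitation: per-channel weights in (0, 1)
        return u * s.view(n, c, 1, 1)    # recalibration: channel-wise rescaling
```

Chaining \(\mathbf{F}_{tr}\) with this block reproduces the full pipeline from \(\mathbf{u}\) to \(\tilde{\mathbf{x}}\) described above.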
Examples of extending existing architectures
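As a hedged illustration of how an SE block attaches to a residual module (following the SE-ResNet design from the paper, where the SE block rescales the residual branch before the identity addition); `residual_branch` is a placeholder and not part of the reference code:

```python
# Sketch of an SE-ResNet-style module: the SE block recalibrates the
# residual branch before it is added to the identity shortcut.
# `residual_branch` stands in for the block's original convolutions
# and is an assumption, not taken from the reference implementation.
import torch
import torch.nn as nn

class SEResidualBlock(nn.Module):
    def __init__(self, residual_branch: nn.Module, channels: int, reduction: int = 16):
        super().__init__()
        self.residual_branch = residual_branch
        self.se = SEBlock(channels, reduction)  # SEBlock as sketched above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.residual_branch(x)   # F_tr of the original network
        u = self.se(u)                # squeeze, excite, rescale
        return torch.relu(u + x)      # identity shortcut (matching shapes assumed)
```

Applying the rescaling before the summation leaves the identity path itself untouched.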
Experiments
Extension of existing architectures
 SE blocks led to improvements for all investigated base networks on the ImageNet 2012 dataset.
 The computational overhead is small (see the rough parameter estimate after this list).
 The improvement is consistent across different network depths.
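As a rough estimate of the parameter overhead (derived from the excitation equations above, with the paper's default reduction ratio \(r = 16\)): an SE block acting on \(C\) channels adds about \(\frac{2C^2}{r}\) weights for the two fully connected layers, e.g. only \(8192\) extra parameters for \(C = 256\), which is why the overall increase in model size stays modest.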
Different data sets

Similar improvements were also shown for other datasets:
 Scene Classification: Places365
 Object Detection: COCO
Analysis of reweighting step
Caption: Colored curves represent the average activations for different classes (computed over 50 samples for each class) plotted over channel index.
 In ‘early’ layers, the activations of the excitation step (i.e. the rescaling weights) are nearly identical across different classes.
 In ‘later’ layers (e, f), the activations are saturated.
 The reweighting among channels seems to be most significant in ‘intermediate’ layers.
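A hedged sketch of how such per-class excitation statistics could be collected, assuming the `SEBlock` above and a standard PyTorch forward hook; `model`, `se_block`, and `loader` are placeholders and not part of the authors' analysis code:

```python
# Collect the average excitation activations per class for one SE block;
# all names (model, se_block, loader) are illustrative assumptions.
import torch
from collections import defaultdict

def collect_excitations(model, se_block, loader, device="cpu"):
    """Average the excitation output s of one SE block per class label."""
    sums, counts = defaultdict(lambda: 0.0), defaultdict(int)
    captured = {}

    # capture the sigmoid output s of the chosen SE block on every forward pass
    hook = se_block.excitation.register_forward_hook(
        lambda module, inputs, output: captured.update(s=output.detach())
    )
    model.eval()
    with torch.no_grad():
        for images, labels in loader:
            model(images.to(device))
            for s, y in zip(captured["s"].cpu(), labels):
                sums[int(y)] = sums[int(y)] + s
                counts[int(y)] += 1
    hook.remove()
    # per-class mean excitation: one vector over the channel index per class
    return {y: sums[y] / counts[y] for y in sums}
```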