LCNN is a novel layer that can replace any convolutional layer. The main thing is that a filter is “decomposed” as a linear combination of parts from a dictionary. The ability to tune the dictionary size and the number of elements in the linear combination allows an efficiency/accuracy trade-off (see Figure 3). In addition, the authors argue that LCNN has an advantage in few-shot learning and few-iteration learning settings.

Filter Construction

Let’s first define the 4 tensors that we’ll need:

\(\bold{W} \in \mathbb{R}^{m \times k_w \times k_h}\) is a filter, where \((k_w, k_h)\) are its spatial dimensions and \(m\) is the number of channels in the input image.
\(\bold{D} \in \mathbb{R}^{k \times m}\) is the dictionary. It contains \(k\) vectors of size \(m\). Those vectors will be combined to construct the filter (See Figure 1).
\(\bold{I} \in \mathbb{N}^{s \times k_w \times k_h}\) is the lookup indices tensor, where \(s\) is the number of elements in the linear combination.
\(\bold{C} \in \mathbb{R}^{s \times k_w \times k_h}\) is the coefficient tensor.

Then, we can formulate the construction as follows:

\[\bold{W}_{[:,r,c]} = \sum_{t=1}^s \bold{C}_{[t,r,c]} \cdot \bold{D}_{[\bold{I}_{[t,r,c]},:]}\]

Fast Convolution using a Shared Dictionary

Now, let’s imagine that our dictionary is in fact the weights of a 1x1 convolution layer: we thus have \(k\) filters of dimensions \((m, 1, 1)\). By passing our input image \(\bold{X} \in \mathbb{R}^{m \times w \times h}\) through this 1x1 convolutional layer, we get \(\bold{S} \in \mathbb{R}^{k \times w \times h}\). With \(\bold{I}\) and \(\bold{C}\) in hand, we can compute the result of the convolution \(\bold{X} * \bold{W}\) for each filter, without really constructing any \(\bold{W}\). (See section 3.1.1 of paper.) By reducing \(k\) and \(s\), we can reduce the number of lookups and floating point operations.

One problem arises from using indices to select values in tensors: it makes the network non-differentiable. During training, a trick is used to be able to backpropagate. For each filter \(\bold{W}\), a sparse tensor \(\bold{P} \in \mathbb{R}^{k \times k_w \times k_h}\) is constructed. In this tensor, elements at indices \(\bold{I}\) have values \(\bold{C}\). Then, we have that \(\bold{X} * \bold{W} = \bold{S} * \bold{P}\) for each \(\bold{W}, \bold{P}\) pair.

During training, we learn \(\bold{D}\) and \(\bold{P}\). To obtain a tensor \(\bold{P}\) that is \(s\)-sparse, we impose a \(\ell_1\) norm and keep only \(s\) non-zero elements. From \(\bold{P}\), we can easily obtain \(\bold{I}\) and \(\bold{C}\).

Few-shot learning

The authors argue that this layer architecture has a performance advantage in few-shot learning settings, as we can see in the following plot (extracted from Figure 4).

Few-iteration learning

The authors also argue that LCNN can learn more in the first iterations of training than an ordinary CNN. In one experiment, they transferred the dictionary learned with a shallow network to a deeper network and only trained \(\bold{I}\) and \(\bold{C}\). See Figure 5 for the learning curves.