Budget-Aware Regularization (BAR) makes it possible to train and prune a neural network simultaneously, while respecting a neuron budget. The method aims to “turn off” irrelevant feature maps in CNNs. To determine which feature maps are relevant, “Dropout Sparsity Learning” is used; to respect the budget, a novel barrier function is introduced.

Contributions at a glance:

• Objective function that optimizes/constrains the total number of neurons
• Novel barrier optimization method
• Mixed-Connectivity Block that leverages atypical connectivity

# Dropout Sparsity Learning

We learn which feature maps to keep by using a special kind of dropout. The first difference from traditional dropout is that each feature map is associated with its own Bernoulli parametrization (i.e. its own probability of being active). Let’s call this “feature-wise dropout”.

$$\mathbf{z}$$ is a vector of “dropout variables”. The goal is to learn the parametrization of each dropout variable, with a preference for parametrizations that turn the feature map off. However, ordinary feature-wise dropout, because of its discrete nature, does not allow the parameters to be learned by backpropagation. We will thus use a continuous relaxation of the Bernoulli distribution, called the Hard Concrete distribution. Using the Reparametrization Trick, we can draw samples from it like so:

$$\begin{aligned} z &= g(\Phi_{l-1},\epsilon) \\ z &\in [0,1] \\ \epsilon &\sim \mathcal{U}(0,1) \end{aligned}$$
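As a concrete illustration, here is a minimal NumPy sketch of the sampler $$g$$, using the stretch-and-clamp parametrization of the Hard Concrete; the constants $$\beta$$, $$\gamma$$, $$\zeta$$ are the common defaults from the L0-regularization literature, not values stated in this post:

```python
import numpy as np

def hard_concrete_sample(log_alpha, eps, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Sample z = g(log_alpha, eps) with eps ~ U(0, 1).

    log_alpha parametrizes the distribution; beta, gamma, zeta are the
    usual temperature/stretch constants (assumed defaults).
    """
    # Concrete (relaxed Bernoulli) sample via the Reparametrization Trick
    s = 1.0 / (1.0 + np.exp(-(np.log(eps) - np.log(1.0 - eps) + log_alpha) / beta))
    # Stretch to (gamma, zeta), then clamp to [0, 1]: this puts point
    # masses at exactly 0 and exactly 1.
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)
```

A very negative `log_alpha` makes most samples land exactly at 0 (a pruned feature map); a very positive one makes them land exactly at 1.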

We can then rewrite Eq. (1) so that each feature map is multiplied by its dropout variable $$z$$.

The following is a plot of the function $$g(\Phi,\epsilon)$$ associated with the Hard Concrete. For our purposes, $$\Phi := \alpha$$.

Here is some intuition about this plot. Take the blue curve; it is associated with an irrelevant feature map. Why? Imagine drawing samples $$\epsilon \sim \mathcal{U}(0,1)$$ and feeding them to this function: it will mostly output zero; in fact $$P(z=0) \approx 0.6$$.

The interpretation of $$\alpha$$ is that it controls $$P(z=0)$$, $$P(z=1)$$, and the probability density in between (for $$z \in~]0,1[~$$). The weird thing with the Hard Concrete is that while there is virtually no chance of picking a specific intermediate value, for example $$z=0.5$$ (why 0.5 when you could have 0.50001?), there is a substantial probability of picking exactly $$z=0$$ (or $$z=1$$). Why should we give a $$\int u \subset k$$, you ask? Well, we’d like $$z$$ to be zero as much as possible, since this corresponds to a pruned feature map. But if $$z$$ is close to zero yet non-zero, the signal can still go through; the next layer could scale it back up to a useful range. Thus, picking exactly zero is important1.

To sparsify the network, we will minimize $$P(z>0)$$ for all $$z$$, for which we have a closed-form expression $$L_S(\Phi)$$.
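For reference, here is a sketch of that closed-form probability and the resulting sparsity loss, using the expression $$P(z>0)=\sigma(\log\alpha-\beta\log(-\gamma/\zeta))$$ from the L0-regularization literature; the constants are assumed defaults, not values given in this post:

```python
import numpy as np

def prob_nonzero(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    # Closed-form P(z > 0) for the Hard Concrete:
    # sigmoid(log_alpha - beta * log(-gamma / zeta))
    return 1.0 / (1.0 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))

def sparsity_loss(log_alphas):
    # L_S: the sum of "keep" probabilities over all dropout variables;
    # minimizing it pushes feature maps toward being exactly off.
    return float(np.sum(prob_nonzero(np.asarray(log_alphas))))
```

Gradient descent on `sparsity_loss` drives each `log_alpha` downward, which in turn concentrates the Hard Concrete’s mass at exactly zero.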

# Budget-Aware Regularization

To promote the removal of neurons, a sparsity term is used in the loss. To enforce the budget, a novel barrier function $$f(V,a,b)$$ is used:

where $$V$$ is the volume of the activation maps2 and $$b$$ is the budget. The hyperparameter $$a$$ corresponds to the value of $$V$$ at which the budget is comfortably respected (here, $$a=1$$). The barrier function is multiplied with the sparsity loss:

During training, $$b$$ transitions from a value larger than the initial $$V$$ down to $$B$$, the “final budget”.
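One simple way to implement this transition is a linear schedule; this is only a sketch, and the paper’s actual transition function may differ:

```python
def budget_schedule(step, total_steps, v_init, final_budget):
    """Anneal the running budget b from (above) the initial volume v_init
    down to the final budget B over total_steps training steps."""
    t = min(max(step / total_steps, 0.0), 1.0)  # training progress in [0, 1]
    return (1.0 - t) * v_init + t * final_budget
```

Starting the schedule above the initial $$V$$ means the barrier is inactive early on, so pruning pressure ramps up gradually instead of destabilizing training at step 0.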

The complete loss function is the sum of a data term and the sparsity loss term above. The data term can be a cross-entropy loss, but a Knowledge Distillation objective is suggested to make pruning easier.

# Mixed-Connectivity Block

Pruning feature maps in residual blocks gives rise to atypical connectivity patterns:

These connectivity patterns generally do not provide computational savings, unless they are leveraged by a special Mixed-Connectivity block implementation (given in the paper).

# Results

At the highest pruning factors, this method outperforms the state of the art. It also requires less hyperparameter tuning.

# Notes

1. Picking exactly one is also nice (it would mean that a feature map is REALLY important), but it’s not as crucial as $$P(z=0) \gg 0$$.

2. $$V = \sum_l \sum_i \mathbf{1}(\mathbb{E}[z_{l,i} \vert \Phi] > 0) \times A_l$$ where $$l$$ is the layer, $$i$$ is the feature map and $$A_l$$ is the area of the feature maps.
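In code, this volume could be computed as follows (a sketch; `expected_z` holds the values $$\mathbb{E}[z_{l,i} \vert \Phi]$$ per layer, and `areas` holds each $$A_l$$ — both names are hypothetical):

```python
def activation_volume(expected_z, areas):
    # V = sum_l sum_i 1[E[z_{l,i} | Phi] > 0] * A_l:
    # each surviving feature map contributes the area of its layer's maps.
    return sum(
        sum(1 for ez in layer_ez if ez > 0) * area
        for layer_ez, area in zip(expected_z, areas)
    )
```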