In this awkwardly-organized paper with 17 pages of main content and 38 pages of appendix, the authors tackle many research directions related to gradient explosion/vanishing and the training of deep networks. They introduce many theoretic contributions, backed by experiments on 50-layer MLP’s of width 100.

One of their main contributions is the Gradient Scale Coefficient (GSC), a measure of the extent at which gradients explode (or vanish) in a network. They argue that other notions are insufficient at measuring - or even describing - the phenomena of gradient explosion. They also argue that gradient explosion is important, as it prevents deep networks from being successfully trained.

In figure 1A below, we see that many architectures have a linear progression of the GSC (in log-space), which denotes gradient explosion. Fig 1B and 1C are explained later.

Another contribution is a method for computing the effective depth of a network. The authors explain that the effective depth is (almost) always lesser than the compositional depth (they define the compositional depth as the number of parametrized layers) of a network. This can be caused by gradient explosion. To check their estimation of effective depth, they do the following experiment. They progressively replace the last layers of a network with a single layer, which is a Taylor expansion of the replaced layers.

In the figure below, (D) is the progression of effective depth with training epochs; (F) is the effect of the experiments with the Taylor expansion on the training error. As you can see, an architecture with a greater effective depth will suffer sooner from a compositional depth reduction. Observe the other subplots at your own risk.

Collapsing domain

The collapsing domain problem is another cause for a small effective depth. When the pre-activations that are fed into a nonlinearity are highly similar, the nonlinearity can be well-approximated by a linear function. In these cases, we call the layer pseudo-linear. For example, a tanh network that is pseudo-linear beyond compositional depth $$k$$ has approximately the representational capacity of a depth $$k + 1$$ network. The author propose two ways of measuring this problem:

• Pre-activation standard deviation
• Pre-activation sign diversity, defined as $$\frac{min(P, N)}{P+N}$$, where $$P$$ is the number of positive neurons and $$N$$ the number of negative neurons. This measure is particularly important for ReLU networks, where nonlinearities with low sign diversity can be well approximated by either the identity function or the zero function.

You have now unlocked the figure 1B and 1C! Hurray!!!