Introduction

In this paper, the authors show that when training with batch normalization (BN) and weight decay (WD), one can use an exponentially increasing learning rate and still obtain good results. They also present a mathematical proof that training with this exponential learning rate schedule is equivalent to BN + SGD + Standard Rate Tuning + Weight Decay + Momentum.

  • It is usual, when using BN, to lower the learning rate once the validation loss plateaus.

They use scale invariance to keep the computations in the proofs manageable, namely:
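
These are presumably the standard consequences of scale invariance: for BN networks the loss is invariant to rescaling the weights of the normalized layers, i.e. \(L(c\,w) = L(w)\) for all \(c > 0\), which implies

\[
\nabla L(c\,w) \;=\; \tfrac{1}{c}\,\nabla L(w), \qquad \langle \nabla L(w),\, w \rangle \;=\; 0 .
\]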

Warm-up

First, they show that a fixed LR combined with a fixed WD can be translated into an equivalent exponential LR. Consider the following notation:
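
A minimal version of that notation, assuming plain SGD (no momentum) with learning rate \(\eta\), weight decay \(\lambda\), and scale-invariant loss \(L_t\) at step \(t\):

\[
w_{t+1} \;=\; (1 - \lambda\eta)\, w_t \;-\; \eta\, \nabla L_t(w_t).
\]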

Then the theorem reads:
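
The exact wording is not reproduced in these notes; a sketch of the content, derived from the update rule above, is the following. Setting \(\alpha = 1 - \lambda\eta\) and \(\tilde{w}_t = \alpha^{-t} w_t\), the update becomes WD-free SGD with an exponentially growing learning rate,

\[
\tilde{w}_{t+1} \;=\; \tilde{w}_t \;-\; \eta\,\alpha^{-2t-1}\, \nabla L_t(\tilde{w}_t),
\]

where scale invariance is what allows \(\nabla L_t(\alpha^t \tilde{w}_t)\) to be rewritten as \(\alpha^{-t}\nabla L_t(\tilde{w}_t)\). In other words, the learning rate grows by the factor \((1-\lambda\eta)^{-2}\) at every step, and the two trajectories coincide up to the rescaling \(w_t = \alpha^t \tilde{w}_t\).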

Main Theorem
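
The precise statement is not reproduced here; the following is a sketch of its flavor, obtained by extending the warm-up calculation to heavy-ball momentum \(\gamma\). With weight decay, the update is

\[
w_{t+1} \;=\; (1 + \gamma - \lambda\eta)\, w_t \;-\; \gamma\, w_{t-1} \;-\; \eta\, \nabla L_t(w_t).
\]

Rescaling \(w_t = \alpha^t \tilde{w}_t\), where \(\alpha\) is the root close to \(1\) of

\[
x^2 - (1 + \gamma - \lambda\eta)\, x + \gamma \;=\; 0,
\]

turns this into a WD-free momentum process whose learning rate grows by the factor \(\alpha^{-2}\) at every iteration, with a slightly corrected momentum coefficient \(\gamma/\alpha^2\) (handling that correction is the subject of the section below).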


An example of this theorem is:
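
As an illustration, the per-iteration growth factor implied by the sketch above can be computed directly; the hyperparameter values below are commonly used ones assumed for illustration, not necessarily those of the paper.

```python
import math

# Assumed, illustrative hyperparameters (not taken from the paper):
eta, lam, gamma = 0.1, 5e-4, 0.9   # learning rate, weight decay, momentum

# Root of x^2 - (1 + gamma - lam*eta) x + gamma = 0 that lies close to 1.
b = 1 + gamma - lam * eta
alpha = (b + math.sqrt(b * b - 4 * gamma)) / 2

growth = alpha ** -2               # per-iteration LR growth factor
print(f"alpha ~= {alpha:.6f}, per-step LR growth ~= {growth:.6f}")
# With these values alpha ~= 0.9995, so the equivalent exponential schedule
# grows by roughly 0.1% per iteration, i.e. about 1.5x over 390 iterations.
```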

Correcting Momentum

To understand the practically useful exponential learning rate schedule, a definition of Step Decay is needed.
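
Step Decay is the standard schedule (presumably the one the paper refers to) in which the learning rate is held constant within each phase and divided by a fixed factor at every phase boundary:

\[
\eta(t) \;=\; \eta_I \quad \text{for } t \text{ in phase } I, \qquad \eta_I \;=\; \frac{\eta_{I-1}}{k} \;\;\text{ for some constant } k > 1 .
\]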

Testing the mathematical equivalence empirically, they observe the agreement predicted by the theorem.
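
A minimal numerical check of the momentum-free (warm-up) version of this equivalence is sketched below; the toy scale-invariant loss, the step count, and the hyperparameter values are assumptions made purely for illustration.

```python
import numpy as np

def grad(w, target):
    """Gradient of the scale-invariant toy loss L(w) = 0.5 * ||w/||w|| - target||^2."""
    n = np.linalg.norm(w)
    u = w / n
    r = u - target
    # Chain rule through u = w/||w||: project r onto the tangent space at u, divide by ||w||.
    return (r - np.dot(r, u) * u) / n

rng = np.random.default_rng(0)
target = rng.normal(size=10)
target /= np.linalg.norm(target)
w0 = rng.normal(size=10)

eta, lam, steps = 0.1, 1e-2, 200
alpha = 1 - lam * eta

# (A) SGD with weight decay and a constant learning rate.
w = w0.copy()
for t in range(steps):
    w = (1 - lam * eta) * w - eta * grad(w, target)

# (B) SGD without weight decay and an exponentially growing learning rate.
w_tilde = w0.copy()
for t in range(steps):
    w_tilde = w_tilde - eta * alpha ** (-2 * t - 1) * grad(w_tilde, target)

# The trajectories coincide up to the rescaling w_t = alpha^t * w_tilde_t.
print(np.max(np.abs(w - alpha ** steps * w_tilde)))   # ~0 up to floating-point rounding
```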

More Experiments

When entering a new phase (i.e., a training period at the start of which the learning rate changes and then stays constant for a while), the TEXP (tapered-exponential) learning rate schedule combines two ingredients (a sketch follows the list):

  • An instant learning rate decay by the factor \(\frac{\eta_I}{\eta_{I-1}}\).
  • An adjustment of the growth factor \(\alpha_{I-1}^* \to \alpha_I^*\).
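
A hypothetical sketch of how such a schedule could be assembled from these two rules is given below; the function name, the phase description format, and the choice of \(\alpha_I^*\) as the near-\(1\) root of the quadratic from the main-theorem sketch are all assumptions, not the paper's exact construction.

```python
import math

def texp_lrs(phases, lam, gamma=0.9):
    """Build a tapered-exponential (TEXP-style) LR sequence from a step-decay schedule.

    `phases` is a list of (eta_I, num_iterations) pairs describing the original
    step-decay schedule. Inside phase I the LR grows by alpha_I**-2 per iteration;
    at each phase boundary it is instantly decayed by eta_I / eta_{I-1}.
    """
    lrs, lr, prev_eta = [], phases[0][0], phases[0][0]
    for eta, num_iters in phases:
        lr *= eta / prev_eta                            # instant decay on entering the phase
        b = 1 + gamma - lam * eta
        alpha = (b + math.sqrt(b * b - 4 * gamma)) / 2  # growth-factor root near 1
        for _ in range(num_iters):
            lr *= alpha ** -2                           # exponential growth inside the phase
            lrs.append(lr)
        prev_eta = eta
    return lrs

# Example with an assumed 3-phase step-decay schedule 0.1 -> 0.01 -> 0.001:
schedule = texp_lrs([(0.1, 300), (0.01, 300), (0.001, 300)], lam=5e-4)
```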

Conclusion

  • The paper shows that Batch Normalization allows very exotic learning rate schedules, and verifies these effects in experiments.

  • In the equivalent formulation, the learning rate increases exponentially at almost every iteration during training.

  • The exponential increase derives from the use of weight decay, but the precise growth factor involves momentum as well.

So the authors conclude that the efficacy of this rule may be hard to explain within canonical frameworks of optimization.