# Introduction

In this paper, the authors show that when using batch normalization (BN) and weight decay, one can use an exponentially increasing learning rate and still obtain good results. They also give a mathematical proof that this exponential learning rate schedule is equivalent to BN + SGD + Standard Rate Tuning + Weight Decay + Momentum.

• When using BN, it is usual to lower the learning rate once the validation loss plateaus.

The proofs rely on scale invariance to keep the computations manageable, namely:
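
As a sketch, under the usual definition of scale invariance (the symbols here are assumptions: $$L$$ is the loss, $$\theta$$ the weights feeding into a BN layer, and $$c > 0$$ any scalar), the property and the two identities that typically follow from it can be written as:

$$L(c\theta) = L(\theta), \qquad \nabla L(c\theta) = \frac{1}{c}\,\nabla L(\theta), \qquad \langle \nabla L(\theta), \theta \rangle = 0.$$

The second identity comes from differentiating the first with respect to $$\theta$$, and the third from differentiating it with respect to $$c$$ at $$c = 1$$.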

# Warm-up

First, the authors show that a fixed learning rate (LR) combined with a fixed weight decay (WD) can be translated into an equivalent exponential LR schedule. Consider the following notation:

Then the theorem reads:
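
A minimal numerical sketch of the warm-up claim may help. It assumes notation that is not in the text: $$\eta$$ for the fixed LR, $$\lambda$$ for the WD coefficient, and $$\rho = 1 - \lambda\eta$$. Applying the rescaling $$\tilde\theta_t = \rho^{-t}\theta_t$$ to the update $$\theta_{t+1} = \rho\,\theta_t - \eta\nabla L(\theta_t)$$ and using scale invariance gives a WD-free update with LR $$\tilde\eta_t = \eta\,\rho^{-2t-1}$$. The toy loss `-<a, theta> / ||theta||` and the full-batch gradient are stand-ins chosen for illustration, not the setup used in the paper.

```python
import numpy as np

def grad(theta, a):
    """Gradient of the scale-invariant toy loss L(theta) = -<a, theta> / ||theta||."""
    norm = np.linalg.norm(theta)
    u = theta / norm
    return -(a - (a @ u) * u) / norm

def gd_fixed_lr_wd(theta0, a, eta, lam, steps):
    """Gradient descent with fixed LR eta and weight decay lam."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = (1.0 - lam * eta) * theta - eta * grad(theta, a)
    return theta

def gd_exp_lr(theta0, a, eta, lam, steps):
    """Gradient descent without weight decay, using LR eta * rho**(-2t - 1)."""
    rho = 1.0 - lam * eta
    theta = theta0.copy()
    for t in range(steps):
        theta = theta - eta * rho ** (-2 * t - 1) * grad(theta, a)
    return theta

rng = np.random.default_rng(0)
a, theta0 = rng.normal(size=5), rng.normal(size=5)
eta, lam, steps = 0.1, 0.01, 200  # illustrative values only

with_wd = gd_fixed_lr_wd(theta0, a, eta, lam, steps)
with_exp = gd_exp_lr(theta0, a, eta, lam, steps)
rho = 1.0 - lam * eta

# The two runs should differ only by the scale rho**steps, which a
# scale-invariant loss cannot see.
print(np.allclose(with_wd, rho ** steps * with_exp))
```

Because the loss is scale invariant, two weight vectors that differ only by a positive factor define the same function, which is why the comparison is made after rescaling by `rho ** steps`.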

# Main Theorem

An example of this theorem is:
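
The following sketch illustrates the momentum case numerically. It assumes the heavy-ball form $$\theta_{t+1} = (1 + \gamma - \lambda\eta)\,\theta_t - \gamma\,\theta_{t-1} - \eta\nabla L(\theta_t)$$ with momentum $$\gamma$$ (the symbols are assumptions, since the statement is not reproduced in the text). Substituting $$\tilde\theta_t = \alpha^{-t}\theta_t$$ and using scale invariance removes the WD term when $$\alpha$$ solves $$\alpha^2 - (1 + \gamma - \lambda\eta)\,\alpha + \gamma = 0$$, leaving the LR $$\tilde\eta_t = \eta\,\alpha^{-2t-1}$$. The code takes the larger root and uses a toy scale-invariant loss as a stand-in.

```python
import numpy as np

def grad(theta, a):
    """Gradient of the scale-invariant toy loss L(theta) = -<a, theta> / ||theta||."""
    norm = np.linalg.norm(theta)
    u = theta / norm
    return -(a - (a @ u) * u) / norm

def alpha_root(eta, lam, gamma):
    """Larger root of x**2 - (1 + gamma - lam*eta) * x + gamma = 0."""
    b = 1.0 + gamma - lam * eta
    return (b + np.sqrt(b * b - 4.0 * gamma)) / 2.0

def momentum_with_wd(theta0, a, eta, lam, gamma, steps):
    """Heavy-ball updates with fixed LR and weight decay, started at rest."""
    prev, cur = theta0.copy(), theta0.copy()
    for _ in range(steps):
        nxt = (1.0 + gamma - lam * eta) * cur - gamma * prev - eta * grad(cur, a)
        prev, cur = cur, nxt
    return cur

def momentum_with_exp_lr(theta0, a, eta, lam, gamma, steps):
    """Heavy-ball updates without weight decay, LR eta * alpha**(-2t - 1)."""
    alpha = alpha_root(eta, lam, gamma)
    # The rescaled run needs theta_{-1} = alpha * theta_0 to mirror a start at rest.
    prev, cur = alpha * theta0, theta0.copy()
    for t in range(steps):
        lr = eta * alpha ** (-2 * t - 1)
        # Momentum term is scaled by the LR ratio lr_t / lr_{t-1} = alpha**-2.
        nxt = cur + gamma * alpha ** -2 * (cur - prev) - lr * grad(cur, a)
        prev, cur = cur, nxt
    return cur

rng = np.random.default_rng(0)
a, theta0 = rng.normal(size=5), rng.normal(size=5)
eta, lam, gamma, steps = 0.1, 5e-4, 0.9, 200  # illustrative values only

alpha = alpha_root(eta, lam, gamma)
with_wd = momentum_with_wd(theta0, a, eta, lam, gamma, steps)
with_exp = momentum_with_exp_lr(theta0, a, eta, lam, gamma, steps)

print("per-iteration LR growth factor:", alpha ** -2)
print(np.allclose(with_wd, alpha ** steps * with_exp, rtol=1e-6))
```

For these illustrative hyperparameters the per-iteration growth factor `alpha ** -2` is only slightly above 1, so the LR grows slowly per step but exponentially over the whole run.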

# Correcting Momentum

A definition of Step Decay is needed to understand the exponential learning rate schedule that is useful in practice:
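
As a sketch in assumed notation (the phase boundaries $$T_I$$ and the number of phases $$K$$ are not defined in the surviving text), Step Decay splits training into $$K$$ phases and keeps the learning rate constant within each one:

$$0 = T_0 < T_1 < \dots < T_K, \qquad \eta_t = \eta_I \quad \text{for } T_I \le t < T_{I+1},\quad I = 0, \dots, K-1.$$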

Testing the mathematical equivalence experimentally, they get:

# More Experiments

The TEXP (Tapered-Exponential) learning rate schedule has two parts when entering a new phase (i.e., a training period at whose start the Step Decay learning rate changes and then stays fixed for a while):

• An instant learning rate decay by the factor $$\frac{\eta_I}{\eta_{I-1}}$$.
• An adjustment of the growth factor $$\alpha_{I-1}^* \to \alpha_I^*$$.
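
The two parts above can be sketched for the momentum-free special case, where the conversion can be derived directly from scale invariance. The assumption here is that $$\alpha_I^*$$ reduces to $$1 - \lambda\eta_I$$ without momentum, with $$\lambda$$ the WD coefficient. The helper names `step_decay_lrs` and `texp_lrs` are hypothetical, and the toy loss is a stand-in for a BN network.

```python
import numpy as np

def grad(theta, a):
    """Gradient of the scale-invariant toy loss L(theta) = -<a, theta> / ||theta||."""
    norm = np.linalg.norm(theta)
    u = theta / norm
    return -(a - (a @ u) * u) / norm

def step_decay_lrs(boundaries, rates, total_steps):
    """Per-iteration LRs: rates[I] applies for boundaries[I] <= t < boundaries[I + 1]."""
    phases = np.searchsorted(boundaries, np.arange(total_steps), side="right") - 1
    return [rates[p] for p in phases]

def texp_lrs(step_lrs, lam):
    """Convert Step Decay + WD (no momentum) into a WD-free exponential schedule."""
    rhos = [1.0 - lam * eta for eta in step_lrs]
    out = [step_lrs[0] / rhos[0]]
    for t in range(len(step_lrs) - 1):
        instant_decay = step_lrs[t + 1] / step_lrs[t]   # 1 inside a phase
        growth = 1.0 / (rhos[t + 1] * rhos[t])          # rho_I**-2 inside a phase
        out.append(out[-1] * instant_decay * growth)
    return out

rng = np.random.default_rng(0)
a, theta0 = rng.normal(size=5), rng.normal(size=5)
lam = 0.01  # illustrative value only
lrs = step_decay_lrs([0, 100, 150], [0.1, 0.01, 0.001], total_steps=200)
tilde_lrs = texp_lrs(lrs, lam)

theta_wd, theta_texp = theta0.copy(), theta0.copy()
for eta, eta_tilde in zip(lrs, tilde_lrs):
    theta_wd = (1.0 - lam * eta) * theta_wd - eta * grad(theta_wd, a)
    theta_texp = theta_texp - eta_tilde * grad(theta_texp, a)

# Both runs should agree up to the accumulated weight-decay shrinkage.
scale = np.prod([1.0 - lam * eta for eta in lrs])
print(np.allclose(theta_wd, scale * theta_texp))
```

Inside a phase the converted LR grows by a constant factor per iteration. At a phase boundary it is multiplied once by the instant decay, and the growth factor switches to the value for the new phase.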

# Conclusion

• The paper shows that Batch Normalization allows very exotic learning rate schedules and verifies these effects in experiments.

• The learning rate increases exponentially in almost every iteration during training.

• The exponential increase derives from the use of weight decay, but the precise expression involves momentum as well.

The authors conclude that the efficacy of this rule may be hard to explain with canonical frameworks in optimization.