# Highlights

The paper behind the PyTorch `OneCycleLR` learning rate scheduler.

# Introduction

The goal here is to adjust the learning rate $$\epsilon$$ of plain stochastic gradient descent:

$$\theta^{t+1} = \theta^{t} - \epsilon \, \nabla_\theta L(\theta^{t})$$

They start from the AdaSecant method by Gulcehre et al., which builds an adaptive learning rate from a finite-difference approximation of the Hessian (appearing in the denominator of the update):
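As a hedged sketch of that finite-difference idea (the notation below is mine, not copied from either paper): a secant through two consecutive gradients approximates the diagonal curvature, and inverting it yields a per-coordinate learning-rate estimate,

$$H_{ii}^{t} \approx \frac{\nabla_i f(\theta^{t}) - \nabla_i f(\theta^{t-1})}{\theta_i^{t} - \theta_i^{t-1}}, \qquad \epsilon_i^{t} \approx \frac{\theta_i^{t} - \theta_i^{t-1}}{\nabla_i f(\theta^{t}) - \nabla_i f(\theta^{t-1})}$$

i.e. a Newton-like step $$1/H_{ii}$$ with the curvature replaced by its finite-difference estimate.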

Then, by combining Eq. (4) and (7), they get the following update rule:

The learning rate is then updated with the following moving average:
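A minimal sketch of this idea (the toy quadratic objective, the EMA decay `beta`, and all variable names are my own assumptions, not taken from the paper): estimate a learning rate from the secant through two consecutive gradients, then smooth it with a moving average.

```python
def grad(theta):
    # Gradient of the toy quadratic f(theta) = 1.5 * theta**2.
    return 3.0 * theta

theta_prev, theta = 2.0, 1.9   # two initial iterates (assumed values)
lr_smooth, beta = 0.0, 0.9     # EMA state and decay (assumed values)

for _ in range(50):
    g_prev, g = grad(theta_prev), grad(theta)
    # Secant (finite-difference) estimate of the inverse curvature,
    # which doubles as a learning-rate estimate.
    lr_est = (theta - theta_prev) / (g - g_prev)
    # Exponential moving average of the estimate.
    lr_smooth = beta * lr_smooth + (1.0 - beta) * lr_est
    theta_prev, theta = theta, theta - lr_smooth * g

print(lr_smooth)  # approaches the inverse curvature 1/3
```

On this quadratic the secant estimate is exact (1/3), so the EMA converges toward it and the iterate is driven to the minimum.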

Doing so with a fixed learning rate at the beginning of the optimization process gives the blue curve in the following figure:

However, in their case they use two learning rates, $$LR_{min}$$ and $$LR_{max}$$, which they linearly interpolate between over a cycle: during the first half of the cycle the LR rises from $$LR_{min}$$ to $$LR_{max}$$, and during the second half it goes back down.
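The cycle above can be sketched as a simple triangular schedule (the function name, cycle length, and bound values here are assumptions for illustration, not the paper's settings):

```python
def cyclical_lr(step, cycle_len, lr_min, lr_max):
    """Triangular schedule: lr_min -> lr_max over the first half of the
    cycle, then linearly back down to lr_min over the second half."""
    pos = step % cycle_len
    half = cycle_len / 2
    if pos <= half:
        frac = pos / half                 # rising phase
    else:
        frac = (cycle_len - pos) / half   # falling phase
    return lr_min + frac * (lr_max - lr_min)

# One cycle of 10 steps between 0.1 and 1.0 (assumed values):
lrs = [cyclical_lr(s, cycle_len=10, lr_min=0.1, lr_max=1.0) for s in range(11)]
```

The schedule peaks at $$LR_{max}$$ exactly at mid-cycle and returns to $$LR_{min}$$ at the end, which is the shape PyTorch's `OneCycleLR` generalizes.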

Doing so leads to the orange curve in the previous plot, which corresponds to a larger average learning rate than the blue curve and hence faster convergence.

# Results

Quite amazing indeed! PC denotes the piecewise-constant training regime: the LR is held fixed, then reduced by a factor of X each time the validation error plateaus.
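For contrast with the cyclical schedule, a minimal sketch of such a piecewise-constant regime (the reduction factor, patience, and error values are assumptions, not the paper's settings):

```python
def pc_schedule(val_errors, lr0=0.1, factor=0.1, patience=2):
    """Hold the LR fixed; multiply it by `factor` whenever the validation
    error has not improved for `patience` consecutive epochs."""
    lr, best, bad, lrs = lr0, float("inf"), 0, []
    for err in val_errors:
        if err < best:
            best, bad = err, 0        # improvement: reset the counter
        else:
            bad += 1                  # plateau epoch
            if bad >= patience:
                lr *= factor          # drop the LR, start a new plateau watch
                bad = 0
        lrs.append(lr)
    return lrs

# Assumed validation-error trace with two plateaus:
lrs = pc_schedule([0.9, 0.8, 0.8, 0.8, 0.7, 0.7, 0.7])
```

Each plateau triggers one constant-factor drop, producing the staircase-shaped LR curve that the paper's cyclical schedule is compared against.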