# Optimizers

OpenSeq2Seq supports two new optimizers: LARC and NovoGrad.

## Layer-wise Adaptive Rate Control (LARC)

The key idea of LARC is to adjust the learning rate (LR) for each layer so that the magnitude of the weight updates stays small compared to the norm of the weights.

Neural network (NN) training is based on Stochastic Gradient Descent (SGD). For example, in “vanilla” SGD, a mini-batch of B samples $$x_i$$ is selected from the training set at each step t. Then the stochastic gradient $$g_t$$ of the loss function $$L(x_i, w)$$ with respect to the weights is computed for the mini-batch:

$g_t = \frac{1}{B} {\sum}_{i=1}^{B} \nabla L(x_i, w_t)$

and then weights w are updated based on this stochastic gradient:

$w_{t+1} = w_t - \lambda * g_t$
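The two formulas above can be sketched in a few lines of plain Python. This is only an illustration, not OpenSeq2Seq code: the helper names `sgd_step` and `toy_grad` and the quadratic toy loss are assumptions chosen for this example.

```python
def sgd_step(w, batch, grad_fn, lr):
    """One vanilla SGD step: average per-sample gradients, then update w."""
    B = len(batch)
    # g_t = (1/B) * sum_i grad L(x_i, w_t)
    g = [0.0] * len(w)
    for x in batch:
        for j, gj in enumerate(grad_fn(x, w)):
            g[j] += gj / B
    # w_{t+1} = w_t - lambda * g_t
    return [wj - lr * gj for wj, gj in zip(w, g)]


def toy_grad(x, w):
    """Gradient of the hypothetical toy loss L(x, w) = 0.5 * ||w - x||^2 w.r.t. w."""
    return [wj - xj for wj, xj in zip(w, x)]


w = [0.0, 0.0]
batch = [[1.0, 2.0], [3.0, 4.0]]
w_next = sgd_step(w, batch, toy_grad, lr=0.1)
```

Here the averaged gradient is `[-2.0, -3.0]`, so the weights move by `0.1` times that amount in the opposite direction.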

Standard SGD uses the same LR $$\lambda$$ for all layers. We found that the ratio of the L2-norms of the weights and the gradients, $$\frac{| w |}{| g_t |}$$, varies significantly between weights and biases and between different layers. The ratio is high during the initial phase and decreases rapidly after a few iterations. When $$\lambda$$ is large, the update $$| \lambda * g_t |$$ can become much larger than $$| w |$$, which can cause divergence. This makes the initial phase of training highly sensitive to the weight initialization and the initial LR. To stabilize training, we propose using a local LR $$\lambda^k$$ for each layer k, capped at the global LR $$\gamma$$:

$\lambda^k = \min (\gamma, \eta * \frac{| w^k |}{| g^k |} )$

where $$\eta < 1$$ is the LARC “trust” coefficient. The coefficient $$\eta$$ increases monotonically with the batch size B.
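As a rough illustration of the clipping rule, the sketch below computes the local LR for each layer and applies a plain SGD update with it. It is not the OpenSeq2Seq implementation: the function names, the list-based weight representation, and the small `eps` term guarding against division by zero are all assumptions made for this example.

```python
import math


def l2_norm(v):
    """L2-norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def larc_local_lr(w, g, global_lr, eta, eps=1e-12):
    """Local LR for one layer: min(gamma, eta * |w| / |g|).

    eps is an assumption of this sketch (not part of the formula);
    it only prevents division by zero when the gradient vanishes.
    """
    return min(global_lr, eta * l2_norm(w) / (l2_norm(g) + eps))


def larc_sgd_step(layers, grads, global_lr, eta):
    """Apply one SGD update per layer, each with its own clipped LR."""
    new_layers = []
    for w, g in zip(layers, grads):
        lr_k = larc_local_lr(w, g, global_lr, eta)
        new_layers.append([wj - lr_k * gj for wj, gj in zip(w, g)])
    return new_layers


# Layer A has large gradients relative to its weights, layer B has tiny ones.
layers = [[3.0, 4.0], [3.0, 4.0]]
grads = [[30.0, 40.0], [0.003, 0.004]]

lr_a = larc_local_lr(layers[0], grads[0], global_lr=0.1, eta=0.002)
lr_b = larc_local_lr(layers[1], grads[1], global_lr=0.1, eta=0.002)
updated = larc_sgd_step(layers, grads, global_lr=0.1, eta=0.002)
```

For layer A the ratio of norms is 0.1, so the local LR drops to about 0.0002, far below the global LR of 0.1. For layer B the ratio is 1000, so the scaled value exceeds the global LR and the update simply uses 0.1. Note that in this sketch a layer whose weights are all zero would get a local LR of zero; a production implementation may need to handle that case separately.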

To use LARC, add the following lines to the model configuration:

    "larc_params": {
        "larc_eta": 0.002,
    }