apex.optimizers

class apex.optimizers.FusedAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, eps_inside_sqrt=False)[source]

Implements Adam algorithm. Currently GPU-only. Requires Apex to be installed via python setup.py install --cuda_ext --cpp_ext.

It has been proposed in Adam: A Method for Stochastic Optimization.

Parameters:
  • params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
  • lr (float, optional) – learning rate. (default: 1e-3)
  • betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square. (default: (0.9, 0.999))
  • eps (float, optional) – term added to the denominator to improve numerical stability. (default: 1e-8)
  • weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
  • amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond. NOT SUPPORTED in FusedAdam! (default: False)
  • eps_inside_sqrt (boolean, optional) – in the ‘update parameters’ step, adds eps to the bias-corrected second moment estimate before taking its square root, instead of adding it to the square root of the second moment estimate as in the original paper. (default: False)
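
Example (a minimal sketch, assuming Apex was built with the CUDA/C++ extensions, a CUDA device is available, and that model, input, target, and loss_fn are placeholders for your own module, data, and loss function):

    >>> import apex
    >>> optimizer = apex.optimizers.FusedAdam(model.parameters(), lr=1e-3,
    ...                                       betas=(0.9, 0.999), eps=1e-8,
    ...                                       weight_decay=0.0)
    >>> optimizer.zero_grad()
    >>> loss_fn(model(input), target).backward()
    >>> optimizer.step()
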
step(closure=None, grads=None, output_params=None, scale=1.0)[source]

Performs a single optimization step.

Parameters:
  • closure (callable, optional) – A closure that reevaluates the model and returns the loss.
  • grads (list of tensors, optional) – weight gradients to use for the optimizer update. If the gradients have type torch.half, the parameters are expected to have type torch.float. (default: None)
  • output_params (list of tensors, optional) – a reduced-precision copy of the updated weights, written out in addition to the regular updated weights. Must be of the same type as the gradients. (default: None)
  • scale (float, optional) – factor to divide gradient tensor values by before applying them to the weights. (default: 1.0)
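
Example of the explicit-gradient call path (a hypothetical sketch for manual mixed-precision training, assuming fp32_params holds the master torch.float parameters, fp16_grads holds the matching torch.half gradients, fp16_params is the reduced-precision weight copy to write out, and loss_scale is the factor the loss was multiplied by before backward):

    >>> optimizer = apex.optimizers.FusedAdam(fp32_params, lr=1e-3)
    >>> # grads and output_params are lists aligned with the optimizer's parameters
    >>> optimizer.step(grads=fp16_grads, output_params=fp16_params, scale=loss_scale)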