# optimizers

## optimizers

Optimizer ops for use in layers and tf.learn.

optimizers.optimizers._clip_by_global_norm(t_list, clip_norm, use_norm, name=None)

Clips values of multiple tensors by the ratio of the sum of their norms. Given a tuple or list of tensors t_list, and a clipping ratio clip_norm, this operation returns a list of clipped tensors list_clipped and the global norm (global_norm) of all tensors in t_list. The global norm is expected to be pre-computed and passed as use_norm. To perform the clipping, the values t_list[i] are set to:

t_list[i] * clip_norm / max(global_norm, clip_norm)
where:
global_norm = sqrt(sum([l2norm(t)**2 for t in t_list]))

If clip_norm > global_norm then the entries in t_list remain as they are, otherwise they’re all shrunk by the global ratio. Any of the entries of t_list that are of type None are ignored. This is the correct way to perform gradient clipping (for example, see [Pascanu et al., 2012](http://arxiv.org/abs/1211.5063) ([pdf](http://arxiv.org/pdf/1211.5063.pdf))). However, it is slower than clip_by_norm() because all the parameters must be ready before the clipping operation can be performed.

Parameters:

- t_list – A tuple or list of mixed Tensors, IndexedSlices, or None.
- clip_norm – A 0-D (scalar) Tensor > 0. The clipping ratio.
- use_norm – A 0-D (scalar) Tensor of type float (optional). The global norm to use. If not provided, global_norm() is used to compute the norm.
- name – A name for the operation (optional).

Returns:

- list_clipped – A list of Tensors of the same type as t_list.
- global_norm – A 0-D (scalar) Tensor representing the global norm.

Raises:

- TypeError – If t_list is not a sequence.
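The clipping rule above can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not the library implementation: the function name `clip_by_global_norm_sketch` is hypothetical, and each "tensor" is assumed to be a flat list of floats.

```python
import math


def clip_by_global_norm_sketch(t_list, clip_norm, use_norm=None):
    """Illustrative stand-in: each 'tensor' is a flat list of floats or None."""
    if not isinstance(t_list, (list, tuple)):
        raise TypeError("t_list should be a sequence")
    if use_norm is None:
        # global_norm = sqrt(sum(l2norm(t)**2 for t in t_list)); None entries are ignored.
        use_norm = math.sqrt(
            sum(x * x for t in t_list if t is not None for x in t)
        )
    # The ratio is 1 when clip_norm >= global_norm, so values are left unchanged.
    scale = clip_norm / max(use_norm, clip_norm)
    list_clipped = [
        None if t is None else [x * scale for x in t] for t in t_list
    ]
    return list_clipped, use_norm


grads = [[3.0, 4.0], None, [12.0]]
clipped, global_norm = clip_by_global_norm_sketch(grads, clip_norm=6.5)
# global_norm is 13.0, so every entry is scaled by 6.5 / 13.0 = 0.5:
# clipped == [[1.5, 2.0], None, [6.0]]
```

Because the scale factor depends on all entries at once, the relative directions of the gradients are preserved, which is the property the text contrasts with per-tensor clipping.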
optimizers.optimizers._clip_gradients_by_norm(grads_and_vars, clip_gradients)

optimizers.optimizers.get_regularization_loss(scope=None, name='total_regularization_loss')

Gets the total regularization loss.

Parameters:

- scope – An optional scope name for filtering the losses to return.
- name – The name of the returned tensor.

Returns: A scalar regularization loss.
optimizers.optimizers.optimize_loss(loss, optimizer, optimizer_params, learning_rate_decay_fn, var_list=None, dtype=tf.float32, clip_gradients=None, summaries=None, larc_params=None, loss_scaling=1.0, loss_scaling_params=None, on_horovod=False, iter_size=1, skip_update_ph=None, model=None)

Given loss and parameters for optimizer, returns a training op.

Parameters:

- loss – Scalar Tensor.
- optimizer – String or class of the optimizer used as the trainer. A string should be the name of an optimizer, such as ‘SGD’, ‘Adam’, or ‘Adagrad’; the full list is in the OPTIMIZER_CLS_NAMES constant. A class should be a sub-class of tf.Optimizer that implements the compute_gradients and apply_gradients functions.
- optimizer_params – Parameters of the optimizer.
- var_list – List of trainable variables. Can be used to freeze certain trainable variables by excluding them from this list. If set to None, all trainable variables will be optimized.
- dtype – Model dtype (tf.float16, tf.float32 or “mixed”).
- learning_rate_decay_fn – Function that takes the global_step Tensor and returns a Tensor. Can be used to implement any learning rate decay function, for example tf.train.exponential_decay. Ignored if learning_rate is not supplied.
- clip_gradients – Float, the maximum gradient norm to clip to.
- summaries – List of internal quantities to visualize on TensorBoard. If not set, only the loss and the learning rate will be reported. The complete list is in OPTIMIZER_SUMMARIES.
- larc_params – If not None, LARC re-scaling will be applied with the corresponding parameters.
- loss_scaling – Float or string. If a float, static loss scaling is applied. If a string, the corresponding automatic loss scaling algorithm is used; it must be one of ‘Backoff’ or ‘LogMax’ (case insensitive). Only used when dtype="mixed".
- on_horovod – Whether the model is run on Horovod.

Returns: The training op.
optimizers.optimizers.post_process_gradients(grads_and_vars, summaries, lr, clip_gradients, larc_params)

Applies post processing to gradients, i.e. clipping, LARC, summaries.

optimizers.optimizers.reduce_gradients(grads_and_vars, on_horovod, model=None)

## mp_wrapper

class optimizers.mp_wrapper.MixedPrecisionOptimizerWrapper(optimizer, loss_scale=None)

Bases: tensorflow.python.training.optimizer.Optimizer

apply_gradients(grads_and_vars, global_step=None, name=None)

This is the second part of minimize(). It returns an Operation that applies gradients.

Parameters:

- grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().
- global_step – Optional Variable to increment by one after the variables have been updated.
- name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.

Returns: An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.

Raises:

- TypeError – If grads_and_vars is malformed.
- ValueError – If none of the variables have gradients.
- RuntimeError – If you should use _distributed_apply() instead.
compute_gradients(loss, var_list=None, gate_gradients=1, aggregation_method=None, colocate_gradients_with_ops=False, grad_loss=None)

Compute gradients of loss for the variables in var_list.

This is the first part of minimize(). It returns a list of (gradient, variable) pairs where “gradient” is the gradient for “variable”. Note that “gradient” can be a Tensor, an IndexedSlices, or None if there is no gradient for the given variable.

Parameters:

- loss – A Tensor containing the value to minimize, or a callable taking no arguments that returns the value to minimize. When eager execution is enabled it must be a callable.
- var_list – Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
- gate_gradients – How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
- aggregation_method – Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
- colocate_gradients_with_ops – If True, try colocating gradients with the corresponding op.
- grad_loss – Optional. A Tensor holding the gradient computed for loss.

Returns: A list of (gradient, variable) pairs. Variable is always present, but gradient can be None.

Raises:

- TypeError – If var_list contains anything other than Variable objects.
- ValueError – If some arguments are invalid.
- RuntimeError – If called with eager execution enabled and loss is not callable.

Eager compatibility: When eager execution is enabled, gate_gradients, aggregation_method, and colocate_gradients_with_ops are ignored.

optimizers.mp_wrapper.mp_regularizer_wrapper(regularizer)

## automatic_loss_scaler

class optimizers.automatic_loss_scaler.AutomaticLossScaler(algorithm='Backoff', params=None)

Bases: object

SUPPORTED_ALGOS = ['backoff', 'logmax']
static check_grads(grads_and_vars)
loss_scale
update_op(has_nan, amax)
class optimizers.automatic_loss_scaler.BackoffScaler(params)

Bases: object

loss_scale
update_op(has_nan, amax)
class optimizers.automatic_loss_scaler.LogMaxScaler(params)

Bases: object

loss_scale
update_op(has_nan, amax)

## lr_policies

Module containing various learning rate policies. A learning rate policy can be any function that takes arbitrary arguments from the config (with an additional global_step variable provided automatically) and returns the learning rate value for the current step.

optimizers.lr_policies.exp_decay(global_step, learning_rate, decay_steps, decay_rate, use_staircase_decay, begin_decay_at=0, min_lr=0.0)

Exponential decay learning rate policy. This function is equivalent to tensorflow.train.exponential_decay with some additional functionality. Namely, it adds the begin_decay_at and min_lr parameters, which are the first step at which the learning rate starts decaying and the minimal value of the learning rate, respectively.

Parameters:

- global_step – global step TensorFlow tensor.
- learning_rate (float) – initial learning rate to use.
- decay_steps (int) – number of steps to apply decay for.
- decay_rate (float) – the rate of the decay.
- use_staircase_decay (bool) – whether to use staircase decay.
- begin_decay_at (int) – the first step to start decaying learning rate.
- min_lr (float) – minimal value of the learning rate.

Returns: learning rate at step global_step.
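As a rough illustration of how begin_decay_at and min_lr modify plain exponential decay, here is a pure-Python sketch on integer steps. It is not the library code: the name `exp_decay_sketch` is hypothetical, and it assumes that the decay counter starts from zero at begin_decay_at, which the description above does not state explicitly.

```python
def exp_decay_sketch(step, learning_rate, decay_steps, decay_rate,
                     use_staircase_decay, begin_decay_at=0, min_lr=0.0):
    """Illustrative scalar version of an exponential decay policy."""
    if step < begin_decay_at:
        # Before decay begins, the initial learning rate is used unchanged.
        return learning_rate
    # Assumption: steps are counted relative to begin_decay_at.
    elapsed = step - begin_decay_at
    if use_staircase_decay:
        exponent = elapsed // decay_steps  # decay in discrete jumps
    else:
        exponent = elapsed / decay_steps   # decay continuously
    # min_lr acts as a floor on the decayed value.
    return max(min_lr, learning_rate * decay_rate ** exponent)


for step in (0, 100, 300, 1100):
    lr = exp_decay_sketch(step, 0.1, decay_steps=100, decay_rate=0.5,
                          use_staircase_decay=True, begin_decay_at=100,
                          min_lr=0.001)
    print(step, lr)
```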
optimizers.lr_policies.fixed_lr(global_step, learning_rate)

Fixed learning rate policy. This function always returns learning_rate, ignoring global_step value.

Parameters:

- global_step – global step TensorFlow tensor (ignored for this policy).
- learning_rate (float) – fixed learning rate to use.

Returns: learning rate at step global_step.

optimizers.lr_policies.inv_poly_decay(global_step, learning_rate, decay_steps, min_lr, power=1.0, begin_decay_at=0, warmup_steps=0, name='learning_rate')

Inverse poly decay learning rate policy:

lr = initial lr / (1 + decay * t)^power

This function is similar to tensorflow.train.inv_time_decay with some additional functionality. Namely, it adds:

- min_lr – end learning rate with 0.00001
- power – power
- begin_decay_at – first step to start decaying learning rate.

Parameters:

- global_step – global step TensorFlow tensor.
- learning_rate (float) – initial learning rate to use.
- decay_steps (int) – number of steps to apply decay for.
- power (float) – power for inv_time_decay.
- begin_decay_at (int) – the first step to start decaying learning rate.
- min_lr (float) – minimal value of the learning rate (same as end_learning_rate TensorFlow parameter).

Returns: learning rate at step global_step.

optimizers.lr_policies.piecewise_constant(global_step, learning_rate, boundaries, decay_rates, steps_per_epoch=None)

Piecewise constant learning rate decay. When defined in the config, only boundaries and decay_rates need to be provided (other parameters are automatically populated by Model class). boundaries are treated as epochs if num_epochs is provided in the config, otherwise treated as steps.

Parameters:

- global_step – global step TensorFlow tensor.
- learning_rate (float) – initial learning rate to use.
- boundaries (list) – defined either in steps (if steps_per_epoch=None) or in epochs (if the steps_per_epoch parameter is defined).
- decay_rates – multiplier of the initial learning rate for each boundary.
- steps_per_epoch – number of batches in one training epoch. If provided, boundaries are treated as epochs, otherwise as steps.

Returns: learning rate at step global_step.
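The interaction between boundaries, decay_rates and steps_per_epoch can be sketched in plain Python. This is an illustration under stated assumptions, not the library code: the name `piecewise_constant_sketch` is hypothetical, and the new rate is assumed to take effect on the first step strictly after a boundary.

```python
def piecewise_constant_sketch(step, learning_rate, boundaries, decay_rates,
                              steps_per_epoch=None):
    """Illustrative scalar version of a piecewise constant policy."""
    if steps_per_epoch is not None:
        # Boundaries given in epochs are converted to steps.
        boundaries = [epoch * steps_per_epoch for epoch in boundaries]
    lr = learning_rate
    for boundary, rate in zip(boundaries, decay_rates):
        # Assumption: a boundary is crossed once step exceeds it.
        if step > boundary:
            # Each rate multiplies the initial learning rate, not the previous value.
            lr = learning_rate * rate
    return lr


# Boundaries at epochs 2 and 4 with 10 steps per epoch, i.e. steps 20 and 40.
for step in (5, 25, 45):
    print(step, piecewise_constant_sketch(step, 0.1, [2, 4], [0.1, 0.01],
                                          steps_per_epoch=10))
```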
optimizers.lr_policies.poly_decay(global_step, learning_rate, decay_steps, power=1.0, begin_decay_at=0, min_lr=0.0, warmup_steps=0)

Polynomial decay learning rate policy. This function is equivalent to tensorflow.train.polynomial_decay with some additional functionality. Namely, it adds the begin_decay_at parameter, which is the first step at which the learning rate starts decaying.

Parameters:

- global_step – global step TensorFlow tensor.
- learning_rate (float) – initial learning rate to use.
- decay_steps (int) – number of steps to apply decay for.
- power (float) – power for polynomial decay.
- begin_decay_at (int) – the first step to start decaying learning rate.
- min_lr (float) – minimal value of the learning rate (same as end_learning_rate TensorFlow parameter).

Returns: learning rate at step global_step.
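A scalar sketch of polynomial decay with a delayed start may help make the roles of decay_steps, power and min_lr concrete. It follows the usual polynomial decay formula, (lr - end_lr) * (1 - t / decay_steps)^power + end_lr, and is not the library code: the name `poly_decay_sketch` is hypothetical, steps are assumed to be counted relative to begin_decay_at, and warmup_steps is left out because its behaviour is not described above.

```python
def poly_decay_sketch(step, learning_rate, decay_steps, power=1.0,
                      begin_decay_at=0, min_lr=0.0):
    """Illustrative scalar version of a polynomial decay policy."""
    if step < begin_decay_at:
        # Decay has not started yet.
        return learning_rate
    # Assumption: steps are counted from begin_decay_at and clamped to decay_steps.
    t = min(step - begin_decay_at, decay_steps)
    fraction_left = 1.0 - t / decay_steps
    # Interpolate from learning_rate down to min_lr.
    return (learning_rate - min_lr) * fraction_left ** power + min_lr


for step in (0, 60, 1000):
    print(step, poly_decay_sketch(step, 1.0, decay_steps=100, power=2.0,
                                  begin_decay_at=10, min_lr=0.1))
```

Once decay_steps steps have elapsed after begin_decay_at, the value stays at min_lr.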
optimizers.lr_policies.transformer_policy(global_step, learning_rate, d_model, warmup_steps, max_lr=None, coefficient=1.0, dtype=tf.float32)

Transformer’s learning rate policy from https://arxiv.org/pdf/1706.03762.pdf (also called the “noam” learning rate decay scheme), with a hat: the learning rate is capped at max_lr.

Parameters:

- global_step – global step TensorFlow tensor.
- learning_rate (float) – initial learning rate to use.
- d_model (int) – model dimensionality.
- warmup_steps (int) – number of warm-up steps.
- max_lr (float) – maximal learning rate, i.e. hat.
- coefficient (float) – optimizer adjustment. Recommended 0.002 if using “Adam” else 1.0.
- dtype – dtype for this policy.

Returns: learning rate at step global_step.
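The schedule in the cited paper is d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)): a linear warm-up followed by inverse square root decay. The sketch below applies that formula in plain Python and is not the library code. The name `transformer_policy_sketch` is hypothetical; treating learning_rate and coefficient as plain multipliers, and shifting the step by one so that step 0 is well defined, are both assumptions.

```python
def transformer_policy_sketch(step, learning_rate, d_model, warmup_steps,
                              max_lr=None, coefficient=1.0):
    """Illustrative scalar version of the "noam" schedule with a hat."""
    n = step + 1  # assumption: avoid 0 ** -0.5 at the first step
    # Linear warm-up until n == warmup_steps, inverse square root decay afterwards.
    schedule = min(n * warmup_steps ** -1.5, n ** -0.5)
    # Assumption: learning_rate and coefficient simply scale the schedule.
    lr = coefficient * learning_rate * d_model ** -0.5 * schedule
    if max_lr is not None:
        lr = min(max_lr, lr)  # the "hat"
    return lr


for step in (99, 999, 3999, 9999):
    print(step, transformer_policy_sketch(step, 1.0, d_model=512,
                                          warmup_steps=4000))
```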