TorchFort Configuration Files

The TorchFort library relies on a user-defined YAML configuration file to define several aspects of the training procedure, with specific blocks to control:

  • general properties

  • model properties

  • optimizer properties

  • loss function properties

  • learning rate schedule properties

The following sections define each configuration block and available options.

Common

The following sections list configuration file blocks common to supervised learning and reinforcement learning configuration files.

General Properties

The block in the configuration file defining general properties takes the following structure:

general:
  <option>: <value>

The following table lists the available options:

| Option | Data Type | Description |
|---|---|---|
| report_frequency | integer | frequency of reported TorchFort training/validation output lines to terminal (default = 0) |
| enable_wandb_hook | boolean | flag to control whether the wandb hook is active (default = false) |
| verbose | boolean | flag to control verbose output from TorchFort (default = false) |

For more information about the wandb hook, see Weights and Biases Support.
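For illustration, a populated general block might look as follows (the values shown are arbitrary examples):

general:
  report_frequency: 100
  enable_wandb_hook: true
  verbose: false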

Optimizer Properties

The block in the configuration file defining optimizer properties takes the following structure:

optimizer:
  type: <optimizer_type>
  parameters:
    <option>: <value>

The following table lists the available optimizer types:

| Optimizer Type | Description |
|---|---|
| sgd | Stochastic Gradient Descent optimizer |
| adam | Adam optimizer |

The following table lists the available options by optimizer type:

| Optimizer Type | Option | Data Type | Description |
|---|---|---|---|
| sgd | learning_rate | float | learning rate (default = 0.001) |
| | momentum | float | momentum factor (default = 0.0) |
| | dampening | float | dampening for momentum (default = 0.0) |
| | weight_decay | float | weight decay/L2 penalty (default = 0.0) |
| | nesterov | boolean | enables Nesterov momentum (default = false) |
| adam | learning_rate | float | learning rate (default = 0.001) |
| | beta1 | float | coefficient used for computing running average of gradient (default = 0.9) |
| | beta2 | float | coefficient used for computing running average of squared gradient (default = 0.999) |
| | weight_decay | float | weight decay/L2 penalty (default = 0.0) |
| | eps | float | term added to denominator to improve numerical stability (default = 1e-8) |
| | amsgrad | boolean | whether to use AMSGrad variant (default = false) |
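For illustration, an optimizer block selecting adam and spelling out its parameters might look as follows (all values shown match the defaults listed above):

optimizer:
  type: adam
  parameters:
    learning_rate: 0.001
    beta1: 0.9
    beta2: 0.999
    weight_decay: 0.0
    eps: 1e-8
    amsgrad: false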

Learning Rate Schedule Properties

The block in the configuration file defining learning rate schedule properties takes the following structure:

lr_scheduler:
  type: <schedule_type>
  parameters:
    <option>: <value>

The following table lists the available schedule types:

| Schedule Type | Description |
|---|---|
| step | Decays learning rate by multiplicative factor every step_size training iterations |
| multistep | Decays learning rate by multiplicative factor at user-defined training iteration milestones |
| polynomial | Decays learning rate by polynomial function |
| cosine_annealing | Decays learning rate using cosine annealing schedule. See the PyTorch documentation of torch.optim.lr_scheduler.CosineAnnealingLR for more details. |

The following table lists the available options by schedule type:

| Schedule Type | Option | Data Type | Description |
|---|---|---|---|
| step | step_size | integer | Number of training steps between learning rate decays |
| | gamma | float | Multiplicative factor of learning rate decay (default = 0.1) |
| multistep | milestones | list of integers | Training step milestones for learning rate decay |
| | gamma | float | Multiplicative factor of learning rate decay (default = 0.1) |
| polynomial | total_iters | integer | Number of training iterations to decay the learning rate |
| | power | float | The power of the polynomial (default = 1.0) |
| cosine_annealing | eta_min | float | Minimum learning rate (default = 0.0) |
| | T_max | float | Maximum number of iterations for decay |
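For illustration, an lr_scheduler block using the multistep schedule might look as follows (the milestone values are arbitrary examples):

lr_scheduler:
  type: multistep
  parameters:
    milestones: [1000, 5000, 10000]
    gamma: 0.1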

Supervised Learning

The following sections list configuration file blocks specific to supervised learning configuration files.

Model Properties

The block in the configuration file defining model properties takes the following structure:

model:
  type: <model_type>
  parameters:
    <option>: <value>

The following table lists the available model types:

| Model Type | Description |
|---|---|
| torchscript | Load a model from an exported TorchScript file |
| mlp | Use built-in MLP model |

The following table lists the available options by model type:

| Model Type | Option | Data Type | Description |
|---|---|---|---|
| torchscript | filename | string | path to TorchScript exported model file |
| mlp | layer_sizes | list of integers | sequence of input/output sizes for linear layers, e.g., [16, 32, 4] will create two linear layers with input/output sizes of 16/32 for the first layer and 32/4 for the second layer |
| | dropout | float | probability of an element to be zeroed in dropout layers (default = 0.0) |
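For illustration, a model block using the built-in MLP model might look as follows (the layer sizes and dropout probability are arbitrary examples):

model:
  type: mlp
  parameters:
    layer_sizes: [16, 32, 4]
    dropout: 0.1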

Loss Properties

The block in the configuration file defining loss properties takes the following structure:

loss:
  type: <loss_type>
  parameters:
    <option>: <value>

The following table lists the available loss types:

| Loss Type | Description |
|---|---|
| l1 | L1/Mean Absolute Error |
| mse | Mean Squared Error |

The following table lists the available options by loss type:

| Loss Type | Option | Data Type | Description |
|---|---|---|---|
| l1 | reduction | string | Specifies type of reduction to apply to output. Can be either none, mean, or sum. (default = mean) |
| mse | reduction | string | Specifies type of reduction to apply to output. Can be either none, mean, or sum. (default = mean) |
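For illustration, a loss block selecting mean squared error with the default reduction might look as follows:

loss:
  type: mse
  parameters:
    reduction: mean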

Reinforcement Learning

The following sections list configuration file blocks specific to reinforcement learning system configuration files.

Reinforcement Learning Training Algorithm Properties

The block in the configuration file defining algorithm properties takes the following structure:

algorithm:
  type: <algorithm_type>
  parameters:
    <option>: <value>

The following table lists the available algorithm types:

| Algorithm Type | Description |
|---|---|
| ddpg | Deep Deterministic Policy Gradient. See the DDPG documentation by OpenAI for details |
| td3 | Twin Delayed DDPG. See the TD3 documentation by OpenAI for details |
| sac | Soft Actor-Critic. See the SAC documentation by OpenAI for details |

The following table lists the available options by algorithm type:

| Algorithm Type | Option | Data Type | Description |
|---|---|---|---|
| ddpg | batch_size | integer | batch size used in training |
| | nstep | integer | number of steps for N-step training |
| | nstep_reward_reduction | string | reduction mode for N-step training (see below) |
| | gamma | float | discount factor |
| | rho | float | weight averaging factor for the target weights (in some frameworks called rho = 1 - tau) |
| td3 | batch_size | integer | batch size used in training |
| | nstep | integer | number of steps for N-step training |
| | nstep_reward_reduction | string | reduction mode for N-step training (see below) |
| | gamma | float | discount factor |
| | rho | float | weight averaging factor for the target weights (in some frameworks called rho = 1 - tau) |
| | num_critics | integer | number of critic networks used |
| | policy_lag | integer | update frequency for the policy in units of critic updates |
| sac | batch_size | integer | batch size used in training |
| | nstep | integer | number of steps for N-step training |
| | nstep_reward_reduction | string | reduction mode for N-step training (see below) |
| | gamma | float | discount factor |
| | alpha | float | entropy regularization coefficient |
| | rho | float | weight averaging factor for the target weights (in some frameworks called rho = 1 - tau) |
| | policy_lag | integer | update frequency for the policy in units of value updates |
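For illustration, an algorithm block for td3 might look as follows (the values shown are arbitrary examples, not tuned recommendations):

algorithm:
  type: td3
  parameters:
    batch_size: 256
    nstep: 1
    nstep_reward_reduction: sum
    gamma: 0.99
    rho: 0.99
    num_critics: 2
    policy_lag: 2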

The parameter nstep_reward_reduction defines how the reward is accumulated over N-step rollouts. The options are summarized in the table below, where \(N\) is the value of the nstep parameter described above:

| Reduction Mode | Description |
|---|---|
| sum or sum_no_skip | \(r = \sum_{i=1}^{N^\ast} \gamma^{i-1} r_i\) |
| mean or mean_no_skip | \(r = \frac{1}{N^\ast} \sum_{i=1}^{N^\ast} \gamma^{i-1} r_i\) |
| weighted_mean or weighted_mean_no_skip | \(r = \sum_{i=1}^{N^\ast} \gamma^{i-1} r_i \,\big/ \sum_{k=1}^{N^\ast} \gamma^{k-1}\) |

Here, the value of \(N^\ast\) depends on whether reduction with or without skipping is chosen. In the former case, \(N^\ast = N\) and the replay buffer searches for trajectories with at least \(N\) steps. If a trajectory terminates earlier, the sample is skipped and a new one is drawn. If all trajectories are shorter than \(N\) steps, the replay buffer will never find a suitable sample.

In such cases, it is useful to use the modes with the additional suffix _no_skip. With these modes, \(N^\ast\) in the formulas above is equal to the minimum of \(N\) and the number of steps remaining until the end of the trajectory. The regular and no-skip modes are each useful in different situations, so it is important to be clear about how the reward structure should be designed in order to achieve the desired goals.
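As a concrete illustration of the formulas above: for \(N^\ast = 3\), \(\gamma = 0.9\), and rewards \(r_1 = r_2 = r_3 = 1\), sum yields \(r = 1 + 0.9 + 0.81 = 2.71\), mean yields \(r = 2.71 / 3 \approx 0.9\), and weighted_mean yields \(r = 2.71 / (1 + 0.9 + 0.81) = 1\).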

Replay Buffer Properties

The block in the configuration file defining replay buffer properties takes the following structure:

replay_buffer:
  type: <replay_buffer_type>
  parameters:
    <option>: <value>

Currently, only type uniform is supported. The following table lists the available options:

| Replay Buffer Type | Option | Data Type | Description |
|---|---|---|---|
| uniform | min_size | integer | Minimum number of samples before buffer is ready for training |
| | max_size | integer | Maximum capacity |
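For illustration, a replay_buffer block might look as follows (the buffer sizes are arbitrary examples):

replay_buffer:
  type: uniform
  parameters:
    min_size: 1000
    max_size: 100000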

Action Properties

The block in the configuration file defining action properties takes the following structure:

action:
  type: <action_type>
  parameters:
    <option>: <value>

The following table lists the available options by action type for the ddpg and td3 algorithms:

| Action Type | Option | Data Type | Description |
|---|---|---|---|
| space_noise or parameter_noise | a_low | float | lower bound for action value |
| | a_high | float | upper bound for action value |
| | clip | float | clip value for training noise |
| | sigma_train | float | standard deviation for Gaussian training noise |
| | sigma_explore | float | standard deviation for Gaussian exploration noise |
| | adaptive | boolean | flag to specify whether the standard deviation should be adaptive |
| space_noise_ou or parameter_noise_ou | a_low | float | lower bound for action value |
| | a_high | float | upper bound for action value |
| | clip | float | clip value for training noise |
| | sigma_train | float | standard deviation for Ornstein-Uhlenbeck training noise |
| | sigma_explore | float | standard deviation for Ornstein-Uhlenbeck exploration noise |
| | xi | float | mean reversion parameter for Ornstein-Uhlenbeck noise |
| | dt | float | time-step parameter for Ornstein-Uhlenbeck noise |
| | adaptive | boolean | flag to specify whether the standard deviation should be adaptive |

The meaning of most of these parameters should be evident from the details of the implementations of the various RL algorithms linked above. However, some parameters require a more detailed explanation. In general, the suffix _ou refers to stateful noise of Ornstein-Uhlenbeck type with zero drift. This noise type is often used when correlation between time steps is desired and is therefore popular in reinforcement learning. See the Wikipedia page on the Ornstein-Uhlenbeck process for details.

The prefix space refers to applying the noise to the predicted action directly. For example, if \(p\) is our (deterministic) policy function, an exploration action using the space noise type is obtained by computing

\[\tilde{a} = \mathrm{clip}(p(\theta, s) + \mathcal{N}(0,\sigma_\mathrm{explore}), a_\mathrm{low}, a_\mathrm{high})\]

for any input state \(s\) and policy weights \(\theta\). In the case of parameter noise, the noise is applied to each weight of \(p\) instead. Hence, the noised action is computed via

\[\tilde{a} = \mathrm{clip}(p(\theta + \mathcal{N}(0,\sigma_\mathrm{explore}), s), a_\mathrm{low}, a_\mathrm{high})\]

The parameter adaptive specifies whether the noise scale should be taken relative to the magnitude of the actions (for space noise) or of the weights (for parameter noise). In the case of the former, this means that

\[\begin{aligned}
a &= p(\theta, s)\\
\tilde{a} &= \mathrm{clip}(a + \sigma_\mathrm{explore}\,\mathcal{N}(0,\|a\|),\ a_\mathrm{low},\ a_\mathrm{high})
\end{aligned}\]

and analogously for parameter noise.

Which noise type and parameters work best depends strongly on the behavior of the environment, so we cannot give a general recommendation.

For algorithm type sac, only the action bounds are supported, as the noise is built into the algorithm and cannot be customized.
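For illustration, an action block using Gaussian space noise for ddpg or td3 might look as follows (the values shown are arbitrary examples):

action:
  type: space_noise
  parameters:
    a_low: -1.0
    a_high: 1.0
    clip: 0.5
    sigma_train: 0.2
    sigma_explore: 0.1
    adaptive: false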

Policy and Critic Properties

The blocks in the configuration file defining model properties for the actor/policy and the critic/value function are similar to the supervised learning case (see Model Properties). TorchFort supports different model properties for the policy and the critic. The block configuration looks as follows:

critic_model:
  type: <critic_model_type>
  parameters:
    <option>: <value>

policy_model:
  type: <policy_model_type>
  parameters:
    <option>: <value>

Refer to Model Properties for the available model types and options.

Note

For algorithms which use multiple critic networks, such as TD3, the critic model is copied internally num_critics times and the weights are randomly initialized for each of these models independently.

Note

In the case of the SAC algorithm, make sure that the policy network returns not only the mean action value tensor but also the log-sigma tensor. As an example, see the policy function implementation of Stable Baselines.
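For illustration, a configuration loading both networks from TorchScript files might look as follows (the file names are hypothetical):

critic_model:
  type: torchscript
  parameters:
    filename: critic.pt

policy_model:
  type: torchscript
  parameters:
    filename: policy.pt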

Learning Rate Schedule Properties

For reinforcement learning, TorchFort supports different learning rate schedules for policy and critic. The block configuration looks as follows:

critic_lr_scheduler:
  type: <schedule_type>
  parameters:
    <option>: <value>

policy_lr_scheduler:
  type: <schedule_type>
  parameters:
    <option>: <value>

Refer to Learning Rate Schedule Properties for the available schedule types and options.
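For illustration, a configuration using different schedules for critic and policy might look as follows (the values shown are arbitrary examples):

critic_lr_scheduler:
  type: step
  parameters:
    step_size: 1000
    gamma: 0.1

policy_lr_scheduler:
  type: cosine_annealing
  parameters:
    eta_min: 0.0
    T_max: 10000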