TorchFort Configuration Files
The TorchFort library relies on a user-defined YAML configuration file to define several aspects of the training procedure, with specific blocks to control:
general properties
model properties
optimizer properties
loss function properties
learning rate schedule properties
The following sections define each configuration block and available options.
Common
The following sections list configuration file blocks common to supervised learning and reinforcement learning configuration files.
General Properties
The block in the configuration file defining general properties takes the following structure:
general:
<option> = <value>
The following table lists the available options:
Option |
Data Type |
Description |
|---|---|---|
|
integer |
frequency of reported TorchFort training/validation output lines to terminal (default = |
|
boolean |
flag to control whether wandb hook is active (default = |
|
boolean |
flag to control verbose output from TorchFort (default = |
|
boolean |
flag to enable CUDA graph capture for training and inference (default = |
For more information about the wandb hook, see Weights and Biases Support.
CUDA Graphs
When enable_cuda_graphs is set to true, TorchFort will capture CUDA graphs for the forward pass (inference)
and the forward + loss + backward pass (training). CUDA graphs can significantly reduce kernel launch overhead
and improve performance for models with many small operations.
Requirements and limitations:
Input tensors must be on GPU and must have consistent data pointers, shapes, and dtypes across all training/inference calls with the captured model. If inputs change after graph capture, an error will be thrown.
The optimizer step and learning rate scheduler updates are not captured in the graph.
A warmup period of 3 iterations is performed before graph capture to ensure stable execution.
Optimizer Properties
The block in the configuration file defining optimizer properties takes the following structure:
optimizer:
type: <optimizer_type>
parameters:
<option> = <value>
general:
<option> = value
The general block is optional.
The following table lists the available optimizer types:
Optimizer Type |
Description |
|---|---|
|
Stochastic Gradient Descent optimizer |
|
ADAM optimizer |
The following table lists the available parameter options by optimizer type:
Optimizer Type |
Option |
Data Type |
Description |
|---|---|---|---|
|
|
float |
learning rate (default = |
|
float |
mometum factor (default = |
|
|
float |
dampening for momentum (default = |
|
|
float |
weight decay/L2 penalty (default = |
|
|
boolean |
enables Nesterov momentum (default = |
|
|
|
float |
learning rate (default = |
|
float |
coefficient used for computing running average of gradient (default = |
|
|
float |
coefficient use for computing running average of square of gradient (default = |
|
|
float |
weight decay/L2 penalty (default = |
|
|
float |
term added to denominator to improve numerical stability (default = |
|
|
boolean |
whether to use AMSGrad variant (default = |
The following table lists the available general options:
Option |
Data Type |
Description |
|---|---|---|
|
integer |
number of training steps to accumulate gradients between optimizer steps (default = |
|
float |
maximum gradient norm for gradient clipping. A value of 0.0 means clipping is disabled (default = |
Learning Rate Schedule Properties
The block in the configuration file defining learning rate schedule properties takes the following structure:
lr_scheduler:
type: <schedule_type>
parameters:
<option> = <value>
The following table lists the available schedule types:
Schedule Type |
Description |
|---|---|
|
Decays learning rate by multiplicative factor every |
|
Decays learning rate by multiplicative factor at user-defined training iteration milestones |
|
Decays learning rate by polynomial function |
|
Decays learning rate using cosine annealing schedule. See PyTorch documentation of torch.optim.lr_scheduler.CosineAnnealingLR for more details. |
The following table lists the available options by schedule type:
Schedule Type |
Option |
Data Type |
Description |
|---|---|---|---|
|
|
integer |
Number of training steps between learning rate decay |
|
float |
Multiplicative factor of learning rate decay (default = |
|
|
|
list of integer |
Training step milestones for learning rate decay |
|
float |
Multiplicative factor of learning rate decay (default = |
|
|
|
integer |
Number of training iterations to decay the learning rate |
|
float |
The power of the polynomial (default = |
|
|
|
float |
Minumum learning rate (default = |
|
float |
Maximum number of iterations for decay |
Supervised Learning
The following sections list configuration file blocks specific to supervised learning configuration files.
Model Properties
The block in the configuration file defining model properties takes the following structure:
model:
type: <model_type>
parameters:
<option> = <value>
The following table lists the available model types:
Model Type |
Description |
|---|---|
|
Load a model from an exported TorchScript file |
|
Use built-in MLP model |
The following table lists the available options by model type:
Model Type |
Option |
Data Type |
Description |
|---|---|---|---|
|
|
string |
path to TorchScript exported model file |
|
|
list of integers |
sequence of input/output sizes for linear layers e.g., |
|
float |
probability of an element to be zeroed in dropout layers (default = |
|
|
bool |
if set, input tensors are reshaped from |
Loss Properties
The block in the configuration file defining loss properties takes the following structure:
loss:
type: <loss_type>
parameters:
<option> = <value>
The following table lists the available loss types:
Loss Type | Description |
||
|---|---|---|
|
Load a loss from an exported TorchScript file |
|
|
L1/Mean Average Error |
|
|
Mean Squared Error |
|
The following table lists the available options by loss type:
Loss Type |
Option |
Data Type |
Description |
|---|---|---|---|
|
|
string |
path to TorchScript exported loss file |
|
|
string |
Specifies type of reduction to apply to output. Can be either |
|
|
string |
Specifies type of reduction to apply to output. Can be either |
Reinforcement Learning
The following sections list configuration file blocks specific to reinforcement learning system configuration files.
Reinforcement Learning Training Algorithm Properties
The block in the configuration file defining algorithm properties takes the following structure:
algorithm:
type: <algorithm_type>
parameters:
<option> = <value>
The following table lists the available algorithm types:
Algorithm Type |
Description |
|---|---|
|
Deterministic Policy Gradient. See DDPG documentation by OpenAI for details |
|
Twin Delayed DDPG. See TD3 documentation by OpenAI for details |
|
Soft Actor Critic. See SAC documentation by OpenAI for details |
|
Proximal Policy Optimization. See PPO documentation by OpenAI for details |
The following table lists the available options by algorithm type:
The parameter nstep_reward_reduction defines how the reward is accumulated over N-step rollouts. The options are summarized in a table below (\(N\) is the value from parameter nstep described above):
Reduction Mode |
Description |
|---|---|
|
\(r = \sum_{i=1}^{N^\ast} \gamma^{i-1} r_i\) |
|
\(r = \sum_{i=1}^{N^\ast} \gamma^{i-1} r_i / N^\ast\) |
|
\(r = \sum_{i=1}^{N^\ast} \gamma^{i-1} r_i / (\sum_{k=1}^{N^\ast} \gamma^{k-1})\) |
Here, the value of \(N^\ast\) depends on whether reduction with or without skip is being chosen. In case of the former, \(N^\ast = N\) and the replay buffer is searching for trajectories with at least \(N\) steps. If the trajectory terminates earlier, the sample is skipped and a new one is searched. If all trajectories are shorter than \(N\) steps, the replay buffer will never find a suitable sample.
In this case, it is useful to use the modes with the additional suffix _no_skip. In this case, \(N^{\ast}\) in the formulas will be equal to the minimum of \(N\) and the number of steps needed to reach the end of the trajectory. The regular and no-skip modes are both useful in different occasions, so it is important to be clear about how the reward structure has to be designed in order to achieve the desired goals.
Replay and Rollout Buffer Properties
The block in the configuration file defining algorithm properties takes the following structure:
replay_buffer:
type: <replay_buffer_type>
parameters:
<option> = <value>
Currently, only type uniform is supported. The following table lists the available options:
Replay Buffer Type |
Option |
Data Type |
Description |
|---|---|---|---|
|
|
integer |
Minimum number of samples before buffer is ready for training |
|
integer |
Maximum capacity |
|
|
integer |
Number of environments |
Note that the effective sizes for each environment is \(\mathrm{min\_size} / \mathrm{n\_envs}\) and \(\mathrm{max\_size} / \mathrm{n\_envs}\). You need to ensure that you can store at least one sample for each environment. However, for better algorithm performance, it is highly advised to provide buffers which can store longer trajectories.
For on-policy algorithms, the block looks as follows:
rollout_buffer:
type: <rollout_buffer_type>
parameters:
<option> = <value>
Currently, only type gae_lambda (General Advantage Estimator) is supported. The following table lists the available options:
Replay Buffer Type |
Option |
Data Type |
Description |
|---|---|---|---|
|
|
integer |
Total number of samples before buffer is ready for training |
|
integer |
Number of environments |
As in the case of the replay buffer, the effective sizes for each environment is \(\mathrm{size} / \mathrm{n\_envs}\). Once the buffer is full (number of samples pushed to the buffer is equal to the buffer size), the general advantage is estimated by integrating the reward over the trajectories for each environment, applying discount factor gae_lambda specified during system construction. Training can therefore only start when the buffer is completely full and the advantage estimates are computed.
Actor Properties
The block in the configuration file defining actor properties takes the following structure:
actor:
type: <action_type>
parameters:
<option> = <value>
The following table lists the available options for every action type for ddpg and td3 algorithms:
The meaning for most of these parameters should be evident from looking at the details of the implementations for the various RL algorithms linked above.
However, some parameters require a more detailed explanation: in general, the suffix _ou refers to stateful noise of Ornstein-Uhlenbeck type with zero drift. This noise type is often used if correlation between time steps is desired and thus popular in reinforcement learning. Check out the wikipedia page for details.
The prefix space refers to applying the noise to the predicted action directly. For example, if \(p\) is our (deterministic) policy function, an exploration action using space noise type is obtained by computing
for any input state \(s\) and policy weights \(\theta\). In case of parameter noise, the noise will be applied to each weight of \(p\) instead. Hence, the noised action is computed via
The parameter adaptive specifies whether the noise variance \(\sigma\) should be taken relative to the magnitude of the action magnitudes or weight magnitudes for space and parameter noise respectively. In terms of the former, this would mean that
and analogous for parameter noise.
Whichever noise type and parameters are the best highly depends on the behavior of the environment and therefore we cannot give a general recommendation.
Note
TD3 target policy smoothing: sigma_train and clip control the noise added to the target actor when computing Bellman targets — this is TD3’s target policy smoothing regularization, not noise applied during rollout collection. These two roles (target smoothing vs. exploration) are intentionally separate and should be tuned independently. For DDPG, sigma_train has no effect as DDPG does not use target policy smoothing.
For algorithm type sac, only action bounds are required as the stochastic policy with squashed Gaussian noise is built into the algorithm. The actor type for SAC is always gaussian (squashed Gaussian policy) and cannot be customized.
For algorithm type ppo, two actor types are supported: gaussian_ac uses a standard Gaussian policy with action clipping, while squashed_gaussian_ac uses a squashed (tanh-bounded) Gaussian policy with action scaling — the latter is recommended when the action space requires strict bounds.
Policy and Critic Properties
The block in the configuration file defining model properties for actor/policy and critic/value are similar to the supervised learning case (see Model Properties). In this case, TorchFort supports different model properties for policy and critic. The block configuration looks as follows:
critic_model:
type: <critic_model_type>
parameters:
<option> = <value>
policy_model:
type: <policy_model_type>
parameters:
<option> = <value>
In case of PPO, the policy and critic model are the same. Therefore, the block configuration looks as follows:
actor_critic_model:
type: <actor_critic_model_type>
parameters:
<option> = <value>
The following table lists the available policy and critic model types for the different training algorithms.
Model Type |
Description |
Allowed for Algorithms |
Type |
|---|---|---|---|
|
Load a model from an exported TorchScript file |
All |
policy_model, critic_model, actor_critic_model |
|
Use built-in MLP model |
DDPG, TD3 |
policy_model |
|
Use built-in critic MLP model |
DDPG, TD3, SAC |
critic_model |
|
Use built-in soft actor critic MLP model |
SAC |
policy_model |
|
Use built-in actor-critic MLP model |
PPO |
actor_critic_model |
The following table lists the available options for each model type:
Model Type |
Option |
Data Type |
Description |
|---|---|---|---|
|
|
string |
path to TorchScript exported model file |
|
|
list of integers |
sequence of input/output sizes for linear layers e.g., |
|
float |
probability of an element to be zeroed in dropout layers (default = |
|
|
bool |
if set, input tensors are reshaped from |
|
|
|
list of integers |
sequence of input/output sizes for linear layers e.g., |
|
float |
probability of an element to be zeroed in dropout layers (default = |
|
|
|
list of integers |
sequence of input/output sizes for linear layers e.g., |
|
float |
probability of an element to be zeroed in dropout layers (default = |
|
|
bool |
if set, the returned variance estimate sigma is a function of the state (default = |
|
|
float |
initial value for the log sigma (default = |
|
|
|
list of integers |
sequence of input/output sizes for linear layers of common encoder part e.g., |
|
list of integers |
sequence of input/output sizes for linear layers of actor part e.g., |
|
|
list of integers |
sequence of input/output sizes for linear layers of value part e.g., |
|
|
float |
probability of an element to be zeroed in dropout layers (default = |
|
|
bool |
if set, the returned variance estimate sigma is a function of the state (default = |
|
|
float |
initial value for the log sigma (default = |
Note
For algorithms which use multiple critics networks such as TD3, the critic model is copied internally num_critic times and the weights are randomly initialized for each of these models independently.
Note
In case of SAC algorithm, make sure that the policy network not only returns the mean actions value tensor but also the log probability sigma tensor. As an example see the policy function implementation of stable baselines.
Note
In case of actor-critic models, the policy network is used for both policy and value function. Those models use a common encoder which only takes the state as input but returns a action mean, action log variance (similar to SAC) as well as value estimates.
Learning Rate Schedule Properties
For reinforcement learning, TorchFort supports different learning rate schedules for policy and critic.
The block configuration for DDPG and TD3 looks as follows:
critic_lr_scheduler:
type: <schedule_type>
parameters:
<option> = <value>
policy_lr_scheduler:
type: <schedule_type>
parameters:
<option> = <value>
SAC Automatic Entropy Tuning
SAC supports automatic tuning of the entropy regularization coefficient \(\alpha\). To enable it,
add an alpha_optimizer block using the same format as the main optimizer block:
alpha_optimizer:
type: <optimizer_type>
parameters:
<option> = <value>
When alpha_optimizer is present, \(\alpha\) becomes a trainable scalar parameter updated to
drive the policy entropy toward target_entropy. The initial value of \(\alpha\) is set by the
alpha parameter in the algorithm block (default 0.0; if left at 0.0 a reasonable default
of 0.01 is used and a warning is emitted). See Optimizer Properties for available
optimizer types and options.
Note
Reward normalization (normalize_rewards: true) is strongly recommended when using
alpha_optimizer, as it keeps Q-values on a consistent scale and makes the automatic entropy
tuning robust across tasks with different reward magnitudes.
An optional learning rate scheduler for \(\alpha\) can also be configured:
alpha_lr_scheduler:
type: <schedule_type>
parameters:
<option> = <value>
Since all parameters are shared between policy and critic for actor-critic models, the following block configuration can be used:
lr_scheduler:
type: <schedule_type>
parameters:
<option> = <value>
Refer to the Learning Rate Schedule Properties for available scheduler types and options.
General Remarks
Example YAML files for training the different algorithms are available in the `tests/rl/configs <<../../tests/rl/configs/>>`_ directory.