Extras

Weights and Biases Support

Weights and Biases (wandb) is a popular tool for monitoring machine learning training workflows. Wandb supports plotting and comparing loss curves and other relevant deep learning metrics, system utilization (including GPU, CPU and memory utilization) and other advanced logging functionalities like uploading images and/or videos, network weights and data artifacts.

Since wandb does not currently offer a C++ interface and thus cannot be called from Fortran or C/C++ directly, we've implemented a wandb daemon written in Python instead. This daemon runs as a background process and waits for changes in a log file generated by the TorchFort application. To enable wandb support for your TorchFort application, perform the following steps.

Add Custom Metrics Reporting to Application

The TorchFort training routines already log the training step, loss values and learning rate, all of which are captured by the wandb daemon. Additional custom metrics can be added manually by the user. For this purpose, the user may add calls to torchfort_wandb_log or torchfort_rl_off_policy_wandb_log for traditional and reinforcement learning applications respectively (see Supervised Learning and Reinforcement Learning for details about why we provide different implementations for these two cases). For more information, see TorchFort C API for C/C++ and TorchFort Fortran API for Fortran applications.

Set up Environment

You need to specify your wandb API key via the environment variable WANDB_API_KEY (see the wandb documentation on available environment variables for details). Furthermore, the daemon needs to know where the wandb logging data from the TorchFort application will be stored. This can be done by defining the environment variable TORCHFORT_LOGDIR. Lastly, a user-defined wandb logging directory WANDB_LOGGING_DIR can be created to gather all wandb information as well as the config file in a place specific to the run.

Note

The logging directory TORCHFORT_LOGDIR needs to be specified before the daemon and TorchFort application are launched.
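The environment setup described above can be sketched as follows. This is only an illustration: the directory locations are placeholders chosen for this example, and creating the directories up front with mkdir is a precaution assumed here, not a documented requirement.

```shell
# Placeholder locations (assumptions); adjust them to your system.
export TORCHFORT_LOGDIR="${PWD}/torchfort_logs"
export WANDB_LOGGING_DIR="${PWD}/wandb_run"

# Create both directories up front; mkdir -p is a no-op if they already exist.
mkdir -p "${TORCHFORT_LOGDIR}" "${WANDB_LOGGING_DIR}"

# WANDB_API_KEY is expected to be exported already (e.g. from a secrets store),
# so it is only checked here rather than hard-coded into the script.
if [ -z "${WANDB_API_KEY:-}" ]; then
    echo "warning: WANDB_API_KEY is not set" >&2
fi
```

Exporting the variables in the same shell session (or job script) that later starts both the daemon and the application ensures that both processes see the same TORCHFORT_LOGDIR.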

Start Background Watcher Process

Now, the wandb daemon process needs to be started. Assuming TorchFort was installed in TORCHFORT_INSTALL_DIR, we can run

python ${TORCHFORT_INSTALL_DIR}/bin/python/wandb_helper.py \
            --wandb_dir=${WANDB_LOGGING_DIR} \
            --wandb_group=<wandb-group-name> \
            --wandb_project=<wandb-project-name> \
            --wandb_entity=<wandb-entity-name> \
            --run_tag=<run-name> \
            --timeout=2400 &

The wandb group, project and entity names identify the wandb project you are logging to and correspond to the respective arguments of wandb.init (see the wandb documentation). Note that the group does not need to exist and will be created during initialization. The run tag can be any alphanumeric string and can be used to identify the specific run on the wandb dashboard. Lastly, the timeout (measured in seconds) determines how long the background process will wait for changes to appear in ${TORCHFORT_LOGDIR}/torchfort.log before wrapping up the monitoring.

Note

Do not forget to launch the daemon in the background (note the trailing & in the command above).
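In a job script it can be convenient to keep track of the background watcher so that the script does not exit while the watcher is still uploading data. The following is a minimal sketch of that pattern; run_with_watcher is a hypothetical helper written for this example and is not part of TorchFort.

```shell
# run_with_watcher WATCHER_CMD APP_CMD
# Hypothetical helper: starts WATCHER_CMD in the background, runs APP_CMD in
# the foreground, then waits for the watcher to exit on its own.
run_with_watcher() {
    rww_watcher_cmd=$1
    rww_app_cmd=$2

    # Launch the watcher in the background and remember its process ID.
    sh -c "${rww_watcher_cmd}" &
    rww_watcher_pid=$!

    # Run the application in the foreground and record its exit status.
    rww_app_status=0
    sh -c "${rww_app_cmd}" || rww_app_status=$?

    # Block until the watcher has finished.
    wait "${rww_watcher_pid}" || echo "watcher exited with non-zero status" >&2

    return "${rww_app_status}"
}

# Example call (not executed here), using the commands from this section:
# run_with_watcher \
#   "python ${TORCHFORT_INSTALL_DIR}/bin/python/wandb_helper.py --wandb_dir=${WANDB_LOGGING_DIR} --timeout=2400" \
#   "./my_torchfort_app arg1 arg2 arg3"
```

Note that wait blocks until the watcher exits by itself. Judging by the timeout description above, this presumably happens only once no new log lines have appeared for the configured timeout, so the script may keep running for up to that long after the application finishes.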

Start Your TorchFort Application

In the configuration file for your TorchFort application, make sure to enable wandb logging in the general section by adding or modifying the line enable_wandb_hook: 1. Lastly, start the TorchFort application as usual, e.g.:

./my_torchfort_app arg1 arg2 arg3

The daemon process will pick up the log lines from ${TORCHFORT_LOGDIR}/torchfort.log and display the data on the corresponding job dashboard.
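If nothing shows up on the dashboard, a first thing to check is whether the hook is actually enabled in the configuration file. The following sketch uses a hypothetical helper, wandb_hook_enabled, and assumes the option appears on a line of its own in the form enable_wandb_hook: 1, as described above.

```shell
# wandb_hook_enabled CONFIG_FILE
# Hypothetical helper: returns 0 if CONFIG_FILE contains a line of the form
# "enable_wandb_hook: 1" (leading and trailing whitespace allowed).
# Limitation: it does not verify that the line sits inside the general section.
wandb_hook_enabled() {
    grep -Eq '^[[:space:]]*enable_wandb_hook:[[:space:]]*1[[:space:]]*$' "$1"
}

# Example call (the file name is a placeholder):
# wandb_hook_enabled my_config.yaml || echo "wandb hook is not enabled" >&2
```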

Note

The daemon may finalize the monitoring while the TorchFort application is still running if the timeout is not set sufficiently large. This is a particular risk for long-running applications with very sparse logging.
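One way to pick a timeout is to estimate the longest gap you expect between two log lines and multiply it by a safety margin. The sketch below illustrates this rule of thumb; suggest_timeout and the default safety factor of 2 are assumptions made for this example, not TorchFort recommendations.

```shell
# suggest_timeout MAX_GAP_SECONDS [SAFETY_FACTOR]
# Hypothetical helper: prints a timeout in seconds, computed as the longest
# expected gap between log lines times a safety factor (default: 2).
suggest_timeout() {
    st_max_gap=$1
    st_factor=${2:-2}
    echo $(( st_max_gap * st_factor ))
}

# Example: with at most 20 minutes (1200 s) between log lines, this yields
# the value 2400 that is passed to --timeout in the command shown earlier.
TIMEOUT=$(suggest_timeout 1200)
echo "--timeout=${TIMEOUT}"
```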