MegatronBridge#

This workload (test_template_name is MegatronBridge) submits training and fine-tuning jobs built on the Megatron-Bridge framework.

Note

This workload requires a Hugging Face Hub token. There are two ways to provide it:

  • define the HF_TOKEN environment variable (recommended)

  • set cmd_args.hf_token in either the Test or the Scenario config
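For the environment-variable route, the token can be exported in the launching shell before invoking CloudAI. A minimal sketch; the value below is a placeholder, not a real credential:

```shell
# Recommended: make the Hugging Face token visible to the workload.
# "hf_your_token_here" is a placeholder -- substitute a real token.
export HF_TOKEN="hf_your_token_here"
echo "$HF_TOKEN"
```

Alternatively, cmd_args.hf_token can carry the token directly in the Test or Scenario TOML, at the cost of storing a credential in a config file.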

Usage Examples#

Test TOML example:

name = "megatron_bridge_qwen_30b"
description = "Megatron-Bridge run via CloudAI SlurmSystem for Qwen3 30B A3B"
test_template_name = "MegatronBridge"

[[git_repos]]
url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
commit = "v0.3.0"
mount_as = "/opt/Megatron-Bridge"

[cmd_args]
gpu_type = "gb200"
gpus_per_node = 8
num_gpus = 8
# Container can be an NGC/enroot URL (nvcr.io#...) or a local .sqsh path.
container_image = "nvcr.io#nvidia/nemo:26.02.00"

model_family_name = "qwen"
model_recipe_name = "qwen3_30b_a3b"
task = "pretrain"
domain = "llm"
compute_dtype = "fp8_mx"

Test Scenario example:

name = "megatron_bridge_qwen_30b"

[[Tests]]
id = "megatron_bridge_qwen_30b"
test_name = "megatron_bridge_qwen_30b"
num_nodes = 2

Test-in-Scenario example:

name = "megatron-bridge-test"

[[Tests]]
id = "mbridge.1"
num_nodes = 2
time_limit = "00:30:00"

name = "megatron_bridge_qwen_30b"
description = "Megatron-Bridge run via CloudAI SlurmSystem for Qwen3 30B A3B"
test_template_name = "MegatronBridge"

  [[Tests.git_repos]]
  url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
  commit = "v0.3.0"
  mount_as = "/opt/Megatron-Bridge"

  [Tests.cmd_args]
  container_image = "nvcr.io#nvidia/nemo:26.02.01"
  model_family_name = "qwen"
  model_recipe_name = "qwen3_30b_a3b"

  gpu_type = "gb200"
  gpus_per_node = 8
  num_gpus = 8

  task = "pretrain"
  domain = "llm"
  compute_dtype = "fp8_mx"

API Documentation#

Command Arguments#

class cloudai.workloads.megatron_bridge.megatron_bridge.MegatronBridgeCmdArgs(
*,
gpu_type: str = 'gb200',
log_dir: str = '',
time_limit: str = '00:05:00',
container_image: str = '',
num_gpus: int = 8,
gpus_per_node: int = 8,
enable_vboost: bool | None = False,
dryrun: bool | None = False,
enable_nsys: bool | None = False,
domain: str | None = None,
hidden_size: int | None = None,
num_layers: int | None = None,
pipeline_model_parallel_layout: str | None = None,
first_k_dense_replace: int | None = None,
model_family_name: str = '',
model_recipe_name: str = '',
use_recipes: bool | None = None,
task: str = 'pretrain',
compute_dtype: str = 'bf16',
fp8_recipe: str | None = None,
hf_token: str = '',
nemo_home: str | None = None,
wandb_key: str | None = None,
wandb_project_name: str | None = None,
wandb_entity_name: str | None = None,
wandb_experiment_name: str | None = None,
wandb_save_dir: str | None = None,
max_retries: int | None = 1,
use_tokendrop: bool | List[bool] | None = None,
use_megatron_fsdp: bool | List[bool] | None = None,
cuda_graph_impl: str | List[str] | None = None,
cuda_graph_scope: str | List[str] | None = None,
tp: int | List[int] | None = None,
pp: int | List[int] | None = None,
cp: int | List[int] | None = None,
vp: int | List[int] | None = None,
ep: int | List[int] | None = None,
et: int | List[int] | None = None,
mb: int | List[int] | None = None,
gb: int | List[int] | None = None,
seq_length: int | List[int] | None = None,
lr: float | List[float] | None = None,
min_lr: float | List[float] | None = None,
warmup_iters: int | List[int] | None = None,
pretrained_checkpoint: str | None = None,
save_dir: str | None = None,
load_dir: str | None = None,
save_interval: int | None = None,
most_recent_k: int | None = None,
save_config_filepath: str | None = None,
data: str | None = None,
dataset_paths: str | List[str] | None = None,
dataset_root: str | None = None,
index_mapping_dir: str | None = None,
dataset_name: str | None = None,
packed_sequence: bool | None = None,
head_only: bool | None = None,
tokenizer_type: str | None = None,
tokenizer_model: str | None = None,
vocab_size: int | None = None,
pytorch_profiler: bool | None = None,
profiling_start_step: int | None = None,
profiling_stop_step: int | None = None,
record_memory_history: bool | None = None,
profiling_gpu_metrics: bool | None = None,
profiling_ranks: int | List[int] | None = None,
nsys_trace: str | List[str] | None = None,
nsys_extra_args: str | List[str] | None = None,
nccl_ub: bool | List[bool] | None = None,
moe_a2a_overlap: bool | List[bool] | None = None,
max_steps: int | None = 10,
recompute_num_layers: int | List[int] | None = None,
activation_offload_layers: int | List[int] | None = None,
recompute_modules: str | List[str] | None = None,
num_distributed_optimizer_instances: int | None = None,
config_variant: str | None = None,
list_config_variants: bool | None = None,
**extra_data: Any,
)[source]#

Bases: CmdArgs

Megatron-Bridge launcher arguments (translated into setup_experiment.py flags).
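Note that many of the tuning knobs above (tp, pp, cp, ep, mb, gb, seq_length, lr, and others) accept either a scalar or a list. The following is a hedged sketch of how list-valued fields might describe a sweep for the default grid_search agent; that lists expand into a sweep this way is an assumption based on the field types, so verify against the CloudAI documentation before relying on it:

```toml
# Hypothetical sweep fragment: list values give candidate settings.
[cmd_args]
gpu_type = "gb200"
gpus_per_node = 8
num_gpus = 8
container_image = "nvcr.io#nvidia/nemo:26.02.00"
model_family_name = "qwen"
model_recipe_name = "qwen3_30b_a3b"
task = "pretrain"
compute_dtype = "bf16"
tp = [1, 2, 4]   # tensor-parallel candidates
mb = [1, 2]      # micro-batch candidates
```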

Test Definition#

class cloudai.workloads.megatron_bridge.megatron_bridge.MegatronBridgeTestDefinition(
*,
name: str,
description: str,
test_template_name: str,
cmd_args: MegatronBridgeCmdArgs,
extra_env_vars: dict[str, str | List[str]] = {},
extra_cmd_args: dict[str, str] = {},
extra_container_mounts: list[str] = [],
git_repos: list[GitRepo] = [],
nsys: NsysConfiguration | None = None,
predictor: PredictorConfig | None = None,
agent: str = 'grid_search',
agent_steps: int = 1,
agent_metrics: list[str] = ['default'],
agent_reward_function: str = 'inverse',
agent_config: dict[str, Any] | None = None,
nemo_run_repo: GitRepo = GitRepo(url=https://github.com/NVIDIA-NeMo/Run.git, commit=main),
)[source]#

Bases: TestDefinition

Megatron-Bridge test definition (CloudAI-managed install + Slurm submission via launcher).

validator validate_git_repos_has_megatron_bridge_repo  »  git_repos[source]#

MegatronBridge requires users to pin the Megatron-Bridge repo version via [[git_repos]].
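The rule the validator enforces can be illustrated with a small standalone check. This is a sketch of its intent, not the actual pydantic implementation; the function name and the dict shape are invented for illustration:

```python
# Illustrative sketch of the check performed by the validator -- not the
# actual CloudAI implementation. It verifies that at least one configured
# [[git_repos]] entry points at the Megatron-Bridge repository.

def has_megatron_bridge_repo(git_repos: list[dict]) -> bool:
    """Return True if any configured repo URL references Megatron-Bridge."""
    return any("Megatron-Bridge" in repo.get("url", "") for repo in git_repos)

repos = [{"url": "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git",
          "commit": "v0.3.0"}]
print(has_megatron_bridge_repo(repos))  # True
print(has_megatron_bridge_repo([]))     # False
```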

was_run_successful(
tr: TestRun,
) → JobStatusResult[source]#

Ensure that the MBridge script finished correctly.

The current Megatron-Bridge performance scripts are designed to run until a specific, expected failure. Right before that failure, the script saves its output metrics as JSON.

  • At the point of failure, the script asks for reference golden values, which we don’t have.

  • The script would then run a convergence test between the provided golden values and the actual results, which we don’t need.
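Given that behavior, success detection can key off the presence of the saved metrics rather than the exit status. The following is a hedged sketch of that idea, not CloudAI’s actual implementation; the metrics file name is a hypothetical placeholder:

```python
# Hedged sketch: treat a run as successful if the metrics JSON was written
# before the expected golden-value failure. The file name
# "output_metrics.json" is a hypothetical placeholder.
import json
from pathlib import Path

def metrics_written(output_dir: Path, name: str = "output_metrics.json") -> bool:
    """Return True if a parseable metrics JSON exists in output_dir."""
    path = output_dir / name
    if not path.is_file():
        return False
    try:
        json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    return True
```

A check like this tolerates the script’s deliberate late failure: the job is judged on whether the metrics made it to disk, not on the process exit code.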