MegatronBridge#

This workload (test_template_name is MegatronBridge) submits training and fine-tuning jobs built on the Megatron-Bridge framework.

Note

This workload requires a Hugging Face Hub token. There are two ways to provide it:

  • define the HF_TOKEN environment variable (recommended)

  • set cmd_args.hf_token in either the Test or the Scenario config
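For the environment-variable route, the token can be exported in the launching shell before invoking CloudAI. A minimal sketch; the value below is a placeholder, not a real credential:

```shell
# Recommended: make the Hugging Face token visible to the workload.
# "hf_your_token_here" is a placeholder -- substitute a real token.
export HF_TOKEN="hf_your_token_here"
echo "$HF_TOKEN"
```

Alternatively, cmd_args.hf_token can carry the token directly in the Test or Scenario TOML, at the cost of storing a credential in a config file.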

Usage Examples#

Test TOML example:

name = "megatron_bridge_qwen_30b"
description = "Megatron-Bridge run via CloudAI SlurmSystem for Qwen3 30B A3B"
test_template_name = "MegatronBridge"

[[git_repos]]
url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
commit = "v0.3.0"
mount_as = "/opt/Megatron-Bridge"

[cmd_args]
gpu_type = "gb200"
gpus_per_node = 8
num_gpus = 8
# Container can be an NGC/enroot URL (nvcr.io#...) or a local .sqsh path.
container_image = "nvcr.io#nvidia/nemo:26.02.00"

model_family_name = "qwen"
model_recipe_name = "qwen3_30b_a3b"
task = "pretrain"
domain = "llm"
compute_dtype = "fp8_mx"

Test Scenario example:

name = "megatron_bridge_qwen_30b"

[[Tests]]
id = "megatron_bridge_qwen_30b"
test_name = "megatron_bridge_qwen_30b"
num_nodes = 2

Test-in-Scenario example:

name = "megatron-bridge-test"

[[Tests]]
id = "mbridge.1"
num_nodes = 2
time_limit = "00:30:00"

name = "megatron_bridge_qwen_30b"
description = "Megatron-Bridge run via CloudAI SlurmSystem for Qwen3 30B A3B"
test_template_name = "MegatronBridge"

  [[Tests.git_repos]]
  url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
  commit = "v0.3.0"
  mount_as = "/opt/Megatron-Bridge"

  [Tests.cmd_args]
  container_image = "nvcr.io#nvidia/nemo:26.02.01"
  model_family_name = "qwen"
  model_recipe_name = "qwen3_30b_a3b"

  gpu_type = "gb200"
  gpus_per_node = 8
  num_gpus = 8

  task = "pretrain"
  domain = "llm"
  compute_dtype = "fp8_mx"

API Documentation#

Command Arguments#

class cloudai.workloads.megatron_bridge.megatron_bridge.MegatronBridgeCmdArgs(
*,
gpu_type: str = 'gb200',
log_dir: str = '',
time_limit: str = '00:05:00',
container_image: str = '',
num_gpus: int = 8,
gpus_per_node: int = 8,
enable_vboost: bool | None = False,
dryrun: bool | None = False,
enable_nsys: bool | None = False,
domain: str | None = None,
hidden_size: int | None = None,
num_layers: int | None = None,
pipeline_model_parallel_layout: str | None = None,
first_k_dense_replace: int | None = None,
model_family_name: str = '',
model_recipe_name: str = '',
use_recipes: bool | None = None,
task: str = 'pretrain',
compute_dtype: str = 'bf16',
fp8_recipe: str | None = None,
hf_token: str = '',
nemo_home: str | None = None,
wandb_key: str | None = None,
wandb_project_name: str | None = None,
wandb_entity_name: str | None = None,
wandb_experiment_name: str | None = None,
wandb_save_dir: str | None = None,
max_retries: int | None = 1,
use_tokendrop: bool | List[bool] | None = None,
use_megatron_fsdp: bool | List[bool] | None = None,
cuda_graph_impl: str | List[str] | None = None,
cuda_graph_scope: str | List[str] | None = None,
tp: int | List[int] | None = None,
pp: int | List[int] | None = None,
cp: int | List[int] | None = None,
vp: int | List[int] | None = None,
ep: int | List[int] | None = None,
et: int | List[int] | None = None,
mb: int | List[int] | None = None,
gb: int | List[int] | None = None,
seq_length: int | List[int] | None = None,
lr: float | List[float] | None = None,
min_lr: float | List[float] | None = None,
warmup_iters: int | List[int] | None = None,
pretrained_checkpoint: str | None = None,
save_dir: str | None = None,
load_dir: str | None = None,
save_interval: int | None = None,
most_recent_k: int | None = None,
save_config_filepath: str | None = None,
data: str | None = None,
dataset_paths: str | List[str] | None = None,
dataset_root: str | None = None,
index_mapping_dir: str | None = None,
dataset_name: str | None = None,
packed_sequence: bool | None = None,
head_only: bool | None = None,
tokenizer_type: str | None = None,
tokenizer_model: str | None = None,
vocab_size: int | None = None,
pytorch_profiler: bool | None = None,
profiling_start_step: int | None = None,
profiling_stop_step: int | None = None,
record_memory_history: bool | None = None,
profiling_gpu_metrics: bool | None = None,
profiling_ranks: int | List[int] | None = None,
nsys_trace: str | List[str] | None = None,
nsys_extra_args: str | List[str] | None = None,
nccl_ub: bool | List[bool] | None = None,
moe_a2a_overlap: bool | List[bool] | None = None,
max_steps: int | None = 10,
recompute_num_layers: int | List[int] | None = None,
activation_offload_layers: int | List[int] | None = None,
recompute_modules: str | List[str] | None = None,
num_distributed_optimizer_instances: int | None = None,
config_variant: str | None = None,
list_config_variants: bool | None = None,
**extra_data: Any,
)[source]#

Bases: CmdArgs

Megatron-Bridge launcher arguments (translated into setup_experiment.py flags).
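Note that many of the tuning knobs above (tp, pp, cp, ep, mb, gb, seq_length, lr, and others) accept either a scalar or a list. The following is a hedged sketch of how list-valued fields might describe a sweep for the default grid_search agent; that lists expand into a sweep this way is an assumption based on the field types, so verify against the CloudAI documentation before relying on it:

```toml
# Hypothetical sweep fragment: list values give candidate settings.
[cmd_args]
gpu_type = "gb200"
gpus_per_node = 8
num_gpus = 8
container_image = "nvcr.io#nvidia/nemo:26.02.00"
model_family_name = "qwen"
model_recipe_name = "qwen3_30b_a3b"
task = "pretrain"
compute_dtype = "bf16"
tp = [1, 2, 4]   # tensor-parallel candidates
mb = [1, 2]      # micro-batch candidates
```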

Test Definition#

class cloudai.workloads.megatron_bridge.megatron_bridge.MegatronBridgeTestDefinition(
*,
name: str,
description: str,
test_template_name: str,
cmd_args: MegatronBridgeCmdArgs,
extra_env_vars: dict[str, str | List[str]] = {},
extra_cmd_args: dict[str, str] = {},
extra_container_mounts: list[str] = [],
git_repos: list[GitRepo] = [],
nsys: NsysConfiguration | None = None,
predictor: PredictorConfig | None = None,
agent: str = 'grid_search',
agent_steps: int = 1,
agent_metrics: list[str] = ['default'],
agent_reward_function: str = 'inverse',
agent_config: dict[str, Any] | None = None,
nemo_run_repo: GitRepo = GitRepo(url=https://github.com/NVIDIA-NeMo/Run.git, commit=main),
)[source]#

Bases: TestDefinition

Megatron-Bridge test definition (CloudAI-managed install + Slurm submission via launcher).

validator validate_git_repos_has_megatron_bridge_repo  »  git_repos[source]#

MegatronBridge requires users to pin the Megatron-Bridge repo version via [[git_repos]].
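The rule the validator enforces can be illustrated with a small standalone check. This is a sketch of its intent, not the actual pydantic implementation; the function name and the dict shape are invented for illustration:

```python
# Illustrative sketch of the check performed by the validator -- not the
# actual CloudAI implementation. It verifies that at least one configured
# [[git_repos]] entry points at the Megatron-Bridge repository.

def has_megatron_bridge_repo(git_repos: list[dict]) -> bool:
    """Return True if any configured repo URL references Megatron-Bridge."""
    return any("Megatron-Bridge" in repo.get("url", "") for repo in git_repos)

repos = [{"url": "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git",
          "commit": "v0.3.0"}]
print(has_megatron_bridge_repo(repos))  # True
print(has_megatron_bridge_repo([]))     # False
```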

was_run_successful(
tr: TestRun,
) → JobStatusResult[source]#

Ensure that the MBridge script finished correctly.

The current Megatron-Bridge performance scripts are designed to run until a specific, expected failure. Right before that failure, the script saves its output metrics as JSON.

  • At the point of failure, the script asks for reference golden values, which we don’t have.

  • The script would then run a convergence test between the provided golden values and the actual results, which we don’t need.
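Given that behavior, success detection can key off the presence of the saved metrics rather than the exit status. The following is a hedged sketch of that idea, not CloudAI’s actual implementation; the metrics file name is a hypothetical placeholder:

```python
# Hedged sketch: treat a run as successful if the metrics JSON was written
# before the expected golden-value failure. The file name
# "output_metrics.json" is a hypothetical placeholder.
import json
from pathlib import Path

def metrics_written(output_dir: Path, name: str = "output_metrics.json") -> bool:
    """Return True if a parseable metrics JSON exists in output_dir."""
    path = output_dir / name
    if not path.is_file():
        return False
    try:
        json.loads(path.read_text())
    except json.JSONDecodeError:
        return False
    return True
```

A check like this tolerates the script’s deliberate late failure: the job is judged on whether the metrics made it to disk, not on the process exit code.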