MegatronBridge#
This workload (test_template_name is MegatronBridge) submits training and finetuning tasks based on the Megatron-Bridge framework.
Note
This workload has a hard requirement for the HuggingFace Hub token. There are two options:
- (recommended) define the HF_TOKEN environment variable
- set cmd_args.hf_token in either the Test or Scenario config
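A pre-flight check for this requirement might look like the minimal sketch below. The helper name hf_token_source is hypothetical and not part of CloudAI. It only reports which of the two options are populated and makes no claim about which one CloudAI prefers when both are set.

```python
import os
from typing import List, Mapping, Optional


def hf_token_source(cmd_args: Mapping[str, object],
                    env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return which of the two documented token options are populated.

    Hypothetical helper for illustration only, not a CloudAI API.
    """
    env = os.environ if env is None else env
    sources = []
    # Option 1 (recommended): the HF_TOKEN environment variable.
    if str(env.get("HF_TOKEN", "")).strip():
        sources.append("HF_TOKEN")
    # Option 2: cmd_args.hf_token from the Test or Scenario config.
    if str(cmd_args.get("hf_token", "") or "").strip():
        sources.append("cmd_args.hf_token")
    return sources
```

An empty result means neither option is configured, so the workload's token requirement would not be met.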
Usage Examples#
Test TOML example:
name = "megatron_bridge_qwen_30b"
description = "Megatron-Bridge run via CloudAI SlurmSystem for Qwen3 30B A3B"
test_template_name = "MegatronBridge"
[[git_repos]]
url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
commit = "v0.3.0"
mount_as = "/opt/Megatron-Bridge"
[cmd_args]
gpu_type = "gb200"
gpus_per_node = 8
num_gpus = 8
# Container can be an NGC/enroot URL (nvcr.io#...) or a local .sqsh path.
container_image = "nvcr.io#nvidia/nemo:26.02.00"
model_family_name = "qwen"
model_recipe_name = "qwen3_30b_a3b"
task = "pretrain"
domain = "llm"
compute_dtype = "fp8_mx"
Test Scenario example:
name = "megatron_bridge_qwen_30b"
[[Tests]]
id = "megatron_bridge_qwen_30b"
test_name = "megatron_bridge_qwen_30b"
num_nodes = 2
Test-in-Scenario example:
name = "megatron-bridge-test"
[[Tests]]
id = "mbridge.1"
num_nodes = 2
time_limit = "00:30:00"
name = "megatron_bridge_qwen_30b"
description = "Megatron-Bridge run via CloudAI SlurmSystem for Qwen3 30B A3B"
test_template_name = "MegatronBridge"
[[Tests.git_repos]]
url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
commit = "v0.3.0"
mount_as = "/opt/Megatron-Bridge"
[Tests.cmd_args]
container_image = "nvcr.io#nvidia/nemo:26.02.01"
model_family_name = "qwen"
model_recipe_name = "qwen3_30b_a3b"
gpu_type = "gb200"
gpus_per_node = 8
num_gpus = 8
task = "pretrain"
domain = "llm"
compute_dtype = "fp8_mx"
API Documentation#
Command Arguments#
- class cloudai.workloads.megatron_bridge.megatron_bridge.MegatronBridgeCmdArgs(
    *,
    gpu_type: str = 'gb200',
    log_dir: str = '',
    container_image: str = '',
    num_gpus: int = 8,
    enable_vboost: bool | None = False,
    dryrun: bool | None = False,
    enable_nsys: bool | None = False,
    domain: str | None = None,
    hidden_size: int | None = None,
    num_layers: int | None = None,
    pipeline_model_parallel_layout: str | None = None,
    first_k_dense_replace: int | None = None,
    model_family_name: str = '',
    model_recipe_name: str = '',
    use_recipes: bool | None = None,
    task: str = 'pretrain',
    compute_dtype: str = 'bf16',
    fp8_recipe: str | None = None,
    hf_token: str = '',
    nemo_home: str | None = None,
    wandb_key: str | None = None,
    wandb_project_name: str | None = None,
    wandb_entity_name: str | None = None,
    wandb_experiment_name: str | None = None,
    wandb_save_dir: str | None = None,
    max_retries: int | None = 1,
    use_tokendrop: bool | List[bool] | None = None,
    use_megatron_fsdp: bool | List[bool] | None = None,
    cuda_graph_impl: str | List[str] | None = None,
    cuda_graph_scope: str | List[str] | None = None,
    tp: int | List[int] | None = None,
    pp: int | List[int] | None = None,
    cp: int | List[int] | None = None,
    vp: int | List[int] | None = None,
    ep: int | List[int] | None = None,
    et: int | List[int] | None = None,
    mb: int | List[int] | None = None,
    gb: int | List[int] | None = None,
    seq_length: int | List[int] | None = None,
    lr: float | List[float] | None = None,
    min_lr: float | List[float] | None = None,
    warmup_iters: int | List[int] | None = None,
    pretrained_checkpoint: str | None = None,
    save_dir: str | None = None,
    load_dir: str | None = None,
    save_interval: int | None = None,
    most_recent_k: int | None = None,
    save_config_filepath: str | None = '/nemo_run/configs/ConfigContainer.yaml',
    data: str | None = None,
    dataset_paths: str | List[str] | None = None,
    dataset_root: str | None = None,
    index_mapping_dir: str | None = None,
    dataset_name: str | None = None,
    packed_sequence: bool | None = None,
    head_only: bool | None = None,
    tokenizer_type: str | None = None,
    tokenizer_model: str | None = None,
    vocab_size: int | None = None,
    pytorch_profiler: bool | None = None,
    profiling_start_step: int | None = None,
    profiling_stop_step: int | None = None,
    record_memory_history: bool | None = None,
    profiling_gpu_metrics: bool | None = None,
    profiling_ranks: int | str | List[int] | None = None,
    nsys_trace: str | List[str] | None = None,
    nsys_extra_args: str | List[str] | None = None,
    nccl_ub: bool | List[bool] | None = None,
    moe_a2a_overlap: bool | List[bool] | None = None,
    max_steps: int | None = 10,
    recompute_num_layers: int | List[int] | None = None,
    activation_offload_layers: int | List[int] | None = None,
    recompute_modules: str | List[str] | None = None,
    num_distributed_optimizer_instances: int | None = None,
    config_variant: str | None = None,
    list_config_variants: bool | None = None,
    **extra_data: Any,
)
Bases: CmdArgs
Megatron-Bridge launcher arguments (translated into setup_experiment.py flags).
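Many fields in this signature accept either a scalar or a list (for example tp, pp and mb). As an illustration only, assume that a list holds candidate values and that a grid-search agent enumerates their Cartesian product. The expansion could then be sketched as follows. The function name expand_grid is hypothetical, CloudAI's actual design-space exploration may differ, and fields such as dataset_paths may legitimately be lists rather than sweeps, which this sketch ignores.

```python
from itertools import product
from typing import Any, Dict, Iterator


def expand_grid(cmd_args: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one concrete cmd_args dict per combination of list-valued fields.

    Assumption: every list-valued field is a set of candidates to sweep.
    """
    sweep_keys = [k for k, v in cmd_args.items() if isinstance(v, list)]
    for combo in product(*(cmd_args[k] for k in sweep_keys)):
        # Overlay the chosen scalar for each swept field onto the base args.
        yield {**cmd_args, **dict(zip(sweep_keys, combo))}


trials = list(expand_grid({"task": "pretrain", "tp": [1, 2], "pp": [1, 2, 4]}))
print(len(trials))  # 6
print(trials[0])    # {'task': 'pretrain', 'tp': 1, 'pp': 1}
```

With no list-valued fields the generator yields the input unchanged exactly once, so a plain run is the one-point grid.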
Test Definition#
- class cloudai.workloads.megatron_bridge.megatron_bridge.MegatronBridgeTestDefinition(
    *,
    name: str,
    description: str,
    test_template_name: str,
    cmd_args: cloudai.workloads.megatron_bridge.megatron_bridge.MegatronBridgeCmdArgs,
    dse_excluded_args: list[str] = <factory>,
    extra_env_vars: dict[str, str | typing.List[str]] = {},
    extra_cmd_args: dict[str, str] = {},
    extra_container_mounts: list[str] = [],
    git_repos: list[cloudai._core.installables.git_repo.GitRepo] = [],
    nsys: cloudai.models.workload.NsysConfiguration | None = None,
    predictor: cloudai.models.workload.PredictorConfig | None = None,
    agent: str = 'grid_search',
    agent_steps: int = 1,
    agent_metrics: list[str] = ['default'],
    agent_reward_function: str = 'inverse',
    agent_config: dict[str, typing.Any] | None = None,
    env_params: dict[str, cloudai.configurator.env_params.EnvParamSpec] = <factory>,
    nemo_run_repo: cloudai._core.installables.git_repo.GitRepo = GitRepo(url=https://github.com/NVIDIA-NeMo/Run.git, commit=main),
)
Bases: TestDefinition
Megatron-Bridge test definition (CloudAI-managed install + Slurm submission via launcher).
- validator validate_git_repos_has_megatron_bridge_repo » git_repos
MegatronBridge requires users to pin the Megatron-Bridge repo version via [[git_repos]].
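The intent of this validator can be illustrated with the sketch below. How CloudAI actually recognises the Megatron-Bridge entry is not stated on this page. Matching on the repository name in the URL is an assumption, and find_megatron_bridge_repo is a hypothetical helper operating on plain dicts rather than GitRepo objects.

```python
from typing import Dict, List


def find_megatron_bridge_repo(git_repos: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the pinned Megatron-Bridge repo entry or raise ValueError."""
    for repo in git_repos:
        # Assumption: the repo is identified by its URL and must carry a commit pin;
        # the real validator may use a different rule.
        if "Megatron-Bridge" in repo.get("url", "") and repo.get("commit"):
            return repo
    raise ValueError(
        "MegatronBridge requires a pinned Megatron-Bridge entry in [[git_repos]]"
    )
```

Applied to the [[git_repos]] entry from the usage examples (url, commit = "v0.3.0", mount_as), the helper returns that entry; an empty list raises ValueError.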
- was_run_successful(tr: TestRun)
Ensure that the MBridge script finished correctly.
The Megatron-Bridge performance scripts currently force us to run their tool until one very specific failure. Right before that failure, the M-Bridge script saves the output metrics as JSON. At the point of failure, the script asks for reference golden values, which we don’t have. The script would then run a convergence test comparing the provided golden values against the actual values, which we don’t need.
- property is_domain_randomization_enabled: bool
Whether the config declares domain randomization (at least one env_params annotation).
Type: bool
- is_dse_excluded_arg(path: str) → bool
Return whether a dot-separated cmd_args path should be ignored by DSE.
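A minimal sketch of such a lookup is shown below, assuming that an entry in dse_excluded_args excludes both the exact path and everything nested under it. That matching rule is an assumption made for illustration, and is_excluded is a hypothetical stand-alone function rather than the method itself.

```python
from typing import List


def is_excluded(path: str, dse_excluded_args: List[str]) -> bool:
    """Return True if a dot-separated path equals, or is nested under, an excluded entry."""
    return any(
        path == excluded or path.startswith(excluded + ".")
        for excluded in dse_excluded_args
    )
```

Comparing against the entry plus a trailing dot keeps an entry such as tp from accidentally matching an unrelated field whose name merely starts with the same letters.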
- is_env_sampled(cmd_args_path: str)
Whether a cmd_args field is env-sampled (env draws it per trial, not the agent).
- validator validate_env_params » all fields
Validate env_params annotations against cmd_args.
env_params is an annotation: each key names a cmd_args field whose value is the candidate set (the single source of truth), and the entry carries only how to sample. So each key must name a real cmd_args field whose value is a candidate list; a scalar is already fixed, so annotating it is a meaningless label and is rejected here. When weights are declared, the list needs >= 2 values and the weights must align 1:1 with it. Sampling, persistence, the per-trial cmd_args overlay, and the cache key all live in CloudAIGymEnv; keeping this shape check in core lets the overlay stay agent- and workload-agnostic rather than re-implemented per workload.
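The shape rules described here can be sketched as a stand-alone check. This is an illustration under stated assumptions, not CloudAI's validator: cmd_args is modelled as a flat dict (real keys may be dot-separated paths), each annotation is modelled as a plain dict with an optional "weights" entry instead of an EnvParamSpec, and check_env_params is a hypothetical name.

```python
from typing import Any, Dict


def check_env_params(cmd_args: Dict[str, Any],
                     env_params: Dict[str, Dict[str, Any]]) -> None:
    """Raise ValueError if an env_params annotation does not match cmd_args."""
    for key, spec in env_params.items():
        # Each key must name a real cmd_args field.
        if key not in cmd_args:
            raise ValueError(f"env_params key '{key}' is not a cmd_args field")
        candidates = cmd_args[key]
        # A scalar is already fixed, so there is nothing to sample.
        if not isinstance(candidates, list):
            raise ValueError(f"cmd_args.{key} is a scalar, not a candidate list")
        weights = spec.get("weights")
        if weights is not None:
            # Weighted sampling needs at least two candidates...
            if len(candidates) < 2:
                raise ValueError(f"cmd_args.{key} needs >= 2 values when weights are set")
            # ...and exactly one weight per candidate.
            if len(weights) != len(candidates):
                raise ValueError(f"weights for '{key}' must align 1:1 with its candidates")
```

For example, annotating tp = [1, 2] with weights [0.5, 0.5] passes, while annotating a scalar tp = 2, an unknown key, or a two-element list with a single weight raises ValueError.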