# MegatronBridge
This workload (`test_template_name = "MegatronBridge"`) submits training and fine-tuning tasks based on the Megatron-Bridge framework.
Note

This workload has a hard requirement for a HuggingFace Hub token. There are two options:

- (recommended) define the `HF_TOKEN` environment variable
- set `cmd_args.hf_token` in either the Test or Scenario config
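As a sketch of the second option (the token value below is a placeholder, and the exact table layout may differ in your config):

```toml
[cmd_args]
hf_token = "hf_xxx"  # placeholder; substitute your own HuggingFace Hub token
```

With the recommended first option, `HF_TOKEN` is instead exported in the environment from which CloudAI is launched, so no token appears in the TOML file.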
## Usage Examples
Test TOML example:

```toml
name = "megatron_bridge_qwen_30b"
description = "Megatron-Bridge run via CloudAI SlurmSystem for Qwen3 30B A3B"
test_template_name = "MegatronBridge"

[[git_repos]]
url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
commit = "v0.3.0"
mount_as = "/opt/Megatron-Bridge"

[cmd_args]
gpu_type = "gb200"
gpus_per_node = 8
num_gpus = 8
# Container can be an NGC/enroot URL (nvcr.io#...) or a local .sqsh path.
container_image = "nvcr.io#nvidia/nemo:26.02.00"
model_family_name = "qwen"
model_recipe_name = "qwen3_30b_a3b"
task = "pretrain"
domain = "llm"
compute_dtype = "fp8_mx"
```
Test Scenario example:

```toml
name = "megatron_bridge_qwen_30b"

[[Tests]]
id = "megatron_bridge_qwen_30b"
test_name = "megatron_bridge_qwen_30b"
num_nodes = "2"
```
Test-in-Scenario example:

```toml
name = "megatron-bridge-test"

[[Tests]]
id = "mbridge.1"
num_nodes = 2
time_limit = "00:30:00"
name = "megatron_bridge_qwen_30b"
description = "Megatron-Bridge run via CloudAI SlurmSystem for Qwen3 30B A3B"
test_template_name = "MegatronBridge"

[[Tests.git_repos]]
url = "https://github.com/NVIDIA-NeMo/Megatron-Bridge.git"
commit = "v0.3.0"
mount_as = "/opt/Megatron-Bridge"

[Tests.cmd_args]
container_image = "nvcr.io#nvidia/nemo:26.02.01"
model_family_name = "qwen"
model_recipe_name = "qwen3_30b_a3b"
gpu_type = "gb200"
gpus_per_node = 8
num_gpus = 8
task = "pretrain"
domain = "llm"
compute_dtype = "fp8_mx"
```
## API Documentation
### Command Arguments
```python
class cloudai.workloads.megatron_bridge.megatron_bridge.MegatronBridgeCmdArgs(
    *,
    gpu_type: str = 'gb200',
    log_dir: str = '',
    time_limit: str = '00:05:00',
    container_image: str = '',
    num_gpus: int = 8,
    gpus_per_node: int = 8,
    enable_vboost: bool | None = False,
    dryrun: bool | None = False,
    enable_nsys: bool | None = False,
    domain: str | None = None,
    hidden_size: int | None = None,
    num_layers: int | None = None,
    pipeline_model_parallel_layout: str | None = None,
    first_k_dense_replace: int | None = None,
    model_family_name: str = '',
    model_recipe_name: str = '',
    use_recipes: bool | None = None,
    task: str = 'pretrain',
    compute_dtype: str = 'bf16',
    fp8_recipe: str | None = None,
    hf_token: str = '',
    nemo_home: str | None = None,
    wandb_key: str | None = None,
    wandb_project_name: str | None = None,
    wandb_entity_name: str | None = None,
    wandb_experiment_name: str | None = None,
    wandb_save_dir: str | None = None,
    max_retries: int | None = 1,
    use_tokendrop: bool | List[bool] | None = None,
    use_megatron_fsdp: bool | List[bool] | None = None,
    cuda_graph_impl: str | List[str] | None = None,
    cuda_graph_scope: str | List[str] | None = None,
    tp: int | List[int] | None = None,
    pp: int | List[int] | None = None,
    cp: int | List[int] | None = None,
    vp: int | List[int] | None = None,
    ep: int | List[int] | None = None,
    et: int | List[int] | None = None,
    mb: int | List[int] | None = None,
    gb: int | List[int] | None = None,
    seq_length: int | List[int] | None = None,
    lr: float | List[float] | None = None,
    min_lr: float | List[float] | None = None,
    warmup_iters: int | List[int] | None = None,
    pretrained_checkpoint: str | None = None,
    save_dir: str | None = None,
    load_dir: str | None = None,
    save_interval: int | None = None,
    most_recent_k: int | None = None,
    save_config_filepath: str | None = None,
    data: str | None = None,
    dataset_paths: str | List[str] | None = None,
    dataset_root: str | None = None,
    index_mapping_dir: str | None = None,
    dataset_name: str | None = None,
    packed_sequence: bool | None = None,
    head_only: bool | None = None,
    tokenizer_type: str | None = None,
    tokenizer_model: str | None = None,
    vocab_size: int | None = None,
    pytorch_profiler: bool | None = None,
    profiling_start_step: int | None = None,
    profiling_stop_step: int | None = None,
    record_memory_history: bool | None = None,
    profiling_gpu_metrics: bool | None = None,
    profiling_ranks: int | List[int] | None = None,
    nsys_trace: str | List[str] | None = None,
    nsys_extra_args: str | List[str] | None = None,
    nccl_ub: bool | List[bool] | None = None,
    moe_a2a_overlap: bool | List[bool] | None = None,
    max_steps: int | None = 10,
    recompute_num_layers: int | List[int] | None = None,
    activation_offload_layers: int | List[int] | None = None,
    recompute_modules: str | List[str] | None = None,
    num_distributed_optimizer_instances: int | None = None,
    config_variant: str | None = None,
    list_config_variants: bool | None = None,
    **extra_data: Any,
)
```

Bases: `CmdArgs`

Megatron-Bridge launcher arguments (translated into `setup_experiment.py` flags).
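Many of the tunable arguments above accept either a single value or a list (for example, `tp: int | List[int]`). Combined with the default `grid_search` agent on the test definition, a list-valued argument expresses a parameter sweep. A hedged sketch (the values below are illustrative only, not recommended settings):

```toml
[cmd_args]
# List values sweep the parameter; each (tp, pp) combination becomes one run
tp = [1, 2]
pp = [1, 2]
# Scalar values stay fixed across all runs of the sweep
gb = 32
```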
### Test Definition
```python
class cloudai.workloads.megatron_bridge.megatron_bridge.MegatronBridgeTestDefinition(
    *,
    name: str,
    description: str,
    test_template_name: str,
    cmd_args: MegatronBridgeCmdArgs,
    extra_env_vars: dict[str, str | List[str]] = {},
    extra_cmd_args: dict[str, str] = {},
    extra_container_mounts: list[str] = [],
    git_repos: list[GitRepo] = [],
    nsys: NsysConfiguration | None = None,
    predictor: PredictorConfig | None = None,
    agent: str = 'grid_search',
    agent_steps: int = 1,
    agent_metrics: list[str] = ['default'],
    agent_reward_function: str = 'inverse',
    agent_config: dict[str, Any] | None = None,
    nemo_run_repo: GitRepo = GitRepo(url=https://github.com/NVIDIA-NeMo/Run.git, commit=main),
)
```

Bases: `TestDefinition`

Megatron-Bridge test definition (CloudAI-managed install + Slurm submission via launcher).
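Beyond `cmd_args`, the test definition accepts fields such as `extra_env_vars` and `extra_container_mounts`. A minimal sketch, assuming these map to a top-level list and table in the Test TOML (the env var and mount path below are illustrative):

```toml
# Mount additional host paths into the container (host:container)
extra_container_mounts = ["/lustre/data:/data"]

[extra_env_vars]
NCCL_DEBUG = "INFO"  # example env var propagated into the job
```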
- validator `validate_git_repos_has_megatron_bridge_repo` » `git_repos`

  MegatronBridge requires users to pin the Megatron-Bridge repo version via `[[git_repos]]`.
- `was_run_successful(tr: TestRun)`

  Ensure that the Megatron-Bridge script finished correctly.

  The current state of the Megatron-Bridge performance scripts requires us to run their tool until a specific, expected failure:

  - Right before the failure, the script saves its output metrics as JSON.
  - At the point of failure, the script asks for reference golden values, which we do not have.
  - The script would then run a convergence test between the provided golden values and the actual values, which we do not need.