
Training using verl or OpenRLHF

Info

Depending on the algorithm/framework, this pipeline uses a different starting script. All extra parameters (the string wrapped with wrap_arguments in the examples below) are passed through to that script.

Warning

OpenRLHF and verl support is experimental and incomplete. We use the following custom forks, and it might not be easy to switch to the official versions of these repositories.

  • OpenRLHF: https://github.com/Kipok/OpenRLHF
  • verl: https://github.com/titu1994/verl

The documentation here is incomplete, so if you plan to try something that is not covered below, we advise opening an issue to get additional support.

SFT with OpenRLHF

Here is an example of running an SFT job with OpenRLHF. Our standard SFT data format can be used for the training data; see the sketch after the example below.

from nemo_skills.pipeline.cli import wrap_arguments, sft_openrlhf

sft_openrlhf(
    ctx=wrap_arguments(""),
    cluster="slurm",
    expname="test-openrlhf-sft",
    output_dir="/workspace/test-openrlhf-sft",
    hf_model="/hf_models/Qwen2.5-1.5B-Instruct",
    training_data="/data/sft-data.jsonl",
    num_gpus=8,
    num_nodes=2,
    num_training_jobs=1,
)
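
The training_data file above is a JSONL file in our standard SFT data format. As a rough illustration of what preparing such a file looks like (the "input"/"output" field names are an assumption of this sketch, not a confirmed schema; consult the data preparation documentation for the exact fields):

import json

# Hypothetical example records: the "input"/"output" field names are an
# assumption of this sketch, not a confirmed schema.
examples = [
    {"input": "What is 2 + 2?", "output": "2 + 2 = 4, so the answer is 4."},
    {"input": "Solve x + 3 = 5.", "output": "Subtracting 3 from both sides gives x = 2."},
]

# Write one JSON object per line, then upload the result to the cluster path
# used above (e.g. /data/sft-data.jsonl).
with open("sft-data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")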

PPO with OpenRLHF

Here is an example of running a PPO job with OpenRLHF. Our standard SFT data format can be used for the prompt data as well.

from nemo_skills.pipeline.cli import wrap_arguments, ppo_openrlhf

ppo_openrlhf(
    ctx=wrap_arguments(
        "--ref_num_gpus_per_node=4 "
        "--actor_num_gpus_per_node=4 "
        "--vllm_num_engines=2 "
        "--vllm_tensor_parallel_size=2 "
        "--ref_num_nodes=1 "
        "--actor_num_nodes=1 "
        "--colocate_actor_ref "
        "--advantage_estimator=reinforce "
        "--remote_rm_url /nemo_run/code/nemo_skills/training/openrlhf/math_reward.py "
    ),
    cluster="slurm",
    expname="test-openrlhf-ppo",
    output_dir="/workspace/test-openrlhf-ppo",
    hf_model="/hf_models/Qwen2.5-1.5B-Instruct",
    prompt_data="/data/rl-data.jsonl",
    num_gpus=8,
    num_nodes=2,
    # this is used for the LLM judge
    server_gpus=8,
    server_type='trtllm',
    server_model='/trt_models/qwen2.5-32b-instruct',
    num_training_jobs=1,
)
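
The --remote_rm_url argument points OpenRLHF at a Python file that scores the generations, here the math reward shipped in nemo_skills/training/openrlhf/math_reward.py. As a very rough sketch of the idea only (the function name, signature and return type below are assumptions about OpenRLHF's scripted-reward interface, not a description of the actual math_reward.py):

import torch

# Hypothetical scripted reward: the reward_func(queries, prompts, labels)
# interface is an assumption of this sketch; check the OpenRLHF fork for the
# real contract used by --remote_rm_url.
def reward_func(queries, prompts, labels):
    # queries contain the prompt plus the generated response; give reward 1.0
    # when the expected answer string appears in the generation, else 0.0.
    rewards = [
        1.0 if label and label in query else 0.0
        for query, label in zip(queries, labels)
    ]
    return torch.tensor(rewards, dtype=torch.float32)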

PPO with verl

Here is an example of running a PPO job with verl. You can use nemo_skills/training/verl/prepare_data.py to convert our standard SFT data format into parquet first.
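
Conceptually, that conversion just turns each JSONL record into a row of a parquet file. A minimal sketch of the idea (not the actual prepare_data.py script, which may rename or add columns; pandas with a parquet engine such as pyarrow is assumed to be installed):

import json

import pandas as pd

# Hypothetical jsonl -> parquet conversion; the real
# nemo_skills/training/verl/prepare_data.py may produce different columns.
records = []
with open("rl-data.jsonl") as f:
    for line in f:
        records.append(json.loads(line))

# Requires a parquet engine (e.g. pyarrow) to be installed.
pd.DataFrame(records).to_parquet("rl-data.parquet")

With the parquet file in place (e.g. uploaded to /data/rl-data.parquet), the training job can be launched: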

from nemo_skills.pipeline.cli import wrap_arguments, ppo_verl

ppo_verl(
    ctx=wrap_arguments(
        '++trainer.save_freq=0 '
        '++data.train_batch_size=32 '
        '++reward_model.compute_score=math-judge '
        '++reward_model.reward_manager=batched '
        '++data.filter_prompts=False '
        '++actor_rollout_ref.rollout.gpu_memory_utilization=0.7 '
        '++data.max_response_length=12000 '
        '++actor_rollout_ref.rollout.n=64 '
        '++actor_rollout_ref.rollout.tensor_model_parallel_size=2 '
    ),
    cluster="slurm",
    expname="test-verl-ppo",
    output_dir="/workspace/test-verl-ppo",
    hf_model="/hf_models/Qwen2.5-1.5B-Instruct",
    prompt_data="/data/rl-data.parquet",
    num_gpus=8,
    num_nodes=2,
    # this is used for the LLM judge
    server_gpus=8,
    server_type='trtllm',
    server_model='/trt_models/qwen2.5-32b-instruct',
    num_training_jobs=1,
)