# Training using verl or OpenRLHF
!!! info

    Depending on the algorithm/framework, this pipeline is launched through a different starting script. All extra parameters that you put inside `wrap_arguments` are passed directly to that underlying training script.
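For example, here is a minimal sketch of how extra flags are forwarded. The flag names below are illustrative assumptions only, not options documented on this page; consult the underlying training script for the flags it actually accepts.

```python
from nemo_skills.pipeline.cli import wrap_arguments

# everything placed inside wrap_arguments is forwarded verbatim to the
# underlying training script; these flag names are illustrative only
ctx = wrap_arguments("--max_epochs=2 --learning_rate=1e-5")
```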
!!! warning

    OpenRLHF support is experimental and incomplete. We use the following custom fork, and it might not be easy to switch to the official repository versions.

    - OpenRLHF: https://github.com/Kipok/OpenRLHF

    The documentation here is incomplete, so we advise you to open an issue if you plan to try something that is not covered below to get additional support.

    For OpenRLHF, please use the following non-default `vllm` container: `igitman/nemo-skills-vllm:0.6.0`
## SFT with OpenRLHF
Here is an example of running an SFT job with OpenRLHF. Our standard SFT data format can be used here; a rough sketch of what such a data file can look like follows the example.
```python
from nemo_skills.pipeline.cli import wrap_arguments, sft_openrlhf

sft_openrlhf(
    ctx=wrap_arguments(""),
    cluster="slurm",
    expname="test-openrlhf-sft",
    output_dir="/workspace/test-openrlhf-sft",
    hf_model="/hf_models/Qwen2.5-1.5B-Instruct",
    training_data="/data/sft-data.jsonl",
    num_gpus=8,
    num_nodes=2,
    num_training_jobs=1,
)
```
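If you assemble the training file yourself, the sketch below shows one way to write a JSONL file. The field names (`input`, `output`) are assumptions used purely for illustration; check the data preparation documentation for the exact schema our standard SFT format expects.

```python
import json

# hypothetical records; the exact keys required by the standard SFT
# format may differ from the `input`/`output` names used here
examples = [
    {"input": "What is 2 + 2?", "output": "2 + 2 = 4."},
    {"input": "Factor x^2 - 1.", "output": "x^2 - 1 = (x - 1)(x + 1)."},
]

# one JSON object per line
with open("sft-data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```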
## PPO with OpenRLHF
Here is an example of running a PPO job with OpenRLHF. Our standard SFT data format can be used here as well (see the data sketch in the SFT section above).
```python
from nemo_skills.pipeline.cli import wrap_arguments, ppo_openrlhf

ppo_openrlhf(
    ctx=wrap_arguments(
        "--ref_num_gpus_per_node=4 "
        "--actor_num_gpus_per_node=4 "
        "--vllm_num_engines=2 "
        "--vllm_tensor_parallel_size=2 "
        "--ref_num_nodes=1 "
        "--actor_num_nodes=1 "
        "--colocate_actor_ref "
        "--advantage_estimator=reinforce "
        "--remote_rm_url /nemo_run/code/nemo_skills/training/openrlhf/math_reward.py "
    ),
    cluster="slurm",
    expname="test-openrlhf-ppo",
    output_dir="/workspace/test-openrlhf-ppo",
    hf_model="/hf_models/Qwen2.5-1.5B-Instruct",
    prompt_data="/data/rl-data.jsonl",
    num_gpus=8,
    num_nodes=2,
    # this is used for the LLM judge
    server_gpus=8,
    server_type='trtllm',
    server_model='/trt_models/qwen2.5-32b-instruct',
    num_training_jobs=1,
)
```
## PPO with verl
Here is an example of running a PPO job with verl. You can use `nemo_skills/training/verl/prepare_data.py` to convert our standard SFT data format into parquet (a rough illustration of that conversion follows the example below).
```python
from nemo_skills.pipeline.cli import wrap_arguments, ppo_verl

ppo_verl(
    ctx=wrap_arguments(
        '++trainer.save_freq=0 '
        '++data.train_batch_size=32 '
        '++reward_model.compute_score=math-judge '
        '++reward_model.reward_manager=batched '
        '++data.filter_prompts=False '
        '++actor_rollout_ref.rollout.gpu_memory_utilization=0.7 '
        '++data.max_response_length=12000 '
        '++actor_rollout_ref.rollout.n=64 '
        '++actor_rollout_ref.rollout.tensor_model_parallel_size=2 '
    ),
    cluster="slurm",
    expname="test-verl-ppo",
    output_dir="/workspace/test-verl-ppo",
    hf_model="/hf_models/Qwen2.5-1.5B-Instruct",
    prompt_data="/data/rl-data.parquet",
    num_gpus=8,
    num_nodes=2,
    # this is used for the LLM judge
    server_gpus=8,
    server_type='trtllm',
    server_model='/trt_models/qwen2.5-32b-instruct',
    num_training_jobs=1,
)
```
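The supported conversion path is the `prepare_data.py` script mentioned above. Purely as an illustration of the underlying idea, a JSONL-to-parquet conversion can look like the pandas sketch below; the real script likely adds verl-specific columns, so do not treat this as a drop-in replacement.

```python
import pandas as pd

# illustrative only: read a JSONL file and write it back out as parquet;
# nemo_skills/training/verl/prepare_data.py is the supported way to
# produce verl-compatible parquet files
df = pd.read_json("rl-data.jsonl", lines=True)
df.to_parquet("rl-data.parquet", index=False)
```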