Dataset construction¶
Here are the commands you can run to re-create our synthetic dataset.
We assume you have /workspace
defined in your cluster config and are
running the commands with a Slurm config. Adjust all commands accordingly if running locally or using different paths.
Math data¶
Solution generation¶
We use problems from the OpenMathReasoning dataset. So first,
download them using the Python snippet below and put the resulting file inside /workspace/open-reasoning/sdg
on your Slurm cluster.
from datasets import concatenate_datasets, load_dataset
dataset = load_dataset("nvidia/OpenMathReasoning")
dataset['cot'] = dataset['cot'].remove_columns(['generation_model', 'generated_solution', 'inference_mode', 'used_in_kaggle'])
dataset['additional_problems'] = dataset['additional_problems'].remove_columns(['generation_model', 'generated_solution', 'inference_mode', 'used_in_kaggle'])
full_data = concatenate_datasets([dataset['cot'], dataset['additional_problems']])
full_data.to_json("math-problems.jsonl")
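Optionally, you can sanity-check the resulting file before uploading it to the cluster. The snippet below is a minimal sketch; it only assumes that each record keeps the problem and expected_answer fields used downstream.
import json

# read the first record and confirm the fields used downstream are present
with open("math-problems.jsonl") as f:
    first_record = json.loads(next(f))
print(sorted(first_record.keys()))
assert "problem" in first_record and "expected_answer" in first_record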
Next, prepare DeepSeek-R1-0528 to run on Slurm. Here we assume that the model is hosted on 16 H100 GPUs, but other GPU configurations are possible with corresponding modifications to the commands.
To download the model you can run the following from the /workspace
folder on Slurm.
We will also need Qwen2.5-32B-Instruct to use as the judge
for answer correctness.
huggingface-cli download deepseek-ai/DeepSeek-R1-0528 --local-dir DeepSeek-R1-0528
huggingface-cli download Qwen/Qwen2.5-32B-Instruct --local-dir Qwen2.5-32B-Instruct
The next step is optional, but we recommend sharding the checkpoint to avoid very long loading times.
from nemo_skills.pipeline.cli import run_cmd, wrap_arguments

cmd = (
    "python3 nemo_skills/conversion/save_sharded_state.py "
    " --model-path=/workspace/DeepSeek-R1-0528 "
    " --output=/workspace/DeepSeek-R1-0528-tp16 "
    " --tensor-parallel-size=16 "
    " --context-len=8192 "
    " --trust-remote-code "
    " --nnodes 2 "
    " --dist-init-addr $SLURM_MASTER_NODE:20000 "
    " --node-rank $SLURM_PROCID "
)
run_cmd(
    ctx=wrap_arguments(cmd),
    cluster="slurm",
    num_gpus=8,
    num_nodes=2,
    container="sglang",
    log_dir="/workspace/DeepSeek-R1-0528-tp16",
)
Finally, launch the data generation command. You can adjust num_chunks
(how many jobs to launch in parallel) and
dependent_jobs
(how many jobs to launch sequentially in case there is a fixed timeout on the cluster) to fit your setup.
from nemo_skills.pipeline.cli import generate, run_cmd, wrap_arguments

cluster = 'slurm'
tokens_to_generate = 32768
num_solutions = 16

# Main generation - this will take a lot of time and GPUs!
# You can select a subset of data to run on if you want to test things
generate(
    ctx=wrap_arguments(
        f"++prompt_config=generic/math "
        f"++inference.temperature=0.6 "
        f"++inference.tokens_to_generate={tokens_to_generate} "
    ),
    cluster=cluster,
    input_file="/workspace/open-reasoning/sdg/math-problems.jsonl",
    output_dir="/workspace/open-reasoning/sdg/solutions",
    expname="r1-0528-math-solutions",
    model="/workspace/DeepSeek-R1-0528-tp16",
    server_type="sglang",
    server_gpus=8,
    server_nodes=2,
    server_args=f"--load-format sharded_state --context-length {tokens_to_generate + 2000}",
    num_random_seeds=num_solutions,
    # set these according to your cluster configuration
    # num_chunks=N,
    # dependent_jobs=M,
)
# Judge step. This one is very fast as it just compares the predicted
# and expected answers for each solution; it does not check the reasoning.
generate(
    ctx=wrap_arguments(""),
    cluster=cluster,
    generation_type="math_judge",
    input_dir="/workspace/open-reasoning/sdg/solutions",
    output_dir="/workspace/open-reasoning/sdg/solutions-judged",
    expname="r1-0528-math-solutions-judge",
    run_after="r1-0528-math-solutions",
    model="/workspace/Qwen2.5-32B-Instruct",
    server_type="sglang",
    server_gpus=8,
    num_random_seeds=num_solutions,
)
# We then change all "expected_answer" values to the majority answer
# from R1 if there is not a single match. While there are some really
# hard problems for which this will not be correct, we found that
# in most cases when R1 is not able to match the GT answer even once,
# the GT answer itself is incorrect.
run_cmd(
    ctx=wrap_arguments(
        "python /nemo_run/code/recipes/openreasoning/scripts/use_majority_if_no_answer.py "
        " /workspace/open-reasoning/sdg/solutions-judged "
        " /workspace/open-reasoning/sdg/maj-if-no-correct "
    ),
    cluster=cluster,
    expname="change-to-majority-if-no-correct",
    run_after="r1-0528-math-solutions-judge",
    log_dir="/workspace/open-reasoning/sdg/maj-if-no-correct",
)
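# For reference, the core logic of that script is roughly the sketch below.
# This is a simplified illustration, not the actual implementation; the
# field names "judgement" and "predicted_answer" are assumptions about the
# judged output format.
from collections import Counter

def use_majority_if_no_correct(solutions):
    """solutions: list of judged generation dicts for a single problem."""
    # keep the ground-truth answer if at least one solution matched it
    if any("Yes" in s.get("judgement", "") for s in solutions):
        return solutions
    predicted = [s["predicted_answer"] for s in solutions if s.get("predicted_answer")]
    if not predicted:  # nothing to vote on
        return solutions
    majority = Counter(predicted).most_common(1)[0][0]
    for s in solutions:
        s["expected_answer"] = majority
    return solutions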
# Next we re-judge the data to keep matches with the new majority answer
# (this should cover non-string-match cases like 0.5 vs 1/2)
generate(
    ctx=wrap_arguments(""),
    cluster=cluster,
    generation_type="math_judge",
    input_dir="/workspace/open-reasoning/sdg/maj-if-no-correct",
    output_dir="/workspace/open-reasoning/sdg/maj-if-no-correct-judged",
    expname="r1-0528-math-solutions-judge-after-majority",
    run_after="change-to-majority-if-no-correct",
    model="/workspace/Qwen2.5-32B-Instruct",
    server_type="sglang",
    server_gpus=8,
    num_random_seeds=num_solutions,
)
# As the final step we convert this data to the format that can be used for SFT.
# This script will also filter out anything not judged as correct.
cmd = (
    "python -m nemo_skills.training.prepare_data "
    " ++prompt_template=qwen-instruct "
    " ++prompt_config=generic/math "
    " ++input_files='/workspace/open-reasoning/sdg/maj-if-no-correct-judged/output-rs*.jsonl' "
    " ++output_path=/workspace/open-reasoning/sft-data-math.jsonl "
    " ++filters.drop_multi_boxed=false "
    " ++filters.trim_prefix=false "
    " ++filters.remove_no_think_tags=true "
    " ++filters.remove_contaminated=false "  # OpenMathReasoning is already decontaminated
    " ++filters.remove_len_outlier_solutions=false "
    " ++filters.remove_len_outlier_problems=false "
    " ++use_judgement=true "
)
run_cmd(
    ctx=wrap_arguments(cmd),
    cluster=cluster,
    log_dir="/workspace/open-reasoning/sft-data-math-logs",
    expname='prepare-for-sft-math',
    run_after="r1-0528-math-solutions-judge-after-majority",
)
The final data that's ready for training will then be available in /workspace/open-reasoning/sft-data-math.jsonl.
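As a quick check before training, you can count the records and peek at the fields of one example. This is a minimal sketch that assumes nothing beyond the output being a JSONL file; the exact field names depend on the prepare_data configuration.
import json

# stream the file to avoid loading all solutions into memory
num_records = 0
first_record = None
with open("/workspace/open-reasoning/sft-data-math.jsonl") as f:
    for line in f:
        if first_record is None:
            first_record = json.loads(line)
        num_records += 1
print(f"{num_records} SFT examples")
print(sorted(first_record.keys()))  # exact fields depend on the prepare_data config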
GenSelect data¶
Coming soon!
Code data¶
Coming soon!
Science data¶
Coming soon!