Dataset construction

Here are the commands you can run to re-create our synthetic dataset. We assume that /workspace is defined in your cluster config and that you are running commands on a Slurm cluster. Adjust the commands accordingly if you are running locally or using different paths.

Math data

Solution generation

We use problems from the OpenMathReasoning dataset. First, download them with the following Python snippet and place the resulting file inside /workspace/open-reasoning/sdg on your Slurm cluster.

from datasets import load_dataset, concatenate_datasets

dataset = load_dataset("nvidia/OpenMathReasoning")

# Drop generation-related columns; we only need the problems and their expected answers
cols_to_remove = ['generation_model', 'generated_solution', 'inference_mode', 'used_in_kaggle']
dataset['cot'] = dataset['cot'].remove_columns(cols_to_remove)
dataset['additional_problems'] = dataset['additional_problems'].remove_columns(cols_to_remove)
full_data = concatenate_datasets([dataset['cot'], dataset['additional_problems']])

full_data.to_json("math-problems.jsonl")
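
Optionally, sanity-check the resulting file before moving on. A minimal sketch; the expected_answer field is the one the judging steps below rely on, so it is worth confirming it survived the export:

import json

with open("math-problems.jsonl") as f:
    first = json.loads(next(f))
    num_problems = 1 + sum(1 for _ in f)

print(f"{num_problems} problems")
print(f"fields: {sorted(first)}")  # should include 'problem' and 'expected_answer'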

Next, prepare DeepSeek-R1-0528 to run on Slurm. Here we assume the model is hosted on 16 H100 GPUs, but other GPU configurations are possible with corresponding modifications to the commands.

To download the models, you can run the following from the /workspace folder on Slurm. We will also need Qwen2.5-32B-Instruct to use as the judge for answer correctness.

huggingface-cli download deepseek-ai/DeepSeek-R1-0528 --local-dir DeepSeek-R1-0528
huggingface-cli download Qwen/Qwen2.5-32B-Instruct --local-dir Qwen2.5-32B-Instruct

The next step is optional, but we recommend sharding the checkpoint to avoid very long loading times.

from nemo_skills.pipeline.cli import run_cmd, wrap_arguments

# Shard the checkpoint for tensor-parallel loading across 2 nodes x 8 GPUs
cmd = (
    "python3 nemo_skills/conversion/save_sharded_state.py "
    "    --model-path=/workspace/DeepSeek-R1-0528 "
    "    --output=/workspace/DeepSeek-R1-0528-tp16 "
    "    --tensor-parallel-size=16 "
    "    --context-len=8192 "
    "    --trust-remote-code "
    "    --nnodes 2 "
    "    --dist-init-addr $SLURM_MASTER_NODE:20000 "
    "    --node-rank $SLURM_PROCID "
)

run_cmd(
    ctx=wrap_arguments(cmd),
    cluster="slurm",
    num_gpus=8,
    num_nodes=2,
    container="sglang",
    log_dir="/workspace/DeepSeek-R1-0528-tp16",
)

Finally, launch the data generation command. You can adjust num_chunks (how many jobs to launch in parallel) and dependent_jobs (how many jobs to launch sequentially, in case there is a fixed timeout on your cluster) to fit your setup.

from nemo_skills.pipeline.cli import generate, run_cmd, wrap_arguments

cluster = 'slurm'
tokens_to_generate = 32768
num_solutions = 16

# Main generation - this will take a lot of time and GPUs!
# You can select a subset of data to run on if you want to test things
generate(
    ctx=wrap_arguments(
        f"++prompt_config=generic/math "
        f"++inference.temperature=0.6 "
        f"++inference.tokens_to_generate={tokens_to_generate} "
    ),
    cluster=cluster,
    input_file="/workspace/open-reasoning/sdg/math-problems.jsonl",
    output_dir="/workspace/open-reasoning/sdg/solutions",
    expname="r1-0528-math-solutions",
    model="/workspace/DeepSeek-R1-0528-tp16",
    server_type="sglang",
    server_gpus=8,
    server_nodes=2,
    server_args=f"--load-format sharded_state --context-length {tokens_to_generate + 2000}",
    num_random_seeds=num_solutions,
    # set these according to your cluster configuration
    # num_chunks=N,
    # dependent_jobs=M,
)
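
# With num_random_seeds=16 above, generation writes one file per seed
# (output-rs0.jsonl, ..., output-rs15.jsonl) into output_dir; the final
# preparation step below globs exactly these files. A quick completeness
# check once the jobs finish (illustrative sketch):
#
#   import glob
#   files = glob.glob("/workspace/open-reasoning/sdg/solutions/output-rs*.jsonl")
#   assert len(files) == num_solutions, f"found only {len(files)} seed files"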

# Judge step. This one is very fast, as it only compares the predicted
# and expected answers for each solution; it doesn't check the reasoning.
generate(
    ctx=wrap_arguments(""),
    cluster=cluster,
    generation_type="math_judge",
    input_dir=f"/workspace/open-reasoning/sdg/solutions",
    output_dir=f"/workspace/open-reasoning/sdg/solutions-judged",
    expname="r1-0528-math-solutions-judge",
    run_after="r1-0528-math-solutions",
    model="/workspace/Qwen2.5-32B-Instruct",
    server_type="sglang",
    server_gpus=8,
    num_random_seeds=num_solutions,
)
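
# Once the judge jobs finish, you can estimate how often R1 matched the
# ground truth before the majority-vote fixup. Illustrative sketch; we
# assume each judged sample carries a "judgement" field whose text
# contains "Judgement: Yes" on a match (verify the field name on your
# own outputs):
#
#   import glob, json
#   total = correct = 0
#   for fname in glob.glob("/workspace/open-reasoning/sdg/solutions-judged/output-rs*.jsonl"):
#       with open(fname) as f:
#           for line in f:
#               sample = json.loads(line)
#               total += 1
#               correct += "judgement: yes" in sample.get("judgement", "").lower()
#   print(f"{correct}/{total} solutions judged correct")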

# We then change all "expected_answer" values to the R1 majority answer
# whenever not a single generation matches the ground truth. While there
# are some really hard problems for which this will not be correct, we
# found that in most cases when R1 cannot match the GT answer even once,
# the GT answer itself is incorrect.
run_cmd(
    ctx=wrap_arguments(
        "python /nemo_run/code/recipes/openreasoning/scripts/use_majority_if_no_answer.py "
        "    /workspace/open-reasoning/sdg/solutions-judged "
        "    /workspace/open-reasoning/sdg/maj-if-no-correct "
    ),
    cluster=cluster,
    expname="change-to-majority-if-no-correct",
    run_after="r1-0528-math-solutions-judge",
    log_dir="/workspace/open-reasoning/sdg/maj-if-no-correct",
)
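
# For clarity, the core logic of use_majority_if_no_answer.py, sketched
# in a few lines (field names besides "expected_answer" are hypothetical;
# see the actual script for the real implementation):
#
#   from collections import Counter
#   # for each problem, with `generations` holding its 16 judged solutions:
#   if not any(gen["is_correct"] for gen in generations):
#       majority_answer, _ = Counter(
#           gen["predicted_answer"] for gen in generations
#       ).most_common(1)[0]
#       for gen in generations:
#           gen["expected_answer"] = majority_answer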

# Next we re-judge the data to keep matches with the new majority answer
# (should cover non-string match cases like 0.5 vs 1/2)
generate(
    ctx=wrap_arguments(""),
    cluster=cluster,
    generation_type="math_judge",
    input_dir=f"/workspace/open-reasoning/sdg/maj-if-no-correct",
    output_dir=f"/workspace/open-reasoning/sdg/maj-if-no-correct-judged",
    expname="r1-0528-math-solutions-judge-after-majority",
    run_after="change-to-majority-if-no-correct",
    model="/workspace/Qwen2.5-32B-Instruct",
    server_type="sglang",
    server_gpus=8,
    num_random_seeds=num_solutions,
)

# As the final step, we convert this data to the format that can be used for SFT.
# This script will also filter out anything not judged as correct.
cmd = (
    "python -m nemo_skills.training.prepare_data "
    "    ++prompt_template=qwen-instruct "
    "    ++prompt_config=generic/math "
    "    ++input_files='/workspace/open-reasoning/sdg/maj-if-no-correct-judged/output-rs*.jsonl' "
    "    ++output_path=/workspace/open-reasoning/sft-data-math.jsonl "
    "    ++filters.drop_multi_boxed=false "
    "    ++filters.trim_prefix=false "
    "    ++filters.remove_no_think_tags=true "
    "    ++filters.remove_contaminated=false "  # OpenMathReasoning is already decontaminated
    "    ++filters.remove_len_outlier_solutions=false "
    "    ++filters.remove_len_outlier_problems=false "
    "    ++use_judgement=true "
)
run_cmd(
    ctx=wrap_arguments(cmd),
    cluster=cluster,
    log_dir="/workspace/open-reasoning/sft-data-math-logs",
    expname='prepare-for-sft-math',
    run_after="r1-0528-math-solutions-judge-after-majority",
)

The final data, ready for training, will then be available at /workspace/open-reasoning/sft-data-math.jsonl.
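
To spot-check the result, you can count how many examples survived the filtering and inspect one record. A minimal sketch; we print the field names rather than assuming a specific schema, since the exact output format depends on the prepare_data configuration:

import json

num_examples = 0
with open("/workspace/open-reasoning/sft-data-math.jsonl") as f:
    for line in f:
        num_examples += 1
        if num_examples == 1:
            print(sorted(json.loads(line)))  # fields of a single SFT record
print(f"{num_examples} SFT examples after filtering")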

GenSelect data

Coming soon!

Code data

Coming soon!

Science data

Coming soon!