Dataset construction¶
The OpenMathReasoning-1 dataset consists of mathematical problems collected from the AoPS community forums. Below we describe the pipeline used to create this dataset. All relevant scripts are available in the recipes/omr1 folder.
If you don't have a slurm cluster with a large number of GPUs,
you can still try out all the steps of our pipeline by using Nvidia NIM models. We include
a 10-sample subset of the raw data in configs/example-data.txt and you can
switch to that data and NIM models by adding --mode demo
to all the pipeline commands. We also use different models
in this "demo" mode to make it faster, but you can change configs/demo.yaml to pick
any other models supported on https://build.nvidia.com. Make sure to define the NVIDIA_API_KEY
environment variable for this to work
(and ignore the scraping and model preparation steps, as they are not needed when using NIM models).
Finally, please go through the getting started documentation to understand how the commands below work and to avoid running into errors.
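For example, the demo setup only requires exporting the API key before invoking a pipeline script (the script name below is an assumption, see the recipes/omr1 folder for the actual entry points):

export NVIDIA_API_KEY=...   # your key from https://build.nvidia.com
python recipes/omr1/pipeline/problem_generation.py --mode demo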
Data scraping¶
There is a great open-source AoPS-Instruct repository where you can find the scripts to scrape the data. There is also a DeepStudentLlama/AoPS-Instruct HF dataset where the raw forum data can be found. While we didn't use that repository/dataset in our work directly, it should produce a similar output to our internal scripts.
To download and preprocess the raw data, you can run:
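As a rough sketch (the script path is an assumption, check the recipes/omr1 folder for the actual preparation script):

python recipes/omr1/prepare_raw_data.py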
This script will rename certain columns in the original dataset to align with our scripts, combine forum discussions into
a single string, remove quotes, and truncate discussions that are longer than 24000 tokens. The prepared data will be
saved as raw_aops_data.jsonl.
The output file should have ~550k rows, so all of the following commands will take a very long time and require a large number of GPUs if you want to run them on the full data. If you just want to try out the full pipeline, we recommend subsampling the dataset, e.g. by running
mv raw_aops_data.jsonl raw_aops_data_full.jsonl
head -n 1000 raw_aops_data_full.jsonl > raw_aops_data.jsonl
Model conversion¶
Here are the steps to download/convert all models that we used to create this dataset.
Download the models by running the following on the cluster, from the path that is mounted as /hf_models
in your cluster config.
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen2.5-32B-Instruct --local-dir Qwen2.5-32B-Instruct
huggingface-cli download Qwen/QwQ-32B --local-dir QwQ-32B
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir DeepSeek-R1
Convert the models to TensorRT-LLM with the following commands (you can skip this step and make the corresponding changes in the pipeline
scripts to run with server_type=vllm or server_type=sglang).
ns convert \
--cluster=slurm \
--input_model=/hf_models/Qwen2.5-32B-Instruct \
--output_model=/trt_models/qwen2.5-32b-instruct \
--convert_from=hf \
--convert_to=trtllm \
--num_gpus=8 \
--model_type=qwen \
--hf_model_name=Qwen/Qwen2.5-32B-Instruct \
--max_input_len 24580 \
--max_seq_len 32768
ns convert \
--cluster=slurm \
--input_model=/hf_models/QwQ-32B \
--output_model=/trt_models/qwq-32b \
--convert_from=hf \
--convert_to=trtllm \
--num_gpus=8 \
--model_type=qwen \
--hf_model_name=Qwen/QwQ-32B \
--max_input_len 24580 \
--max_seq_len 32768
At the time of our experiments, serving the DeepSeek-R1 model with sglang was faster than with TensorRT-LLM, so we do not convert that model and instead prepare a sharded checkpoint that is much faster to load.
from nemo_skills.pipeline.cli import run_cmd, wrap_arguments
from nemo_skills.pipeline.utils import get_ray_server_cmd

# shard the DeepSeek-R1 checkpoint to tensor-parallel size 16 so it loads much faster at serving time
cmd = get_ray_server_cmd(
    "python3 nemo_skills/conversion/save_sharded_state.py "
    " --model=/hf_models/DeepSeek-R1 "
    " --output=/hf_models/DeepSeek-R1-tp16 "
    " --tensor-parallel-size=16 "
    " --max_model_len=8192 "
    " --trust-remote-code "
    " --enforce-eager "
)
run_cmd(
    ctx=wrap_arguments(cmd),
    cluster="slurm",
    num_gpus=8,
    num_nodes=2,
    # we are using vllm's script to shard, but the model can be served with sglang
    container="vllm",
    log_dir="/hf_models/DeepSeek-R1-tp16",
)
Problem generation pipeline¶
The problem generation pipeline consists of the following stages:
- Extract all problems from the first forum post (extract_problems stage).
- Classify whether each problem belongs to one of the following categories: proof question, binary question, multiple-choice question, invalid question (classify_problems stage).
- Extract answers from the forum discussions (extract_answers stage).
- Convert proof questions to answer questions (convert_proofs stage).
- Remove all binary/multiple-choice/invalid problems and merge the remaining problems with the converted proofs (merge_data stage).
- Decontaminate the resulting questions with popular math benchmarks (decontaminate stage).
You can run the full pipeline with
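As a sketch, assuming the entry point is recipes/omr1/pipeline/problem_generation.py (check the recipes/omr1 folder for the actual script name and arguments):

python recipes/omr1/pipeline/problem_generation.py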
You can specify a subset of stages using the --stages argument, e.g. --stages extract_problems or --stages classify_problems,extract_answers.
If you want to run using Nvidia NIM models on the 10 example questions, add --mode demo.
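For example, to run only the first two stages against the NIM models (the script name, as above, is an assumption):

python recipes/omr1/pipeline/problem_generation.py --mode demo --stages extract_problems,classify_problems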
Solution generation pipeline¶
The solution generation pipeline consists of the following stages:
- Generate solutions for each of the prepared problems (generate_solutions stage).
- Fill the majority answer for all problems where the ground-truth answer is not known (fill_majority_answer stage).
- Judge answers using an LLM. Only the final answer is compared to the ground-truth (or majority) answer, not the full solution (judge_answers stage).
- TODO: generate a new summary
- Filter out all incorrect solutions and prepare the data for SFT (prepare_for_sft stage).
You can run the full pipeline using QwQ-32B as the solution generation model with
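As a sketch, assuming the entry point is recipes/omr1/pipeline/solution_generation.py (check the recipes/omr1 folder for the actual script name and arguments):

python recipes/omr1/pipeline/solution_generation.py --mode full-qwq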
You can specify a subset of stages using the --stages argument and can switch between the QwQ and R1 models using --mode full-qwq or --mode full-r1.
If you want to run using Nvidia NIM models on the 10 example questions, add --mode demo.
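For example, to generate R1 solutions for just the first stage (the script name, as above, is an assumption):

python recipes/omr1/pipeline/solution_generation.py --mode full-r1 --stages generate_solutions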