Model evaluation¶
Here are the commands you can run to reproduce our evaluation numbers.
The commands below use the OpenMath-Nemotron-1.5B model as an example. We assume you have /workspace defined in your cluster config and are executing all commands from that folder locally. Adjust the commands accordingly if you are running on Slurm or using different paths.
Interactive Chat Interface
Besides the benchmark numbers shown below, you can also interactively chat with OpenMath models using our chat interface. This allows you to easily test both Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR) modes with code execution in a user-friendly web UI.
Note
For small benchmarks such as AIME24 and AIME25 (30 problems each), expect significant variation across evaluation reruns. We have seen differences as large as 6%, even for results averaged across 64 generations. So don't expect to reproduce exactly the numbers presented in our paper, but they should be within 3-6% of the reported results.
Prepare evaluation data¶
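If the benchmark data is not yet available in your workspace, it can typically be downloaded with the ns prepare_data command. A minimal sketch, assuming the benchmark names match those used in the eval commands below (adjust to your NeMo-Skills version):
ns prepare_data comp-math-24-25 hle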
Run CoT evaluations¶
ns eval \
--cluster=local \
--model=nvidia/OpenMath-Nemotron-1.5B \
--server_type=sglang \
--server_gpus=1 \
--output_dir=/workspace/openmath-nemotron-1.5b-eval-cot \
--benchmarks=comp-math-24-25:64 \
++prompt_config=generic/math \
++inference.tokens_to_generate=32768 \
++inference.temperature=0.6
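As mentioned above, if you are running on a Slurm cluster instead of locally, the main change is pointing --cluster at your Slurm cluster config; the name slurm below is just a placeholder for whatever your config is called:
--cluster=slurm \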
For hle-math it is necessary to run an LLM-as-a-judge step to get accurate evaluation results. We use the Qwen2.5-32B-Instruct model as the judge, which can be specified as follows.
ns eval \
--cluster=local \
--model=nvidia/OpenMath-Nemotron-1.5B \
--server_type=sglang \
--server_gpus=1 \
--output_dir=/workspace/openmath-nemotron-1.5b-eval-cot \
--benchmarks=hle:64 \
--split=math \
++prompt_config=generic/math \
++inference.tokens_to_generate=32768 \
++inference.temperature=0.6 \
--judge_generation_type=math_judge \
--judge_model=Qwen/Qwen2.5-32B-Instruct \
--judge_server_gpus=4 \
--judge_server_type=sglang
Alternatively, you can use an API model like gpt-4o as the judge, but the results might be different. You need to define OPENAI_API_KEY to use such a model. To switch to an OpenAI model such as gpt-4o, change the last three lines of the above command to:
--judge_model=gpt-4o \
--judge_server_type=openai \
--judge_server_address=https://api.openai.com/v1
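Whichever OpenAI model you use, make sure the API key is exported in the shell from which you launch the command, e.g.:
export OPENAI_API_KEY=<your key>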
This evaluation will take a very long time unless you run it on a Slurm cluster.
To print the metrics, run:
ns summarize_results /workspace/openmath-nemotron-1.5b-eval-cot/eval-results/comp-math-24-25 --metric_type math --cluster local
ns summarize_results /workspace/openmath-nemotron-1.5b-eval-cot/eval-results/hle --metric_type math --cluster local
This should print the metrics, including both symbolic and judge evaluation results.
Run TIR evaluations¶
To get TIR evaluation numbers, replace the generation command with the following:
ns eval \
--cluster=local \
--model=nvidia/OpenMath-Nemotron-1.5B \
--server_type=sglang \
--output_dir=/workspace/openmath-nemotron-1.5b-eval-tir \
--benchmarks=comp-math-24-25:64 \
--server_gpus=1 \
--num_jobs=1 \
--with_sandbox \
++code_tags=openmath \
++prompt_config=openmath/tir \
++use_completions_api=True \
++inference.tokens_to_generate=32768 \
++inference.temperature=0.6 \
++code_execution=true \
++server.code_execution.add_remaining_code_executions=true \
++total_code_executions_in_prompt=8
The only exception is OpenMath-Nemotron-14B-Kaggle, for which you should use the following options instead:
ns eval \
--cluster=local \
--model=nvidia/OpenMath-Nemotron-14B-Kaggle \
--server_type=sglang \
--output_dir=/workspace/openmath-nemotron-14b-kaggle-eval-tir \
--benchmarks=comp-math-24-25:64 \
--server_gpus=1 \
--num_jobs=1 \
--with_sandbox \
++code_tags=openmath \
++prompt_config=generic/math \
++use_completions_api=True \
++inference.tokens_to_generate=32768 \
++inference.temperature=0.6 \
++code_execution=true
All other commands are the same as in the CoT part.
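For example, to print the TIR metrics you can run the same summarize_results command pointed at the TIR output directory (assuming the same eval-results layout as for the CoT runs):
ns summarize_results /workspace/openmath-nemotron-1.5b-eval-tir/eval-results/comp-math-24-25 --metric_type math --cluster local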
Run GenSelect evaluations¶
Here is a sample command to run GenSelect evaluation:
ns genselect \
--preprocess_args="++input_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results-judged/hle" \
--model=nvidia/OpenMath-Nemotron-1.5B \
--output_dir=/workspace/openmath-nemotron-1.5b-eval-cot/self_genselect_hle \
--cluster=local \
--server_type=sglang \
--server_gpus=1 \
--num_random_seeds=64
The output folder will have three folders (apart from the log folders):
- comparison_instances: the folder where input instances for GenSelect are kept.
- comparison_judgment: outputs of GenSelect judgments.
- hle/math: folder with outputs based on GenSelect's judgments. If dataset is not specified in the command, we create a folder with the name math.
To print the metrics, run summarize_results on the GenSelect output folder.
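A minimal sketch, assuming the outputs land under the hle folder described in the list above (adjust the path if they are nested differently, e.g. under hle/math):
ns summarize_results /workspace/openmath-nemotron-1.5b-eval-cot/self_genselect_hle/hle --metric_type math --cluster local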