Model evaluation

Here are the commands you can run to reproduce our evaluation numbers. The commands below use the OpenMath-Nemotron-1.5B model as an example. We assume you have /workspace defined in your cluster config and are executing all commands from that folder locally. Adjust the commands accordingly if you are running on slurm or using different paths.

Note

For small benchmarks such as AIME24 and AIME25 (30 problems each), it is expected to see significant variation across different evaluation reruns. We've seen differences as large as 6% even for results averaged across 64 generations. So please don't expect to see exactly the same numbers as presented in our paper, but they should be within 3-6% of the reported results.

Download models

Get the model from HF, e.g.

pip install -U "huggingface_hub[cli]"
huggingface-cli download nvidia/OpenMath-Nemotron-1.5B --local-dir OpenMath-Nemotron-1.5B
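
If you also plan to run the OpenMath-Nemotron-14B-Kaggle evaluations described later, you can download it the same way. The repo id below is an assumption that follows the same nvidia/ naming pattern as above:

huggingface-cli download nvidia/OpenMath-Nemotron-14B-Kaggle --local-dir OpenMath-Nemotron-14B-Kaggle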

Convert to TensorRT-LLM

Convert the model to TensorRT-LLM format. This is optional, but highly recommended for more exact results and faster inference. If you skip it, replace --server_type trtllm with --server_type vllm (or sglang) in the commands below and change the model path to /workspace/OpenMath-Nemotron-1.5B (a sketch of this variant is shown after the first evaluation command below).

ns convert \
    --cluster=local \
    --input_model=/workspace/OpenMath-Nemotron-1.5B \
    --output_model=/workspace/openmath-nemotron-1.5b-trtllm \
    --convert_from=hf \
    --convert_to=trtllm \
    --model_type=qwen \
    --num_gpus=1 \
    --hf_model_name=nvidia/OpenMath-Nemotron-1.5B \
    --max_input_len 50000 \
    --max_seq_len 50000

We convert with a longer sequence length since the HLE-math benchmark has a few very long prompts. You can change the number of GPUs if you have more than one, but don't use more than 4 for the 1.5B and 7B models.
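
For reference, converting the 14B-Kaggle model used in the TIR section could look roughly like the sketch below. The GPU count matches the --server_gpus=1 used in its evaluation command later, and the other settings mirror the command above; treat them as assumptions and adjust for your hardware:

ns convert \
    --cluster=local \
    --input_model=/workspace/OpenMath-Nemotron-14B-Kaggle \
    --output_model=/workspace/openmath-nemotron-14b-kaggle-trtllm \
    --convert_from=hf \
    --convert_to=trtllm \
    --model_type=qwen \
    --num_gpus=1 \
    --hf_model_name=nvidia/OpenMath-Nemotron-14B-Kaggle \
    --max_input_len 50000 \
    --max_seq_len 50000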

Prepare evaluation data

python -m nemo_skills.dataset.prepare comp-math-24-25 hle

Run CoT evaluations

ns eval \
    --cluster=local \
    --model=/workspace/openmath-nemotron-1.5b-trtllm \
    --server_type=trtllm \
    --output_dir=/workspace/openmath-nemotron-1.5b-eval-cot \
    --benchmarks=comp-math-24-25:64 \
    --server_gpus=1 \
    --num_jobs=1 \
    --skip_greedy \
    ++prompt_template=qwen-instruct \
    ++prompt_config=generic/math \
    ++inference.tokens_to_generate=32768 \
    ++inference.temperature=0.6
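
If you skipped the TensorRT-LLM conversion, the same evaluation with a vLLM backend would look roughly like this sketch (only the server type and model path change, as noted above):

ns eval \
    --cluster=local \
    --model=/workspace/OpenMath-Nemotron-1.5B \
    --server_type=vllm \
    --output_dir=/workspace/openmath-nemotron-1.5b-eval-cot \
    --benchmarks=comp-math-24-25:64 \
    --server_gpus=1 \
    --num_jobs=1 \
    --skip_greedy \
    ++prompt_template=qwen-instruct \
    ++prompt_config=generic/math \
    ++inference.tokens_to_generate=32768 \
    ++inference.temperature=0.6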

ns eval \
    --cluster=local \
    --model=/workspace/openmath-nemotron-1.5b-trtllm \
    --server_type=trtllm \
    --output_dir=/workspace/openmath-nemotron-1.5b-eval-cot \
    --benchmarks=hle:64 \
    --server_gpus=1 \
    --num_jobs=1 \
    --skip_greedy \
    --split=math \
    ++prompt_template=qwen-instruct \
    ++prompt_config=generic/math \
    ++inference.tokens_to_generate=32768 \
    ++inference.temperature=0.6

This will take a very long time unless you run on a slurm cluster. If running on slurm, you can set --num_jobs to a bigger number or to -1 to run each benchmark on a separate node (see the sketch below). The number of GPUs needs to match what you used in the conversion command.
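
For example, a slurm run of the first benchmark could look roughly like this sketch; the cluster name slurm is an assumption standing in for whatever you called it in your cluster config, and --num_jobs=-1 puts each benchmark on a separate node as described above:

ns eval \
    --cluster=slurm \
    --model=/workspace/openmath-nemotron-1.5b-trtllm \
    --server_type=trtllm \
    --output_dir=/workspace/openmath-nemotron-1.5b-eval-cot \
    --benchmarks=comp-math-24-25:64 \
    --server_gpus=1 \
    --num_jobs=-1 \
    --skip_greedy \
    ++prompt_template=qwen-instruct \
    ++prompt_config=generic/math \
    ++inference.tokens_to_generate=32768 \
    ++inference.temperature=0.6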

For comp-math-24-25, our symbolic checker is good enough, so we can see the results right away by running

ns summarize_results /workspace/openmath-nemotron-1.5b-eval-cot/eval-results/comp-math-24-25 --metric_type math --cluster local
------------------- comp-math-24-25-all --------------------
evaluation_mode | num_entries | symbolic_correct | no_answer
majority@64     | 256         | 58.20%           | 0.00%
pass@64         | 256         | 75.39%           | 0.00%
pass@1[64]      | 256         | 44.59%           | 0.00%


------------------ comp-math-24-25-aime25 ------------------
evaluation_mode | num_entries | symbolic_correct | no_answer
majority@64     | 30          | 66.67%           | 0.00%
pass@64         | 30          | 83.33%           | 0.00%
pass@1[64]      | 30          | 50.52%           | 0.00%


------------------ comp-math-24-25-aime24 ------------------
evaluation_mode | num_entries | symbolic_correct | no_answer
majority@64     | 30          | 80.00%           | 0.00%
pass@64         | 30          | 80.00%           | 0.00%
pass@1[64]      | 30          | 62.60%           | 0.00%


---------------- comp-math-24-25-hmmt-24-25 ----------------
evaluation_mode | num_entries | symbolic_correct | no_answer
majority@64     | 196         | 53.57%           | 0.00%
pass@64         | 196         | 73.47%           | 0.00%
pass@1[64]      | 196         | 40.92%           | 0.00%

For hle-math it's necessary to run an LLM-as-a-judge step to get accurate evaluation results. We used Qwen2.5-32B-Instruct, which you can run with the following command, assuming you have the model downloaded and converted locally or on a cluster (see the sketch below if you don't).
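
If you don't have the judge model yet, downloading and converting it could look roughly like this sketch. The output path matches the one used in the judge command below; the model type, GPU count (matching --server_gpus=4), and sequence lengths are assumptions following the same pattern as the conversion above, so adjust them and the paths to your setup:

huggingface-cli download Qwen/Qwen2.5-32B-Instruct --local-dir Qwen2.5-32B-Instruct

ns convert \
    --cluster=local \
    --input_model=/workspace/Qwen2.5-32B-Instruct \
    --output_model=/trt_models/qwen2.5-32b-instruct \
    --convert_from=hf \
    --convert_to=trtllm \
    --model_type=qwen \
    --num_gpus=4 \
    --hf_model_name=Qwen/Qwen2.5-32B-Instruct \
    --max_input_len 50000 \
    --max_seq_len 50000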

ns generate \
    --generation_type=math_judge \
    --cluster=local \
    --model=/trt_models/qwen2.5-32b-instruct \
    --server_type=trtllm \
    --server_gpus=4 \
    --output_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results-judged/hle \
    ++input_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results/hle

Alternatively, you can use an API model like gpt-4o, but the results might be different. You need to define OPENAI_API_KEY for the command below to work.
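
For example, you can export it in the shell where you launch the command (using your own key):

export OPENAI_API_KEY=<your key>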

ns generate \
    --generation_type=math_judge \
    --cluster=local \
    --model=gpt-4o \
    --server_type=openai \
    --server_address=https://api.openai.com/v1 \
    --output_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results-judged/hle \
    ++input_dir=/workspace/openmath-nemotron-1.5b-eval-cot/eval-results/hle

To print the metrics, run

ns summarize_results /workspace/openmath-nemotron-1.5b-eval-cot/eval-results-judged/hle --metric_type math --cluster local

This should print the metrics including both symbolic and judge evaluation.

------------------------------------------------ hle -----------------------------------------------
evaluation_mode | num_entries | symbolic_correct | judge_correct | both_correct | any_correct | no_answer
majority@64     | 975         | 0.82%            | 5.41%         | 0.72%        | 5.41%       | 0.00%
pass@64         | 975         | 14.05%           | 38.36%        | 13.85%       | 38.56%      | 0.00%
pass@1[64]      | 975         | 1.18%            | 5.41%         | 3.06%        | 3.53%       | 0.00%

Run TIR evaluations

To get TIR evaluation numbers, replace the generation commands with the following

ns eval \
    --cluster=local \
    --model=/workspace/openmath-nemotron-1.5b-trtllm \
    --server_type=trtllm \
    --output_dir=/workspace/openmath-nemotron-1.5b-eval-tir \
    --benchmarks=comp-math-24-25:64 \
    --server_gpus=1 \
    --num_jobs=1 \
    --skip_greedy \
    ++prompt_template=openmath-instruct \
    ++prompt_config=openmath/tir \
    ++inference.tokens_to_generate=32768 \
    ++inference.temperature=0.6 \
    ++code_execution=true \
    ++server.code_execution.add_remaining_code_executions=true \
    ++total_code_executions_in_prompt=8

The only exception is OpenMath-Nemotron-14B-Kaggle, for which you should use the following options instead

ns eval \
    --cluster=local \
    --model=/workspace/openmath-nemotron-14b-kaggle-trtllm \
    --server_type=trtllm \
    --output_dir=/workspace/openmath-nemotron-14b-kaggle-eval-tir \
    --benchmarks=comp-math-24-25:64 \
    --server_gpus=1 \
    --num_jobs=1 \
    --skip_greedy \
    ++prompt_template=openmath-instruct \
    ++prompt_config=generic/math \
    ++inference.tokens_to_generate=32768 \
    ++inference.temperature=0.6 \
    ++code_execution=true

All other commands are the same as in the CoT part.
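
For example, summarizing the comp-math-24-25 TIR results would look like this, assuming the eval-results directory layout mirrors the CoT run above:

ns summarize_results /workspace/openmath-nemotron-1.5b-eval-tir/eval-results/comp-math-24-25 --metric_type math --cluster local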

Run GenSelect evaluations

Coming soon!