Math (natural language)

This section details how to evaluate natural language math benchmarks. For all benchmarks in this group, the goal is to find an answer to a math problem. Typically, a large language model (LLM) is instructed to put the answer (a number or an expression) inside a \boxed{} field during the generation step.
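
For example, an illustrative model solution might end like this, and 42 is the answer that gets extracted and scored:

    Multiplying the two counts gives 6 * 7 = 42, so the answer is \boxed{42}.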

How we extract answers

After running inference, by default we extract the answer from the last \boxed{} field in the generated solution. This is consistent with our default generic/math prompt config.
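
For illustration only (this is not the nemo_skills implementation), the idea behind extracting the last \boxed{} field can be sketched as: find the last occurrence of \boxed{ and then match braces, so that nested braces inside the answer are handled.

def extract_last_boxed(solution: str) -> str | None:
    """Illustrative sketch only: return the contents of the last \\boxed{...} field."""
    marker = r"\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None  # no \boxed{} field in the solution
    # Walk forward, counting braces, so nested {} inside the answer are kept intact.
    idx = start + len(marker)
    depth = 1
    answer_start = idx
    while idx < len(solution) and depth > 0:
        if solution[idx] == "{":
            depth += 1
        elif solution[idx] == "}":
            depth -= 1
        idx += 1
    if depth != 0:
        return None  # unbalanced braces: extraction failed
    return solution[answer_start : idx - 1]

print(extract_last_boxed(r"So the answer is \boxed{\frac{1}{2}}."))  # -> \frac{1}{2}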

We also support arbitrary regex-based extraction. For example, if you use a custom prompt that asks the LLM to put the answer after Final answer: at the end of the solution, you can use the following parameters to match the extraction logic to that prompt:

    --extra_eval_args="++eval_config.extract_from_boxed=False ++eval_config.extract_regex='Final answer: (.+)$'"
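
The same overrides can also be passed through the Python API. The sketch below assumes the --extra_eval_args flag maps to an eval() keyword argument of the same name, mirroring the other CLI parameters:

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(""),
    cluster="local",
    output_dir="/workspace/test-eval",
    server_type="openai",
    model="meta/llama-3.1-8b-instruct",
    server_address="https://integrate.api.nvidia.com/v1",
    benchmarks="math",
    # Assumed keyword argument matching the --extra_eval_args CLI flag.
    extra_eval_args="++eval_config.extract_from_boxed=False ++eval_config.extract_regex='Final answer: (.+)$'",
)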

Warning

Most LLMs are trained to put the answer to math problems inside a \boxed{} field. Even if you ask for a different answer format in the prompt, many models will not follow that instruction. We thus generally do not recommend changing the extraction logic for these benchmarks.

Using LLM-as-a-judge

While most answers can be compared using a symbolic checker, some require an LLM-as-a-judge to evaluate the equivalence of expressions.

We enable LLM-as-a-judge by default for the following benchmarks

This default setup uses GPT-4.1 and thus requires OPENAI_API_KEY to be defined.

You can customize the judge model or enable LLM-as-a-judge for a new benchmark like this:

ns eval \
    --cluster=local \
    --server_type=openai \
    --output_dir=/workspace/test-eval \
    --model=meta/llama-3.1-8b-instruct \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=math \
    --judge_model=nvidia/llama-3.1-nemotron-ultra-253b-v1 \
    --judge_server_address=https://integrate.api.nvidia.com/v1 \
    --judge_server_type=openai \
    --judge_generation_type=math_judge

The same command through the Python API:

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(""),
    cluster="local",
    output_dir="/workspace/test-eval",
    server_type="openai",
    model="meta/llama-3.1-8b-instruct",
    server_address="https://integrate.api.nvidia.com/v1",
    benchmarks="math",
    judge_model="nvidia/llama-3.1-nemotron-ultra-253b-v1",
    judge_server_address="https://integrate.api.nvidia.com/v1",
    judge_server_type="openai",
    judge_generation_type="math_judge",
)

To self-host the judge model instead (here Qwen2.5-32B-Instruct served with SGLang on 2 GPUs):

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(""),
    cluster="local",
    output_dir="/workspace/test-eval",
    server_type="openai",
    model="meta/llama-3.1-8b-instruct",
    server_address="https://integrate.api.nvidia.com/v1",
    benchmarks="math",
    judge_model="Qwen/Qwen2.5-32B-Instruct",
    judge_server_type="sglang",
    judge_server_gpus=2,
    judge_generation_type="math_judge",
)

You can see the full list of supported judge parameters by running ns eval --help | grep "judge".

Note

The judge task is fairly simple: it only needs to compare the expected and predicted answers in the context of the problem; it does not need to check the full solution for correctness. By default, we use the judge/math prompt for the judge.

Setting default judge parameters

Alternatively, you can define the judge parameters directly within the dataset itself. This is useful for specifying different judge parameters for each benchmark or when adding new benchmarks. For example, in a benchmark's __init__.py file, you can add default LLM-as-a-judge model parameters:

# Default judge model and server used for this benchmark.
JUDGE_PIPELINE_ARGS = {
    "model": "o3-mini-20250131",
    "server_type": "openai",
    "server_address": "https://api.openai.com/v1",
}
# Extra arguments for the judge generation (prompt config, generation key, etc.).
JUDGE_ARGS = "++prompt_config=judge/hle ++generation_key=judgement ++add_generation_stats=False"

You can take a look at the hle benchmark or the omni-math benchmark as examples.

The hle example uses the regular generation module, while omni-math uses the special math_judge generation type. The math judge generation module is almost the same as the main generation module, except that it changes the default prompt config and generation key, and adds prefill functionality to skip the judge when the answer cannot be extracted or when the answers match exactly.

Generally, you should use math_judge whenever the model is expected to put the answer in a \boxed{} field. Since this is not the case for our default hle prompt, we have to use the regular generation module there.

In either case, you can always customize the judge prompt by setting a new ++prompt_config, as well as any other judge inference parameters (e.g., ++system_message).
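
For example, here is a sketch of overriding the judge prompt through the Python API. It assumes the judge overrides are passed via an extra_judge_args parameter mirroring the other judge_* options (check ns eval --help | grep "judge" for the exact name); the system message is purely illustrative:

from nemo_skills.pipeline.cli import wrap_arguments, eval

eval(
    ctx=wrap_arguments(""),
    cluster="local",
    output_dir="/workspace/test-eval",
    server_type="openai",
    model="meta/llama-3.1-8b-instruct",
    server_address="https://integrate.api.nvidia.com/v1",
    benchmarks="math",
    judge_model="nvidia/llama-3.1-nemotron-ultra-253b-v1",
    judge_server_address="https://integrate.api.nvidia.com/v1",
    judge_server_type="openai",
    judge_generation_type="math_judge",
    # Assumed parameter name; the overrides themselves use the same ++key=value syntax as above.
    extra_judge_args="++prompt_config=judge/math ++system_message='You are a strict math grader.'",
)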

Supported benchmarks

aime25

aime24

hmmt_feb25

brumo25

comp-math-24-25

omni-math

math

math-500

gsm8k

amc23

college_math

gaokao2023en

math-odyssey

minerva_math

olympiadbench

algebra222

asdiv

gsm-plus

mawps

svamp

beyond-aime