Math (natural language)

We support a variety of natural language math benchmarks. For all benchmarks in this group, the task is to find the answer to a math problem. This is typically a number or an expression that the LLM is instructed to put inside a \boxed{} field.
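
For example, a generated solution might end like this (the problem and numbers are purely illustrative):

    ... multiplying the two counts gives 6 * 7 = 42 arrangements.
    The final answer is \boxed{42}.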

By default, all benchmarks in this group use the generic/math prompt config.

How we compare answers

Most answers in these benchmarks can be compared using a symbolic checker, but a few require an LLM-as-a-judge. By default, those benchmarks use GPT-4.1 and thus require OPENAI_API_KEY to be defined. If you want to host a local judge model instead, you can override the judge parameters like this:

    --judge_model=Qwen/Qwen2.5-32B-Instruct
    --judge_server_type=sglang
    --judge_server_gpus=2

You can see the full list of supported judge parameters by running ns eval --help | grep "judge".
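
Putting it together, a complete evaluation command with a self-hosted judge might look like the sketch below. The --cluster, --model, --server_type, --server_gpus, --benchmarks and --output_dir values are placeholders for your own setup:

    ns eval \
        --cluster=local \
        --model=/workspace/my-model \
        --server_type=vllm \
        --server_gpus=8 \
        --benchmarks=omni-math \
        --output_dir=/workspace/eval-results \
        --judge_model=Qwen/Qwen2.5-32B-Instruct \
        --judge_server_type=sglang \
        --judge_server_gpus=2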

Note

The judge task is fairly simple: it only needs to compare the expected and predicted answers in the context of the problem. It does not need to check the full solution for correctness. By default, we use the judge/math prompt for the judge.
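
For illustration only (the exact wording lives in the judge/math prompt config), the judge's job amounts to answering a question of this shape:

    Problem: <the original math problem>
    Expected answer: 1, 2
    Predicted answer: x = 1 or x = 2
    Are these answers equivalent in the context of the problem? (Yes/No)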

The following benchmarks require LLM-as-a-judge:

How we extract answers

By default, we extract the answer from the last \boxed{} field in the generated solution (so if a solution contains several \boxed{} entries, only the final one counts). This is consistent with our default generic/math prompt config.

We also support arbitrary regex-based extraction. For example, if you use a custom prompt that asks the LLM to put the answer after "Final answer:" at the end of the solution, you can use these parameters to match the extraction logic to that prompt:

    --extra_eval_args="++eval_config.extract_from_boxed=False ++eval_config.extract_regex='Final answer: (.+)$'"
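
With this configuration, a solution ending like the following (illustrative) would yield 42 as the extracted answer:

    ... so the requested sum is 42.
    Final answer: 42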

Warning

Most LLMs are trained to put answers to math problems inside a \boxed{} field. Even if you ask for a different answer format in the prompt, many models will not follow that instruction. We thus generally do not recommend changing the extraction logic for these benchmarks.

Supported benchmarks

aime25
aime24
hmmt_feb25
brumo25
comp-math-24-25
omni-math
math
math-500
gsm8k
amc23
college_math
gaokao2023en
math-odyssey
minerva_math
olympiadbench
algebra222
asdiv
gsm-plus
mawps
svamp