# Math (natural language)

We support a variety of natural language math benchmarks. For all benchmarks in this group, the task is to find an answer to a math problem, typically a number or an expression that the LLM is instructed to put inside a `\boxed{}` field.

By default, all benchmarks in this group use the `generic/math` prompt config.
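
If you launch evaluations with the `ns eval` command, an invocation might look roughly like the sketch below. This is illustrative only: the flag values are placeholders and the exact set of options depends on your setup and version, so check `ns eval --help` for what is actually supported.

```bash
# Illustrative sketch: evaluate a model on aime24 with the default generic/math prompt config.
# Flag values are placeholders; verify available options with `ns eval --help`.
ns eval \
    --cluster=local \
    --server_type=vllm \
    --model=/workspace/my-model \
    --benchmarks=aime24 \
    --output_dir=/workspace/eval-results
```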

## How we compare answers

Most answers in these benchmarks can be compared using a symbolic checker, but a few require using LLM-as-a-judge. By default those benchmarks will use GPT-4.1 as the judge and thus require `OPENAI_API_KEY` to be defined. If you want to host a local judge model instead, you can change benchmark parameters like this:
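
The snippet below is a sketch of what that could look like; the `--judge_*` parameter names are an assumption here, so confirm the exact names with the command that follows.

```bash
# Illustrative sketch: host the judge locally instead of calling the OpenAI API.
# The judge parameter names are assumptions; verify them via `ns eval --help | grep "judge"`.
ns eval \
    --cluster=local \
    --server_type=vllm \
    --model=/workspace/my-model \
    --benchmarks=<benchmark> \  # one of the benchmarks that requires LLM-as-a-judge
    --output_dir=/workspace/eval-results \
    --judge_model=/workspace/local-judge-model \
    --judge_server_type=vllm \
    --judge_server_gpus=2
```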
You can see the full list of supported judge parameters by running `ns eval --help | grep "judge"`.
Note
The judge task is fairly simple: it only needs to compare the expected and predicted answers in the context of the problem. It does not need to check the full solution for correctness. By default we use the `judge/math` prompt for the judge.
The following benchmarks require LLM-as-a-judge:

## How we extract answers

By default we will extract the answer from the last `\boxed{}` field in the generated solution. This is consistent with our default `generic/math` prompt config.
We also support arbitrary regex-based extraction. E.g., if you use a custom prompt that asks the LLM to put the answer after `Final answer:` at the end of the solution, you can use these parameters to match the extraction logic to that prompt:
```bash
--extra_eval_args="++eval_config.extract_from_boxed=False ++eval_config.extract_regex='Final answer: (.+)$'"
```
Warning
Most LLMs are trained to put the answer to math problems inside a `\boxed{}` field. Even if you ask for a different answer format in the prompt, many models might not follow that instruction. We thus generally do not recommend changing the extraction logic for these benchmarks.

## Supported benchmarks

### aime25

- Benchmark is defined in `nemo_skills/dataset/aime25/__init__.py`
- Original benchmark source is here.

### aime24

- Benchmark is defined in `nemo_skills/dataset/aime24/__init__.py`
- Original benchmark source is here.

### hmmt_feb25

- Benchmark is defined in `nemo_skills/dataset/hmmt_feb25/__init__.py`
- Original benchmark source is here.

### brumo25

- Benchmark is defined in `nemo_skills/dataset/brumo25/__init__.py`
- Original benchmark source is here.

### comp-math-24-25

- Benchmark is defined in `nemo_skills/dataset/comp-math-24-25/__init__.py`
- This benchmark was created by us! See https://arxiv.org/abs/2504.16891 for more details.

### omni-math

- Benchmark is defined in `nemo_skills/dataset/omni-math/__init__.py`
- Original benchmark source is here.

### math

- Benchmark is defined in `nemo_skills/dataset/math/__init__.py`
- Original benchmark source is here.

### math-500

- Benchmark is defined in `nemo_skills/dataset/math-500/__init__.py`
- Original benchmark source is here.

### gsm8k

- Benchmark is defined in `nemo_skills/dataset/gsm8k/__init__.py`
- Original benchmark source is here.

### amc23

- Benchmark is defined in `nemo_skills/dataset/amc23/__init__.py`
- Original benchmark source is here.

### college_math

- Benchmark is defined in `nemo_skills/dataset/college_math/__init__.py`
- Original benchmark source is here.

### gaokao2023en

- Benchmark is defined in `nemo_skills/dataset/gaokao2023en/__init__.py`
- Original benchmark source is here.

### math-odyssey

- Benchmark is defined in `nemo_skills/dataset/math-odyssey/__init__.py`
- Original benchmark source is here.

### minerva_math

- Benchmark is defined in `nemo_skills/dataset/minerva_math/__init__.py`
- Original benchmark source is here.

### olympiadbench

- Benchmark is defined in `nemo_skills/dataset/olympiadbench/__init__.py`
- Original benchmark source is here.

### algebra222

- Benchmark is defined in `nemo_skills/dataset/algebra222/__init__.py`
- Original benchmark source is here.

### asdiv

- Benchmark is defined in `nemo_skills/dataset/asdiv/__init__.py`
- Original benchmark source is here.

### gsm-plus

- Benchmark is defined in `nemo_skills/dataset/gsm-plus/__init__.py`
- Original benchmark source is here.

### mawps

- Benchmark is defined in `nemo_skills/dataset/mawps/__init__.py`
- Original benchmark source is here.

### svamp

- Benchmark is defined in `nemo_skills/dataset/svamp/__init__.py`
- Original benchmark source is here.