# Evaluation
We support many popular benchmarks, and it's easy to add new ones. The following categories of benchmarks are supported:
- Math (natural language): e.g. aime24, aime25, hmmt_feb25
- Math (formal language): e.g. minif2f, proofnet, putnam-bench
- Code: e.g. swe-bench, livecodebench
- Scientific knowledge: e.g., hle, scicode, gpqa
- Instruction following: e.g. ifbench, ifeval
- Long-context: e.g. ruler, mrcr
- Tool-calling: e.g. bfcl_v3
See nemo_skills/dataset, where each folder corresponds to a benchmark we support.
Here is how to run evaluation (using an API model as an example, but the same command works with self-hosted models, both locally and on Slurm). Make sure that `/workspace` is mounted inside of your cluster config.
## Preparing data
You need to run the following commands to prepare the data. If you're only interested in a subset of datasets (e.g. only math-related or code-related), run with `--dataset_groups ...`, and if you only need a couple of specific datasets, list them directly, as shown below.
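For example (a sketch of the invocations; the group name below matches the categories above, but check `ns prepare_data --help` for the exact options available in your version):

```bash
# prepare data for all supported benchmarks
ns prepare_data

# only prepare benchmarks from a particular group
ns prepare_data --dataset_groups math

# only prepare a couple of specific benchmarks
ns prepare_data gsm8k human-eval
```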
Note

If you have the repo cloned locally, the data files will be available inside nemo_skills/dataset/<benchmark>/<split>.jsonl. If you installed from pip, they will be downloaded to wherever the package is installed, which you can figure out by running the command below.
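One way to locate the installed package (a one-liner sketch; any equivalent way of finding the install location works):

```bash
python -c "import nemo_skills; print(nemo_skills.__path__[0])"
```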
Some benchmarks (e.g. ruler) require extra parameters to be passed to the data preparation script, so you'd need to explicitly call `ns prepare_data <benchmark name>` for them. E.g. for ruler you can use

```bash
ns prepare_data ruler --setup=llama_128k --tokenizer_path=meta-llama/Llama-3.1-8B-Instruct --max_seq_length=131072
```
## Running evaluation
```bash
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=meta/llama-3.1-8b-instruct \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=gsm8k,human-eval \
    --output_dir=/workspace/test-eval
```
This will run evaluation on gsm8k and human-eval for the Llama 3.1 8B model. If you're running on Slurm, by default each benchmark is run in a separate job, but you can control this with the `--num_jobs` parameter.
After the evaluation is done, the metrics will be printed to stdout and also saved in `<output_dir>/metrics.json`.
Note
You can always re-compute and re-print the metrics for all benchmarks in a folder by running the command below.
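A sketch of that command, assuming the local evaluation from above (check `ns summarize_results --help` for all available options):

```bash
ns summarize_results --cluster=local /workspace/test-eval
```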
The above command should print the following
```
---------------------------------------- gsm8k ----------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1          | 1319        | 180        | 164         | 81.96%           | 4.93%

------------------------------------------- human-eval -------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | passing_base_tests | passing_plus_tests
pass@1          | 164         | 199        | 29          | 64.63%             | 60.37%
```
## Using multiple samples
You can add `:<num repeats>` after the benchmark name to repeat evaluation multiple times with high temperature, which can be used for majority voting or estimating pass@k. E.g. if we run with
```bash
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=meta/llama-3.1-8b-instruct \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks gsm8k:4,human-eval:4 \
    --output_dir=/workspace/test-eval
```
you will see the following output after summarizing results
```
---------------------------------------- gsm8k -----------------------------------------
evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-4] | 1319        | 180        | 680         | 80.44%           | 6.31%
majority@4       | 1319        | 180        | 680         | 88.40%           | 0.15%
pass@4           | 1319        | 180        | 680         | 93.63%           | 0.15%

-------------------------------------------- human-eval -------------------------------------------
evaluation_mode  | num_entries | avg_tokens | gen_seconds | passing_base_tests | passing_plus_tests
pass@1[avg-of-4] | 164         | 215        | 219         | 64.63%             | 59.30%
pass@4           | 164         | 215        | 219         | 79.27%             | 74.39%
```
## Customizing evaluations

You can customize any part of the evaluation. Here are a few examples.
### Customize inference parameters
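Extra `++` arguments are passed through to the generation script, so sampling parameters can be overridden directly. The parameter names below (e.g. `inference.temperature`) are assumptions based on the generation config; check nemo_skills/inference/generate.py for the exact names in your version.

```bash
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=meta/llama-3.1-8b-instruct \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=gsm8k \
    --output_dir=/workspace/test-eval \
    ++inference.temperature=0.6 \
    ++inference.tokens_to_generate=2048
```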
### Customize prompt

You can reference any prompt from nemo_skills/prompt/config without the .yaml extension, e.g. to reference nemo_skills/prompt/config/generic/math.yaml use `++prompt_config=generic/math`.
You can also commit a new prompt file to your git repository and reference it by its path. Note that in this case the full path needs to end with `.yaml`!
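As an illustration (the custom prompt path below is hypothetical), overriding the prompt for the earlier gsm8k run could look like this:

```bash
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=meta/llama-3.1-8b-instruct \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=gsm8k \
    --output_dir=/workspace/test-eval \
    ++prompt_config=/workspace/my-prompts/custom-math.yaml
```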
### Customize evaluation parameters

Different benchmarks have different evaluation options that you can customize. Here is an example of how to adjust the code execution timeout for the scicode benchmark.
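A sketch of what this could look like; evaluation options are passed as `++` overrides (the same mechanism as `++eval_type` in the benchmark defaults), but the exact option name for scicode's code-execution timeout is an assumption here, so check the scicode evaluation config for the real name.

```bash
# ++eval_config.timeout is an assumed option name for scicode's code execution timeout (in seconds)
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=meta/llama-3.1-8b-instruct \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=scicode \
    --output_dir=/workspace/test-eval \
    ++eval_config.timeout=120
```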
## Using data on cluster
Some benchmarks (e.g. ruler) have very large input datasets, and it's inefficient to prepare them on a local machine and keep uploading them to the cluster with every evaluation job. Instead, you can prepare them on the cluster directly. To do that, run the prepare_data command with the `--data_dir` and `--cluster` options, e.g.
```bash
ns prepare_data \
    --data_dir=/workspace/ns-data \
    --cluster=slurm \
    ruler --setup llama_128k --tokenizer_path meta-llama/Llama-3.1-8B-Instruct --max_seq_length 131072
```
Then during evaluation, you'd need to provide the same `data_dir` argument and the data will be read from the cluster directly. You can also use the `NEMO_SKILLS_DATA_DIR` environment variable instead of an explicit argument.
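For example, exporting the variable before launching the job (assuming the same path as in the prepare_data command above) should be equivalent to passing data_dir explicitly:

```bash
export NEMO_SKILLS_DATA_DIR=/workspace/ns-data
```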
Here is an example evaluation command for ruler that uses the Python API and the data_dir parameter:
```python
from nemo_skills.pipeline.cli import eval, wrap_arguments

eval(
    ctx=wrap_arguments(""),
    cluster="slurm",
    model="/hf_models/Meta-Llama-3.1-8B-Instruct",
    server_type="sglang",
    output_dir="/workspace/eval-ruler",
    data_dir="/workspace/ns-data",
    benchmarks="ruler.llama_128k",
    server_gpus=8,
    expname="eval-ruler",
)
```
## How the benchmarks are defined
Each benchmark exists as a separate folder inside nemo_skills/dataset. Inside each of those folders there needs to be a prepare.py script which can be run to download and format the benchmark data into a .jsonl input file (or files, if it supports multiple splits) that our scripts can understand. There also needs to be an __init__.py that defines some default variables for that benchmark, such as the prompt config, evaluation type, metrics class and a few more. This information is then used inside the eval pipeline to initialize the default setup (but all arguments can be changed from the command line).
Let's look at gsm8k to understand a bit more how each part of the evaluation works.
Inside nemo_skills/dataset/gsm8k/__init__.py we see the following:

```python
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = 'math'
METRICS_TYPE = "math"
EVAL_ARGS = "++eval_type=math"
GENERATION_ARGS = "++prompt_config=generic/math"
```
The prompt config and default generation arguments are passed to nemo_skills/inference/generate.py, and the default eval args are passed to nemo_skills/evaluation/evaluate_results.py. The dataset group is used by nemo_skills/dataset/prepare.py to help download only benchmarks from a particular group if the `--dataset_groups` parameter is used. Finally, the metrics type is used to pick a metrics class from nemo_skills/evaluation/metrics/map_metrics.py, which is called at the end of the evaluation to compute the final scores.
## Adding new benchmarks
To create a new benchmark, follow this process:

- Create a new folder inside nemo_skills/dataset.
- Create a prepare.py file that outputs <split>.jsonl input file(s) in the same folder when run. It can take extra arguments if required.
- Create an __init__.py file in that folder that contains the default configuration for that benchmark. You typically need to specify only the default prompt config in GENERATION_ARGS and the evaluation / metric parameters. But if extra customization is needed for the generation, you can provide a fully custom generation module. See scicode or swe-bench for examples of this. A minimal sketch is shown below.
- Create a new evaluation class (if you cannot re-use an existing one).
- Create a new metrics class (if you cannot re-use an existing one).
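As a rough sketch (the benchmark name and defaults here are hypothetical and simply mirror the gsm8k example above), a minimal __init__.py for a new math-style benchmark could look like this:

```python
# nemo_skills/dataset/my-benchmark/__init__.py (hypothetical benchmark name)
# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = 'math'
METRICS_TYPE = "math"
EVAL_ARGS = "++eval_type=math"
GENERATION_ARGS = "++prompt_config=generic/math"
```

and a prepare.py that writes the input file(s) next to it. The exact field names your evaluation type expects are an assumption here; check an existing benchmark like gsm8k for the real schema.

```python
# nemo_skills/dataset/my-benchmark/prepare.py (hypothetical; field names are assumptions)
import json
from pathlib import Path


def main():
    # in a real benchmark this data would be downloaded from the original source
    examples = [
        {"problem": "What is 2 + 2?", "expected_answer": "4"},
    ]
    output_file = Path(__file__).parent / "test.jsonl"
    with open(output_file, "w") as fout:
        for example in examples:
            fout.write(json.dumps(example) + "\n")


if __name__ == "__main__":
    main()
```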