
Evaluation

We support many popular benchmarks, and it's easy to add new ones.

See nemo_skills/dataset, where each folder corresponds to a supported benchmark.

Here is how to run an evaluation (using an API model as an example, but the same commands work with self-hosted models, both locally and on slurm). Make sure that /workspace is mounted in your cluster config.

Preparing data

You need to run the following command to prepare the data.

ns prepare_data

If you're only interested in a subset of datasets (e.g. only math-related or code-related ones), run with --dataset_groups ..., and if you only need a few specific datasets, list them directly, e.g.

ns prepare_data aime24 aime25 gpqa livecodebench

Note

If you have the repo cloned locally, the data files will be available at nemo_skills/dataset/<benchmark>/<split>.jsonl. If you installed from pip, they will be downloaded to wherever the package is installed, which you can find by running

python -c "import nemo_skills; print(nemo_skills.__path__)"

Some benchmarks (e.g. ruler) require extra parameters to be passed to the prepare_data script, so you need to call ns prepare_data <benchmark name> for them explicitly, e.g. for ruler you can use

ns prepare_data ruler --setup=llama_128k --tokenizer_path=meta-llama/Llama-3.1-8B-Instruct --max_seq_length=131072
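
Once prepared, each <split>.jsonl file contains one JSON object per line, so you can quickly sanity-check it with a few lines of Python (the path below assumes a local repo checkout and that the gsm8k test split was prepared):

import json

# peek at the first prepared gsm8k example (adjust the path for your setup)
with open("nemo_skills/dataset/gsm8k/test.jsonl") as f:
    first_example = json.loads(next(f))

print(first_example.keys())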

Running evaluation

ns eval \
    --cluster=local \
    --server_type=openai \
    --model=meta/llama-3.1-8b-instruct \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=gsm8k,human-eval \
    --output_dir=/workspace/test-eval

This will run evaluation on gsm8k and human-eval for the Llama 3.1 8B model. When running on slurm, each benchmark is run in a separate job by default, but you can control this with the --num_jobs parameter.

After the evaluation is done, the metrics will be printed to stdout and also saved in <output_dir>/metrics.json.
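
Since metrics.json is plain JSON, you can also read it programmatically. A minimal sketch (the path matches the --output_dir used above; the exact nesting of keys depends on which benchmarks and evaluation modes you ran):

import json

# load the metrics saved by the eval pipeline
with open("/workspace/test-eval/metrics.json") as f:
    metrics = json.load(f)

# pretty-print everything rather than assuming a particular structure
print(json.dumps(metrics, indent=2))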

Note

You can always re-compute and re-print the metrics for all benchmarks in a folder by running

ns summarize_results --cluster=<cluster> <output_dir>

In either case, you should see output like the following

---------------------------------------- gsm8k ----------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1          | 1319        | 180        | 164         | 81.96%           | 4.93%


------------------------------------------- human-eval -------------------------------------------
evaluation_mode | num_entries | avg_tokens | gen_seconds | passing_base_tests | passing_plus_tests
pass@1          | 164         | 199        | 29          | 64.63%             | 60.37%

Using multiple samples

You can add :<num repeats> after a benchmark name to repeat the evaluation multiple times with a high temperature, which can be used for majority voting or for estimating pass@k. E.g. if we run with

ns eval \
    --cluster=local \
    --server_type=openai \
    --model=meta/llama-3.1-8b-instruct \
    --server_address=https://integrate.api.nvidia.com/v1 \
    --benchmarks=gsm8k:4,human-eval:4 \
    --output_dir=/workspace/test-eval

you will see the following output after summarizing results

---------------------------------------- gsm8k -----------------------------------------
evaluation_mode  | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-4] | 1319        | 180        | 680         | 80.44%           | 6.31%
majority@4       | 1319        | 180        | 680         | 88.40%           | 0.15%
pass@4           | 1319        | 180        | 680         | 93.63%           | 0.15%


-------------------------------------------- human-eval -------------------------------------------
evaluation_mode  | num_entries | avg_tokens | gen_seconds | passing_base_tests | passing_plus_tests
pass@1[avg-of-4] | 164         | 215        | 219         | 64.63%             | 59.30%
pass@4           | 164         | 215        | 219         | 79.27%             | 74.39%
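
For intuition, here is a small self-contained sketch of what these aggregations mean for a single problem (an illustration only, not the library's exact implementation, which also handles answer extraction and answer comparison):

from collections import Counter

# four sampled answers for one problem, plus the reference answer
samples = ["72", "72", "68", "72"]
reference = "72"

# pass@1[avg-of-4]: average correctness of the individual samples
pass_at_1_avg = sum(s == reference for s in samples) / len(samples)

# majority@4: the most common answer across samples is compared to the reference
majority_answer, _ = Counter(samples).most_common(1)[0]
majority_correct = majority_answer == reference

# pass@4: correct if at least one of the samples matches the reference
pass_at_4 = any(s == reference for s in samples)

print(pass_at_1_avg, majority_correct, pass_at_4)  # 0.75 True True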

Customizing evaluations

You can customize any part of the evaluation. Here are a few examples.

Customize inference parameters

    ++inference.temperature=0.6
    ++inference.top_p=0.8
    ++inference.tokens_to_generate=32768

Customize prompt

You can reference any prompt from nemo_skills/prompt/config without the .yaml extension, e.g. to reference nemo_skills/prompt/config/generic/math.yaml use

    ++prompt_config=generic/math

You can also commit a new prompt file to your git repository and reference it as

    ++prompt_config=/nemo_run/code/<path to committed .yaml>

Note that in this case the full path needs to end with .yaml!

Customize evaluation parameters

Different benchmarks have different evaluation options that you can customize. Here is an example of how to adjust the code execution timeout for the scicode benchmark

    --extra_eval_args="++eval_config.timeout=60"
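
The inference and prompt overrides above can also be passed through wrap_arguments when using the Python API. A sketch modeled on the ruler example in the next section (the model path and other parameter values are placeholders):

from nemo_skills.pipeline.cli import eval, wrap_arguments

eval(
    # Hydra-style ++ overrides go into the wrapped argument string
    ctx=wrap_arguments(
        "++inference.temperature=0.6 "
        "++inference.top_p=0.8 "
        "++prompt_config=generic/math "
    ),
    cluster="local",
    server_type="sglang",
    server_gpus=8,
    model="/hf_models/Meta-Llama-3.1-8B-Instruct",  # placeholder model path
    benchmarks="gsm8k",
    output_dir="/workspace/test-eval-custom",
    expname="eval-custom",
)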

Using data on cluster

Some benchmarks (e.g. ruler) have very large input datasets, and it's inefficient to prepare them on a local machine and re-upload them to the cluster with every evaluation job. Instead, you can prepare them on the cluster directly. To do that, run the prepare_data command with the --data_dir and --cluster options, e.g.

ns prepare_data \
    --data_dir=/workspace/ns-data \
    --cluster=slurm \
    ruler --setup=llama_128k --tokenizer_path=meta-llama/Llama-3.1-8B-Instruct --max_seq_length=131072

Then, during evaluation, provide the same --data_dir argument and the data will be read directly from the cluster. You can also set the NEMO_SKILLS_DATA_DIR environment variable instead of passing the argument explicitly.

Here is an example evaluation command for ruler that uses the Python API and the data_dir parameter

from nemo_skills.pipeline.cli import eval, wrap_arguments

eval(
    ctx=wrap_arguments(""),
    cluster="slurm",
    model="/hf_models/Meta-Llama-3.1-8B-Instruct",
    server_type="sglang",
    output_dir="/workspace/eval-ruler",
    data_dir="/workspace/ns-data",
    benchmarks="ruler.llama_128k",
    server_gpus=8,
    expname="eval-ruler",
)

How the benchmarks are defined

Each benchmark exists as a separate folder inside nemo_skills/dataset. Inside each of those folders there needs to be a prepare.py script which can be run to download and format the benchmark data into a .jsonl input file (or files, if the benchmark supports multiple splits) that our scripts can understand. There also needs to be an __init__.py that defines some default variables for that benchmark, such as the prompt config, evaluation type, metrics class and a few more.

This information is then used inside the eval pipeline to initialize the default setup (but all arguments can be changed from the command line).

Let's look at gsm8k to understand a bit more how each part of the evaluation works.

Inside nemo_skills/dataset/gsm8k/__init__.py we see the following

# settings that define how evaluation should be done by default (all can be changed from cmdline)
DATASET_GROUP = 'math'
METRICS_TYPE = "math"
EVAL_ARGS = "++eval_type=math"
GENERATION_ARGS = "++prompt_config=generic/math"

The prompt config and default generation arguments are passed to nemo_skills/inference/generate.py, and the default eval args are passed to nemo_skills/evaluation/evaluate_results.py. The dataset group is used by nemo_skills/dataset/prepare.py to download only the benchmarks from a particular group when the --dataset_groups parameter is used. Finally, the metrics type is used to pick a metrics class from nemo_skills/evaluation/metrics/map_metrics.py, which is called at the end of the evaluation to compute the final scores.

Adding new benchmarks

To create a new benchmark, follow this process:

  1. Create a new folder inside nemo_skills/dataset.
  2. Create a prepare.py file that outputs <split>.jsonl input file(s) in the same folder when run. It can take extra arguments if required.
  3. Create an __init__.py file in that folder that contains the default configuration for that benchmark. You typically only need to specify the default prompt config in GENERATION_ARGS and the evaluation / metric parameters (see the sketch after this list). But if extra customization is needed for the generation, you can provide a fully custom generation module. See scicode or swe-bench for examples of this.
  4. Create a new evaluation class (if you cannot re-use an existing one).
  5. Create a new metrics class (if you cannot re-use an existing one).
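
As a rough sketch, a minimal nemo_skills/dataset/my_bench folder could look like the following (my_bench is a made-up name, and the fields in the toy data are an assumption based on math benchmarks; match whatever your chosen prompt config and evaluator expect). The __init__.py mirrors the gsm8k defaults shown above

# nemo_skills/dataset/my_bench/__init__.py
DATASET_GROUP = 'math'
METRICS_TYPE = "math"
EVAL_ARGS = "++eval_type=math"
GENERATION_ARGS = "++prompt_config=generic/math"

and prepare.py writes the prepared split(s) next to it

# nemo_skills/dataset/my_bench/prepare.py
import json
from pathlib import Path

# toy data; a real prepare.py would download and reformat the actual benchmark
examples = [{"problem": "What is 2 + 2?", "expected_answer": "4"}]

output_file = Path(__file__).parent / "test.jsonl"
with open(output_file, "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")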