# Robustness Evaluation
`robust_eval` is built on top of `ns eval` to evaluate a model on multiple benchmarks using different prompt variations. The purpose is to measure and analyze model robustness against changes in the prompt.
## How to Use robust_eval
The usage is nearly identical to the standard `ns eval`, but there is one additional required argument, `prompt_set_config`, which specifies the set of prompts to evaluate for each benchmark. All other arguments are standard `ns eval` arguments that are used during evaluation: any argument of `ns eval` can be passed here, and any benchmark supported by NeMo-Skills is also supported here.

The `prompt_set_config.yaml` file must map each benchmark name (as a key) to a list of paths to prompt configuration files (as its value). Note that the benchmark names provided in the `--benchmarks` argument must match the keys in the `prompt_set_config` file (see the example in `nemo_skills/prompt/config/robustness/prompt_set_config.yaml`):
```yaml
gpqa:
  - <path_to_prompt_1>
  - <path_to_prompt_2>
  ...
comp-math-24-25:
  - <path_to_prompt_1>
  - <path_to_prompt_2>
  ...
```
## Run Command Example
The command below launches an `ns eval` on GPQA and comp-math-24-25 for every prompt specified in `prompt_set_config.yaml`, across 16 random seeds (because of the `:16` suffix in `benchmarks`; the number of seeds can be set differently for every benchmark). `nemo_skills/prompt/config/robustness/prompt_set_config.yaml` already contains 20 prompt paths for both GPQA and comp-math-24-25.

Note that every prompt is a separate job, and all parameters are shared across jobs. For example, if `num_jobs` is specified, it will launch `num_jobs` jobs per prompt, not overall.
```python
from nemo_skills.pipeline.cli import wrap_arguments, robust_eval

robust_eval(
    ctx=wrap_arguments(
        "++inference.temperature=0.6 "
        "++inference.top_p=0.95 "
    ),
    # or nemo_skills/prompt/config/robustness/prompt_set_config, or an absolute path to a .yaml file
    prompt_set_config='robustness/prompt_set_config',
    cluster=cluster_config,
    model="Qwen/Qwen3-8B",
    server_type='vllm',
    output_dir="/workspace/robustness_eval/Qwen3-8B/",
    benchmarks="gpqa:16,comp-math-24-25:16",
    server_gpus=2,
    server_nodes=1,
    expname='test',
)
```
An example of the expected `output_dir` structure:

```
output_dir/
├── gpqa/
│   ├── prompt1/
│   │   ├── output-rs0.jsonl
│   │   └── output-rs1.jsonl
│   └── prompt2/
│       ├── output-rs0.jsonl
│       └── output-rs1.jsonl
└── comp-math-24-25/
    └── ...
```
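If you want to sanity-check the raw generations before the summary step, you can walk this layout directly. The snippet below is only a sketch: the directory names follow the structure above, while the `output_dir` value and the sample-counting logic are assumptions for illustration, not part of the pipeline.

```python
from pathlib import Path

# Illustrative only: count completed samples per benchmark/prompt/seed,
# following the benchmark/promptN/output-rs*.jsonl layout shown above.
output_dir = Path("/workspace/robustness_eval/Qwen3-8B")

for benchmark_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
    for prompt_dir in sorted(benchmark_dir.glob("prompt*")):
        for jsonl_file in sorted(prompt_dir.glob("output-rs*.jsonl")):
            with open(jsonl_file) as f:
                num_samples = sum(1 for line in f if line.strip())
            print(f"{benchmark_dir.name}/{prompt_dir.name}/{jsonl_file.name}: "
                  f"{num_samples} samples")
```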
## Summarize Robustness
When all evaluations are done, `summarize_robustness` is automatically launched to process the generated files and produce aggregated metrics.
The following metrics are calculated:
- Aggregated Benchmark Statistics: for each benchmark, across all prompts and seeds, the script calculates:
    - `min`, `max`, `avg`, `std`: statistical metrics across all runs per benchmark.
    - `CR` (Consistency Rate): the average rate of agreement of model predictions on the same datapoint across different runs.
    - `prompt_sensitivity`: the standard deviation of the average scores across different prompts, which measures how sensitive the model's accuracy is to prompt variations (see the sketch after this list).
- Per-Prompt Statistics: for each prompt, across all random seeds, the script calculates:
    - `min`, `max`, `avg`, `std`: statistical metrics for a single prompt across seeds.
    - `CR` (Consistency Rate): the average rate of agreement of model predictions on the same question across different runs.
    - `no_answer`: the proportion of questions for which the model did not provide an answer (useful for finding prompts that break the model's predictions).
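As a rough illustration of how these aggregates relate to each other (this is not the actual `summarize_robustness` implementation), the sketch below computes `min`/`max`/`avg`/`std` over all runs of one benchmark and `prompt_sensitivity` as the standard deviation of the per-prompt averages. The accuracy values are made up.

```python
import statistics

# Hypothetical per-prompt, per-seed accuracies (in %) for one benchmark.
scores = {
    "prompt_1": [50.5, 52.0, 51.2, 49.8],
    "prompt_2": [48.1, 47.5, 49.0, 48.7],
    "prompt_3": [53.0, 52.4, 54.1, 53.3],
}

# Benchmark-level statistics across all prompts and seeds.
all_runs = [acc for per_seed in scores.values() for acc in per_seed]
print(f"min={min(all_runs):.2f} max={max(all_runs):.2f} "
      f"avg={statistics.mean(all_runs):.2f} std={statistics.stdev(all_runs):.2f}")

# prompt_sensitivity: std of the per-prompt average scores, i.e. how much the
# average accuracy moves when only the prompt changes (sample std is used here;
# the real implementation may differ in this detail).
per_prompt_avg = [statistics.mean(per_seed) for per_seed in scores.values()]
print(f"prompt_sensitivity={statistics.stdev(per_prompt_avg):.2f}")
```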
Below is an example of the output generated by `summarize_robustness`, written to `output_dir/summarize_robustness/main*.log`. All calculated metrics are also saved to an `output_dir/metrics.json` file.

First, all evaluations are aggregated across benchmarks. Then there is a breakdown per benchmark and per prompt across the 16 seeds.
```
dataset            |   min |   max |   avg |  std |    CR | prompt_sensitivity
-------------------------------------------------------------------------------------------
comp-math-24-25@32 | 48.05 | 53.91 | 51.10 | 1.60 | 55.25 | 0.34
gpqa@32            | 50.51 | 60.61 | 55.51 | 2.44 | 65.15 | 0.77

------------------------------------- comp-math-24-25 -------------------------------------
prompt@16  |   min |   max |   avg |  std |    CR | no_answer
----------------------------------------------------------------------------------
prompt_1   | 48.05 | 53.91 | 50.76 | 1.61 | 55.48 | 1.56
...
prompt_21  | 48.44 | 53.91 | 51.44 | 1.52 | 55.13 | 1.66

-------------------------------------- gpqa --------------------------------------
prompt@16  |   min |   max |   avg |  std |    CR | no_answer
----------------------------------------------------------------------------------
prompt_1   | 50.51 | 60.61 | 54.73 | 2.68 | 64.20 | 3.03
...
prompt_21  | 53.54 | 60.10 | 56.28 | 1.88 | 66.59 | 2.78
```
## Consistency Rate
For each datapoint, all predictions are collected and the similarity between all possible pairs of predictions is computed. The consistency rate is the number of equivalent prediction pairs divided by the total number of prediction pairs (N choose 2), and the reported `CR` is the average of this value over all datapoints. Example: for a datapoint with predictions [A, A, C] across 3 files, the compared pairs are (A, A), (A, C), and (A, C), so the consistency rate is 1/3 = 33.33%. The consistency rate is proposed in *Improving the Robustness of Large Language Models via Consistency Alignment*.
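A minimal sketch of this computation for a single datapoint is shown below. Plain string equality stands in for the answer-equivalence check that the actual evaluation uses, and averaging over datapoints is left out.

```python
from itertools import combinations

def consistency_rate(predictions):
    """Fraction of prediction pairs that agree, out of all C(N, 2) pairs."""
    pairs = list(combinations(predictions, 2))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Example from the text: [A, A, C] -> pairs (A, A), (A, C), (A, C) -> 1/3
print(consistency_rate(["A", "A", "C"]))  # 0.3333...
```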
## Notes on Usage
- There are 20 Math and 20 MCQ prompts in the `prompt/config/robustness` folder, along with the `prompt_set_config.yaml`. All prompts except the MCQ AAI prompt require the `\boxed{}` format for the answer. They can be used with any Math (AIME, comp-math-24-25, etc.) and MCQ (GPQA, MMLU-Pro, etc.) benchmarks.
- `robust_eval` can be used with any dataset that NeMo-Skills supports, but `summarize_robustness` only works on Math and MCQ datasets (for now). If you need evaluations on multiple prompts for other datasets, you can still use `robust_eval`; however, the `summarize_robustness` part won't work.