# Model evaluation
Here are the commands you can run to reproduce our evaluation numbers.
We assume you have `/workspace` defined in your cluster config and are executing all commands locally from that folder. Adjust the commands accordingly if you are running on Slurm or using different paths.
## Download models
Get the models from Hugging Face (HF), for example as sketched below.
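A minimal sketch using `huggingface-cli`; the repository names, model sizes, and target folders below are assumptions, so adjust them to the checkpoints you actually want to evaluate:

```bash
# Download the four model sizes into /workspace.
# Repository names and sizes are assumptions -- edit to match the released checkpoints.
for size in 1.5B 7B 14B 32B; do
    huggingface-cli download nvidia/OpenReasoning-Nemotron-${size} \
        --local-dir /workspace/OpenReasoning-Nemotron-${size}
done
```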
To evaluate HLE we used the Qwen2.5-32B-Instruct model as a judge. You will need to download it as well if you want to reproduce the HLE numbers.
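For example (the local folder is an assumption; place it wherever your evaluation config expects the judge):

```bash
# Download the HLE judge model (target folder is an assumption).
huggingface-cli download Qwen/Qwen2.5-32B-Instruct \
    --local-dir /workspace/Qwen2.5-32B-Instruct
```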
## Prepare evaluation data
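The benchmark data needs to be prepared before running the evaluation. A minimal sketch, assuming the `ns prepare_data` command from the `ns` CLI and using `hmmt_feb25` (the benchmark referenced below) as an example; add whatever other benchmarks the eval script covers:

```bash
# Prepare benchmark data; add the other benchmarks you plan to evaluate.
ns prepare_data hmmt_feb25
```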
## Run evaluation
**Note:** The current script only supports GenSelect evaluation for math benchmarks. We will add instructions and commands for GenSelect for code and science in the next few days.
We provide an evaluation script in `recipes/openreasoning/eval.py`. It will run evaluation on all benchmarks and for all 4 model sizes. You can modify it directly to change evaluation settings or to evaluate only a subset of models or benchmarks.
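A typical invocation might simply be the following (whether the script takes additional arguments, e.g. a cluster config name, is an assumption; check its argument parser before running):

```bash
# Kick off the full evaluation sweep defined in the recipe script.
python recipes/openreasoning/eval.py
```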
After the evaluation is finished, you can find `metrics.json` files with the full scores in each benchmark folder.
To additionally view GenSelect scores, run the following command for each benchmark and model size. E.g. for the 14B model and the `hmmt_feb25` benchmark, run
```bash
ns summarize_results /workspace/open-reasoning-evals/14B-genselect/hmmt_feb25/math/ --metric_type math
```
which should print the scores. Here `majority@64` is the number we are looking for. Note that this is the majority across GenSelect runs, not across the original generations.
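If you want to print GenSelect scores for every model size and benchmark in one go, a small loop over `ns summarize_results` works; the size and benchmark lists below are assumptions, so edit them to match what you actually evaluated:

```bash
# Summarize GenSelect results for each model size / benchmark pair.
# Both lists are assumptions -- adjust to your runs.
for size in 1.5B 7B 14B 32B; do
    for benchmark in hmmt_feb25; do
        ns summarize_results \
            /workspace/open-reasoning-evals/${size}-genselect/${benchmark}/math/ \
            --metric_type math
    done
done
```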