# Long-context
More details are coming soon!
## Supported benchmarks
### ruler
- Benchmark is defined in `nemo_skills/dataset/ruler/__init__.py`
- Original benchmark source is here.
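Datasets need to be prepared before evaluation. A minimal sketch for ruler, assuming the `ns prepare_data` entry point; the setup name, tokenizer, and sequence length below are illustrative assumptions, not verified values:

```bash
# Sketch: prepare the ruler dataset (setup name, tokenizer, and length are assumptions).
# ruler data is generated per tokenizer and context length, so these are
# supplied at preparation time.
ns prepare_data ruler \
    --setup=ruler_128k_qwen \
    --tokenizer_path=Qwen/Qwen2.5-7B-Instruct-1M \
    --max_seq_length=131072
```

Evaluation then follows the same `ns eval` pattern shown for aalcr below, pointing `--benchmarks` at the prepared setup.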
### mrcr
- Benchmark is defined in `nemo_skills/dataset/mrcr/__init__.py`
- Original benchmark source is here.
### aalcr
- Benchmark is defined in `nemo_skills/dataset/aalcr/__init__.py`
- Original benchmark source is here, and the scores reported by AA are here.
#### Data preparation
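A minimal sketch of preparing the aalcr data with `ns prepare_data`; the `--data_dir` value is an illustrative assumption and should match the one later passed to `ns eval`:

```bash
# Sketch: prepare the aalcr dataset (path is an assumption).
ns prepare_data aalcr --data_dir=/workspace/ns-data
```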
You can also prepare a subset of the data with a limited context window.

#### Running evaluation
This setup follows the official AA-LCR implementation. The judge model is Qwen3-235B-A22B-Instruct-2507, and the evaluation is repeated four times.
```bash
model=Qwen2.5-7B-Instruct-1M
ns eval \
    --cluster=<cluster_config> \
    --data_dir=/workspace/ns-data \
    --server_gpus=8 \
    --server_type=sglang \
    --model=/hf_models/$model \
    --benchmarks=aalcr:4 \
    --output_dir=/workspace/aalcr/$model \
    --judge_model='/hf_models/Qwen3-235B-A22B-Instruct-2507' \
    --judge_server_type='sglang' \
    --judge_server_gpus=8 \
    --server_args='--disable-cuda-graph'
```