# Long-context
More details are coming soon!
## Supported benchmarks
### ruler
- Benchmark is defined in `nemo_skills/dataset/ruler/__init__.py`
- Original benchmark source is here.
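Datasets need to be prepared before evaluation. A minimal sketch for ruler, assuming the `ns prepare_data` entry point; the setup name, tokenizer, and sequence length below are illustrative assumptions, not verified values:

```bash
# Sketch: prepare the ruler dataset (setup name, tokenizer, and length are assumptions).
# ruler data is generated per tokenizer and context length, so these are
# supplied at preparation time.
ns prepare_data ruler \
    --setup=ruler_128k_qwen \
    --tokenizer_path=Qwen/Qwen2.5-7B-Instruct-1M \
    --max_seq_length=131072
```

Evaluation then follows the same `ns eval` pattern shown for aalcr below, pointing `--benchmarks` at the prepared setup.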
### mrcr
- Benchmark is defined in `nemo_skills/dataset/mrcr/__init__.py`
- Original benchmark source is here.
### aalcr
- Benchmark is defined in `nemo_skills/dataset/aalcr/__init__.py`
- Original benchmark source is here, and the scores reported by AA are here.
#### Data preparation
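A minimal sketch of preparing the aalcr data with `ns prepare_data`; the `--data_dir` value is an illustrative assumption and should match the one later passed to `ns eval`:

```bash
# Sketch: prepare the aalcr dataset (path is an assumption).
ns prepare_data aalcr --data_dir=/workspace/ns-data
```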
You can also prepare a subset of the data with a limited context window.

#### Running evaluation
This setup follows the official AA-LCR implementation. The judge model is Qwen3-235B-A22B-Instruct-2507, and the evaluation is repeated four times.
```bash
model=Qwen2.5-7B-Instruct-1M
ns eval \
    --cluster=<cluster_config> \
    --data_dir=/workspace/ns-data \
    --server_gpus=8 \
    --server_type=sglang \
    --model=/hf_models/$model \
    --benchmarks=aalcr:4 \
    --output_dir=/workspace/aalcr/$model \
    --judge_model='/hf_models/Qwen3-235B-A22B-Instruct-2507' \
    --judge_server_type='sglang' \
    --judge_server_gpus=8 \
    --server_args='--disable-cuda-graph'
```