
Long-context

More details are coming soon!

Supported benchmarks

ruler

mrcr

aalcr

Data preparation

ns prepare_data \
    --data_dir=/workspace/ns-data \
    --cluster=<cluster_config> \
    aalcr
You can also prepare a subset of the data with a limited context window by additionally passing:

    --max_context_window 100000 --setup test_100k
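Putting the two together, a subset preparation run might look like the following. The option names come from the note above; their exact placement alongside the other flags is an assumption:

```shell
ns prepare_data \
    --data_dir=/workspace/ns-data \
    --cluster=<cluster_config> \
    --max_context_window 100000 \
    --setup test_100k \
    aalcr
```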

Running evaluation

This setup follows the official AA-LCR implementation. The judge model is Qwen3-235B-A22B-Instruct-2507, and the evaluation is repeated four times.
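Since the evaluation is repeated, each repeat yields its own score and the reported number aggregates them. A minimal sketch of that aggregation, assuming a simple mean over the runs (the score values below are made up for illustration):

```python
from statistics import mean, stdev

def aggregate_runs(run_scores):
    """Average judge scores across repeated evaluation runs.

    run_scores: list of per-run accuracy values, one per repeat.
    Returns the mean and the sample standard deviation across runs.
    """
    avg = mean(run_scores)
    spread = stdev(run_scores) if len(run_scores) > 1 else 0.0
    return avg, spread

# Four hypothetical repeats of the aalcr evaluation.
avg, spread = aggregate_runs([0.62, 0.60, 0.63, 0.61])
print(f"score: {avg:.3f} +/- {spread:.3f}")
```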

model=Qwen2.5-7B-Instruct-1M
ns eval \
    --cluster=<cluster_config> \
    --data_dir=/workspace/ns-data \
    --server_gpus=8 \
    --server_type=sglang \
    --model=/hf_models/$model \
    --benchmarks=aalcr:4 \
    --output_dir=/workspace/aalcr/$model \
    --judge_model='/hf_models/Qwen3-235B-A22B-Instruct-2507' \
    --judge_server_type='sglang' \
    --judge_server_gpus=8 \
    --server_args='--disable-cuda-graph'
The results, including per-category scores, are stored in metrics.json. Detailed breakdowns by category and sequence length are also available via
ns summarize_results --cluster=<cluster_config> <folder_of_output_json>
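If you want to post-process the per-category scores yourself, metrics.json can also be read directly. The flat {category: score} schema used here is an assumption for illustration; the actual file produced by ns eval may nest results differently:

```python
import json
from pathlib import Path

def load_category_scores(metrics_path):
    """Load per-category scores from a metrics.json file.

    Assumes a flat {category: score} mapping; adjust the key
    lookup if your metrics.json nests scores under benchmark names.
    """
    data = json.loads(Path(metrics_path).read_text())
    return {name: float(score) for name, score in data.items()}

# Hypothetical example file with two made-up categories.
example = Path("metrics.json")
example.write_text(json.dumps({"reasoning": 0.58, "retrieval": 0.71}))
print(load_category_scores(example))
```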