Scientific knowledge

More details are coming soon!

Supported benchmarks

hle

scicode

Note

For scicode, we evaluate by default on the combined dev + test split (80 problems and 338 subtasks) for consistency with the AAI evaluation methodology. To evaluate on the test set only, use --split=test.

gpqa

mmlu-pro

mmlu

mmlu-redux