
Scientific knowledge

More details are coming soon!

Supported benchmarks

hle

  • Benchmark is defined in nemo_skills/dataset/hle/__init__.py
  • Original benchmark source is here.
  • The text split includes all non-image examples and is further subdivided into eng, chem, bio, cs, phy, math, human, and other. All of these splits currently contain only text data (see the example below).
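
For example, here is a minimal sketch of selecting one of these subsets at evaluation time, assuming the standard ns eval interface and that subsets are picked via the split parameter; the cluster, server, model, and output paths are placeholders for your own setup:

```bash
# Sketch: evaluate only the chemistry subset of HLE.
# --cluster/--server_type/--model/--output_dir are placeholders for your setup.
ns eval \
    --cluster=local \
    --server_type=vllm \
    --model=/workspace/my-model \
    --output_dir=/workspace/hle-eval \
    --benchmarks=hle \
    --split=chem
```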

SimpleQA

  • Benchmark is defined in nemo_skills/dataset/simpleqa/__init__.py
  • The original SimpleQA benchmark source code (OpenAI) is here and the leaderboard is here. SimpleQA-verified, an improved version from Google with 1,000 examples, is here.
  • To use SimpleQA-verified, set split=verified. To use the original SimpleQA, set split=test (see the example below).
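
A minimal sketch of switching between the two versions, assuming the same ns eval interface as above; the cluster, server, model, and output paths are placeholders:

```bash
# Sketch: evaluate on SimpleQA-verified; change --split=verified to
# --split=test to run the original SimpleQA instead.
ns eval \
    --cluster=local \
    --server_type=vllm \
    --model=/workspace/my-model \
    --output_dir=/workspace/simpleqa-eval \
    --benchmarks=simpleqa \
    --split=verified
```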

scicode

Note

For scicode, by default we evaluate on the combined dev + test split (80 problems and 338 subtasks) for consistency with the AAI evaluation methodology. To evaluate only on the test set, use --split=test.
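
A minimal sketch of restricting the evaluation to the test set, assuming the same ns eval interface as above (placeholders as before); omit --split to keep the default combined dev + test split:

```bash
# Sketch: evaluate scicode on the test split only.
ns eval \
    --cluster=local \
    --server_type=vllm \
    --model=/workspace/my-model \
    --output_dir=/workspace/scicode-eval \
    --benchmarks=scicode \
    --split=test
```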

gpqa

mmlu-pro

mmlu

mmlu-redux