Scientific knowledge

More details are coming soon!

Supported benchmarks

hle

Benchmark is defined in nemo_skills/dataset/hle/__init__.py
Original benchmark source is here.
The text split includes all non-image examples. It is further divided into the sub-splits eng, chem, bio, cs, phy, math, human, and other. Currently, all of these sub-splits contain only text data.
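As a quick illustration of the split layout described above, the following is a minimal sketch that records the listed names and checks whether a given split is one of the text-only ones. The constant and the helper `is_text_only_split` are hypothetical names introduced here for illustration; they are not part of nemo_skills.

```python
# Sub-split names exactly as listed in the description above.
HLE_TEXT_SUBSPLITS = ("eng", "chem", "bio", "cs", "phy", "math", "human", "other")


def is_text_only_split(name: str) -> bool:
    """Hypothetical helper: True for "text" or any of its listed sub-splits."""
    return name == "text" or name in HLE_TEXT_SUBSPLITS


if __name__ == "__main__":
    for candidate in ("text", "chem", "image"):
        print(candidate, is_text_only_split(candidate))
```

Any name outside this list is simply reported as not text-only; the sketch makes no claim about which other splits the benchmark defines.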

SimpleQA

Benchmark is defined in nemo_skills/dataset/simpleqa/__init__.py
Original benchmark source code for SimpleQA (OpenAI) is here and the leaderboard is here. SimpleQA-verified, an improved version from Google with 1,000 examples, is here.
To use SimpleQA-verified, set split=verified. To use the original version of SimpleQA, set split=test.
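The choice between the two variants comes down to a single split value. A minimal sketch of that mapping is shown below; the dictionary and the function `simpleqa_split` are hypothetical helpers for illustration, and only the split values `verified` and `test` come from the text above.

```python
# Variant name -> split value, as described above.
# The variant keys ("verified", "original") are labels chosen for this sketch.
SIMPLEQA_SPLITS = {
    "verified": "verified",  # SimpleQA-verified
    "original": "test",      # original SimpleQA
}


def simpleqa_split(variant: str) -> str:
    """Hypothetical helper: return the split value for a SimpleQA variant."""
    try:
        return SIMPLEQA_SPLITS[variant]
    except KeyError:
        known = ", ".join(sorted(SIMPLEQA_SPLITS))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {known}") from None


if __name__ == "__main__":
    print(f"split={simpleqa_split('verified')}")
    print(f"split={simpleqa_split('original')}")
```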

scicode

Note

By default, scicode is evaluated on the combined dev + test split (80 problems and 338 subtasks) for consistency with the AAI evaluation methodology. To evaluate only on the test set, use --split=test.
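The default-versus-override behavior in the note can be sketched as a small helper that returns the extra argument only when the test-only evaluation is wanted. The function `scicode_split_args` is a hypothetical name for illustration; only the `--split=test` flag is taken from the note, and the rest of the evaluation command is deliberately not shown.

```python
def scicode_split_args(test_only: bool = False) -> list[str]:
    """Hypothetical helper: extra CLI arguments for the scicode split.

    An empty list leaves the default in place (combined dev + test);
    test_only=True adds the override mentioned in the note.
    """
    return ["--split=test"] if test_only else []


if __name__ == "__main__":
    print(scicode_split_args())                # default: no override
    print(scicode_split_args(test_only=True))  # test set only
```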

Benchmark is defined in nemo_skills/dataset/scicode/__init__.py
Original benchmark source is here.

gpqa

Benchmark is defined in nemo_skills/dataset/gpqa/__init__.py
Original benchmark source is here.

mmlu-pro

Benchmark is defined in nemo_skills/dataset/mmlu-pro/__init__.py
Original benchmark source is here.

mmlu

Benchmark is defined in nemo_skills/dataset/mmlu/__init__.py
Original benchmark source is here.

mmlu-redux

Benchmark is defined in nemo_skills/dataset/mmlu-redux/__init__.py
Original benchmark source is here.