# Scientific knowledge
More details are coming soon!
## Supported benchmarks
### hle

- Benchmark is defined in `nemo_skills/dataset/hle/__init__.py`
- Original benchmark source is here.
- The `text` split includes all non-image examples. It is further divided into `eng`, `chem`, `bio`, `cs`, `phy`, `math`, `human`, and `other`. Currently, all of these splits contain only text data; see the sketch after this list for selecting one of them.
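
Below is a minimal sketch of selecting one of these sub-splits at evaluation time. It assumes the sub-splits are exposed through the same `--split` option used by the other benchmarks on this page; the cluster, server, and model flags are illustrative placeholders from the general `ns eval` workflow.

```bash
# Hypothetical HLE run restricted to the chemistry sub-split.
# Assumes the sub-splits above are selectable via --split; all other
# flags are placeholders to adapt to your own cluster/model setup.
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=gpt-4o \
    --server_address=https://api.openai.com/v1 \
    --benchmarks=hle \
    --split=chem \
    --output_dir=/workspace/eval-hle-chem
```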
### SimpleQA

- Benchmark is defined in `nemo_skills/dataset/simpleqa/__init__.py`
- Original benchmark source code for SimpleQA (OpenAI) is here and the leaderboard is here. An improved version with 1,000 examples from Google, SimpleQA-verified, is here.
- To use SimpleQA-verified, set `split=verified`. To use the original version of SimpleQA, set `split=test` (see the example after this list).
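
Switching between the two variants only changes the `split` value; everything else in this sketch is an illustrative placeholder:

```bash
# Evaluate on SimpleQA-verified; use --split=test for the original
# OpenAI release. Flags other than --split are placeholders.
ns eval \
    --cluster=local \
    --server_type=openai \
    --model=gpt-4o \
    --server_address=https://api.openai.com/v1 \
    --benchmarks=simpleqa \
    --split=verified \
    --output_dir=/workspace/eval-simpleqa-verified
```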
### scicode

**Note:** For scicode, by default we evaluate on the combined dev + test split (containing 80 problems and 338 subtasks) for consistency with the AAI evaluation methodology. If you want to evaluate only on the test set, use `--split=test` (see the example after the list below).

- Benchmark is defined in `nemo_skills/dataset/scicode/__init__.py`
- Original benchmark source is here.
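
As a sketch, the command below restricts scicode to the test split instead of the default combined dev + test split; the self-hosted server and model settings are illustrative placeholders:

```bash
# Evaluate scicode on the test split only (omit --split to keep the
# default combined dev + test split). Other flags are placeholders.
ns eval \
    --cluster=local \
    --server_type=vllm \
    --server_gpus=8 \
    --model=Qwen/Qwen2.5-72B-Instruct \
    --benchmarks=scicode \
    --split=test \
    --output_dir=/workspace/eval-scicode-test
```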
### gpqa

- Benchmark is defined in `nemo_skills/dataset/gpqa/__init__.py`
- Original benchmark source is here.

### mmlu-pro

- Benchmark is defined in `nemo_skills/dataset/mmlu-pro/__init__.py`
- Original benchmark source is here.

### mmlu

- Benchmark is defined in `nemo_skills/dataset/mmlu/__init__.py`
- Original benchmark source is here.

### mmlu-redux

- Benchmark is defined in `nemo_skills/dataset/mmlu-redux/__init__.py`
- Original benchmark source is here.