# Scientific knowledge
More details are coming soon!
## Supported benchmarks

### hle

- Benchmark is defined in `nemo_skills/dataset/hle/__init__.py`
- Original benchmark source is here.
### scicode

!!! note

    For scicode, we evaluate by default on the combined dev + test split (80 problems and 338 subtasks) for consistency with the AAI evaluation methodology. To evaluate only on the test set, use `--split=test` (see the example command below).
- Benchmark is defined in `nemo_skills/dataset/scicode/__init__.py`
- Original benchmark source is here.
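
To switch the split at evaluation time, the flag from the note above is passed to the evaluation command. A minimal sketch, assuming the standard `ns eval` entrypoint; the cluster, server type, model path, and output directory below are placeholder assumptions, not fixed values:

```bash
# Evaluate scicode on the test split only (the default is the combined dev + test split).
# Cluster, server type, model path, and output directory are placeholders - adjust to your setup.
ns eval \
    --cluster=local \
    --server_type=vllm \
    --model=/workspace/models/my-model \
    --benchmarks=scicode \
    --output_dir=/workspace/results/scicode-eval \
    --split=test
```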
### gpqa

- Benchmark is defined in `nemo_skills/dataset/gpqa/__init__.py`
- Original benchmark source is here.
### mmlu-pro

- Benchmark is defined in `nemo_skills/dataset/mmlu-pro/__init__.py`
- Original benchmark source is here.
### mmlu

- Benchmark is defined in `nemo_skills/dataset/mmlu/__init__.py`
- Original benchmark source is here.
### mmlu-redux

- Benchmark is defined in `nemo_skills/dataset/mmlu-redux/__init__.py`
- Original benchmark source is here.
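
Each of the definitions above lives under `nemo_skills/dataset/`, and the corresponding data generally has to be prepared once before a benchmark can be evaluated. A minimal sketch, assuming the standard `ns prepare_data` entrypoint; the particular benchmark names are just examples:

```bash
# Download and preprocess a few of the benchmarks from this page;
# the names match the dataset folders under nemo_skills/dataset/.
# Once prepared, they can be passed to --benchmarks in ns eval (see the scicode example above).
ns prepare_data gpqa mmlu-pro hle
```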