OpenReasoning

We released OpenReasoning-Nemotron: a suite of reasoning-capable large language models (LLMs) distilled from the DeepSeek R1 0528 671B model. Trained on a massive, high-quality dataset of solutions generated by DeepSeek R1 0528, our 7B, 14B, and 32B models achieve state-of-the-art performance for their respective sizes on a wide range of reasoning benchmarks across mathematics, science, and code. The models are available for download from Hugging Face (1.5B, 7B, 14B, 32B).

The foundation of these models is their dataset. We generated 5 million high-quality reasoning solutions by leveraging the powerful DeepSeek R1 0528 model across the domains of mathematics, coding, and science. We plan to release this dataset in the coming months, so that others can use it to improve the reasoning capabilities of their own models in these domains.
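As a rough illustration of how a distillation dataset like this can be produced, here is a minimal sketch that samples reasoning traces from a teacher model through an OpenAI-compatible endpoint. The endpoint, deployment name, prompt, and sampling settings are assumptions for illustration, not the released pipeline.

```python
# Illustrative only: endpoint, model name, and sampling settings are
# assumptions, not the actual data-generation pipeline.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def sample_teacher_solutions(problem: str, n: int = 4) -> list[str]:
    """Sample several full reasoning traces from the teacher model."""
    response = client.chat.completions.create(
        model="deepseek-r1-0528",  # hypothetical deployment name
        messages=[{"role": "user", "content": problem}],
        temperature=0.6,           # common sampling settings for R1-style models
        top_p=0.95,
        n=n,
    )
    return [choice.message.content for choice in response.choices]
```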

Evaluation results

Evaluation Results with pass@1

Our models demonstrate exceptional performance across a suite of challenging reasoning benchmarks. The 7B, 14B, and 32B models consistently set new state-of-the-art records for their size classes.

| Model | Artificial Analysis Index* | GPQA | MMLU-PRO | HLE | LiveCodeBench* | SciCode | AIME24 | AIME25 | HMMT Feb 25 |
|-------|----------------------------|------|----------|-----|----------------|---------|--------|--------|-------------|
| 1.5B  | 31.0 | 31.6 | 47.5 | 5.5  | 28.6 | 2.2  | 55.5 | 45.6 | 31.5 |
| 7B    | 54.7 | 61.1 | 71.9 | 8.3  | 63.3 | 16.2 | 84.7 | 78.2 | 63.5 |
| 14B   | 60.9 | 71.6 | 77.5 | 10.1 | 67.8 | 23.5 | 87.8 | 82.0 | 71.2 |
| 32B   | 64.3 | 73.1 | 80.0 | 11.9 | 70.2 | 28.5 | 89.2 | 84.0 | 73.8 |

* This is our estimation of the Artificial Analysis Intelligence Index, not an official score.

* LiveCodeBench version 6, date range 2408-2505 (August 2024 to May 2025).

Combining the work of multiple agents

OpenReasoning-Nemotron models can be used in a "heavy" mode by starting multiple parallel generations and combining them via generative solution selection (GenSelect). To add this "skill," we follow the original GenSelect training pipeline, except that we do not train on the selection summary but instead use the full reasoning trace of DeepSeek R1 0528 671B. We train the models to select the best solution only for math problems, but surprisingly find that this capability generalizes directly to code and science questions! With this "heavy" GenSelect inference mode, the OpenReasoning-Nemotron-32B model surpasses O3 (High) on math and coding benchmarks.
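A minimal sketch of this "heavy" flow is shown below, assuming an OpenAI-compatible server (e.g. vLLM) hosting the model. The selection prompt and the answer-parsing logic are illustrative assumptions, not the exact GenSelect template used in training.

```python
# Sketch of "heavy" GenSelect inference: sample k candidates in parallel,
# then prompt the model to select the best one. Prompt wording is illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "nvidia/OpenReasoning-Nemotron-32B"

def solve_heavy(problem: str, k: int = 8) -> str:
    # Step 1: sample k candidate solutions in parallel.
    candidates = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": problem}],
        temperature=0.6,
        top_p=0.95,
        n=k,
    )
    solutions = [c.message.content for c in candidates.choices]

    # Step 2: ask the model to compare the candidates and pick the best one.
    listing = "\n\n".join(f"[{i}]\n{s}" for i, s in enumerate(solutions))
    selection = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": (
                f"Problem:\n{problem}\n\nCandidate solutions:\n{listing}\n\n"
                "Judge the candidates and finish with the index of the best "
                "one in the form 'Best solution: <index>'."
            ),
        }],
        temperature=0.6,
        top_p=0.95,
    )
    reply = selection.choices[0].message.content
    # Naive parse of the final index; production code should be more robust.
    index = int(reply.rsplit("Best solution:", 1)[-1].split()[0].strip(".,"))
    return solutions[index]
```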

Evaluation Results with GenSelect

| Model | Benchmark | Pass@1 (Avg@64) | Majority@64 | GenSelect |
|-------|-----------|-----------------|-------------|-----------|
| 1.5B | AIME24 | 55.5 | 76.7 | 76.7 |
| 1.5B | AIME25 | 45.6 | 70.0 | 70.0 |
| 1.5B | HMMT Feb 25 | 31.5 | 46.7 | 53.3 |
| 7B | AIME24 | 84.7 | 93.3 | 93.3 |
| 7B | AIME25 | 78.2 | 86.7 | 93.3 |
| 7B | HMMT Feb 25 | 63.5 | 83.3 | 90.0 |
| 7B | LCB v6 2408-2505 | 63.4 | n/a | 67.7 |
| 14B | AIME24 | 87.8 | 93.3 | 93.3 |
| 14B | AIME25 | 82.0 | 90.0 | 90.0 |
| 14B | HMMT Feb 25 | 71.2 | 86.7 | 93.3 |
| 14B | LCB v6 2408-2505 | 67.9 | n/a | 69.1 |
| 32B | AIME24 | 89.2 | 93.3 | 93.3 |
| 32B | AIME25 | 84.0 | 90.0 | 93.3 |
| 32B | HMMT Feb 25 | 73.8 | 86.7 | 96.7 |
| 32B | LCB v6 2408-2505 | 70.2 | n/a | 75.3 |
| 32B | HLE | 11.8 | 13.4 | 15.5 |
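For reference, Pass@1 (Avg@64) averages correctness over all 64 samples, while Majority@64 keeps only the most frequent final answer. Here is a minimal sketch of both metrics, assuming the final answers have already been extracted and normalized (that extraction step is task-specific and omitted here).

```python
# Sketch of the two baseline metrics in the table, given k sampled final
# answers per problem; answer extraction and normalization are assumed done.
from collections import Counter

def pass_at_1_avg(answers: list[str], reference: str) -> float:
    """Pass@1 (Avg@k): average correctness over all k samples."""
    return sum(a == reference for a in answers) / len(answers)

def majority_at_k(answers: list[str], reference: str) -> bool:
    """Majority@k: correct iff the most frequent answer matches the reference."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == reference
```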

How to reproduce our results

Browse the sections below to see all commands needed to fully reproduce our results.

Please note that unless you have access to a large GPU cluster, some of the commands may take a very long time to complete!
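Before launching a full run, it can be useful to smoke-test a single checkpoint locally. Here is a minimal sketch using vLLM, assuming one GPU with enough memory for the chosen model size; this is not the full evaluation pipeline, and the prompt and generation budget are illustrative.

```python
# Quick local smoke test with vLLM; reasoning models may need a large
# max_tokens budget to finish their chain of thought.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/OpenReasoning-Nemotron-7B")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768)

outputs = llm.generate(
    ["What is the sum of the first 100 positive integers? Think step by step."],
    params,
)
print(outputs[0].outputs[0].text)
```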