Inference with gpt-oss-120b using stateful Python code execution

In this tutorial, you will learn how to run inference with the gpt-oss-120b model using built-in stateful Python code execution.

We will first reproduce the evaluation results on the AIME24 and AIME25 benchmarks (hitting 100% with majority voting!) and then extend this setup to run arbitrary synthetic data generation, with or without the Python tool.

----------------------------------------- aime24 ----------------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-16] | 30          | 13306      | 1645        | 96.46%           | 0.42%
majority@16       | 30          | 13306      | 1645        | 100.00%          | 0.00%
pass@16           | 30          | 13306      | 1645        | 100.00%          | 0.00%

----------------------------------------- aime25 ----------------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-16] | 30          | 14463      | 1717        | 96.67%           | 0.83%
majority@16       | 30          | 14463      | 1717        | 100.00%          | 0.00%
pass@16           | 30          | 14463      | 1717        | 100.00%          | 0.00%
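A run like this can be launched as a single NeMo-Skills job. The snippet below is a minimal sketch using the Python pipeline API; the cluster name, model ID, sampling override, and the with_sandbox flag are illustrative assumptions to adapt to your setup, not the exact configuration behind the numbers above.

    from nemo_skills.pipeline.cli import eval, wrap_arguments

    # Minimal sketch of the eval launch, assuming a local cluster config.
    eval(
        ctx=wrap_arguments("++inference.temperature=1.0"),  # sampling override (assumed value)
        cluster="local",
        model="openai/gpt-oss-120b",       # model checkpoint or HF ID (placeholder)
        server_type="vllm",                # any supported serving backend
        server_gpus=8,
        benchmarks="aime24:16,aime25:16",  # 16 samples/problem -> majority@16 and pass@16
        output_dir="/workspace/gpt-oss-eval",
        with_sandbox=True,                 # attach the stateful Python sandbox (assumed flag)
    )

Once the jobs finish, ns summarize_results /workspace/gpt-oss-eval aggregates the generations and prints a table in the format shown above.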

Building an Efficient Inference Engine for Solving Math Problems

This tutorial guides you through building a high-performance inference engine with NeMo-Skills to tackle complex math problems, following the inference pipeline used to win the AIMO-2 competition. With FP8 quantization and ReDrafter speculative decoding, we achieve up to 4× faster batched inference compared to BF16 on two H100 GPUs.

We will use TensorRT-LLM for optimized model serving, including ReDrafter for speculative decoding.

By the end of this tutorial and companion notebook, you will have a local setup capable of running efficient inference with a large language model (LLM) integrated with a code execution sandbox.
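As a preview of what that setup looks like, here is a hedged sketch of launching the server through the NeMo-Skills pipeline API. The engine path, GPU count, and sandbox flag are placeholders; building the FP8 + ReDrafter engine itself is covered in the tutorial and companion notebook.

    from nemo_skills.pipeline.cli import start_server, wrap_arguments

    # Minimal sketch: serve a prebuilt TensorRT-LLM engine with a code
    # execution sandbox attached. The engine path is a placeholder.
    start_server(
        ctx=wrap_arguments(""),
        cluster="local",
        model="/workspace/trtllm-engine",  # hypothetical path to the built engine
        server_type="trtllm",
        server_gpus=2,                     # matches the two-H100 setup above
        with_sandbox=True,                 # sandbox for code execution (assumed flag)
    )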

Reproducing Llama-3.3-Nemotron-Super-49B-v1.5 Evals

In this tutorial, we will reproduce the evals for the Llama-3.3-Nemotron-Super-49B-v1.5 model using NeMo-Skills. For an introduction to the NeMo-Skills framework, we recommend going over our introductory tutorial.

We assume you have /workspace defined in your cluster config and are executing all commands from that folder locally. Adjust the commands accordingly if running on Slurm or using different paths.
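For a flavor of what a single benchmark run looks like, here is a minimal sketch; the benchmark list, sample count, and server settings are illustrative assumptions rather than the full eval suite from the model card.

    from nemo_skills.pipeline.cli import eval, wrap_arguments

    # Minimal sketch of one eval job; double-check the HF model ID and
    # extend the benchmarks list to match the model card.
    eval(
        ctx=wrap_arguments(""),
        cluster="local",   # assumes a cluster config with /workspace mounted
        model="nvidia/Llama-3_3-Nemotron-Super-49B-v1_5",
        server_type="vllm",
        server_gpus=8,
        benchmarks="aime24:16",  # placeholder; add the remaining benchmarks
        output_dir="/workspace/nemotron-evals",
    )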

A Simple Pipeline to Improve Math Reasoning Accuracy

This tutorial walks you through a simplified version of the pipeline that we used to win the AIMO-2 Kaggle competition. We will start with the Qwen2.5-14B-Instruct model, which scores only ~10% on the AIME24 benchmark, and improve it to ~30% through a series of NeMo-Skills jobs.

If you’re following along, you’ll need access to either an NVIDIA DGX box with eight NVIDIA A100 (or newer) GPUs or a Slurm cluster with similarly configured nodes. All commands combined should take only ~2 hours to run.
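Each improvement step is its own NeMo-Skills job. As an example of what these look like, below is a hedged sketch of a synthetic solution-generation step, where a stronger teacher model writes solutions that Qwen2.5-14B-Instruct can later be fine-tuned on; the prompt config, file paths, and teacher model are illustrative assumptions, not the tutorial's exact choices.

    from nemo_skills.pipeline.cli import generate, wrap_arguments

    # Minimal sketch of a solution-generation job.
    generate(
        ctx=wrap_arguments("++prompt_config=generic/math"),  # assumed prompt config name
        cluster="local",
        input_file="/workspace/problems.jsonl",  # hypothetical problem set
        output_dir="/workspace/solutions",
        model="Qwen/Qwen2.5-72B-Instruct",       # placeholder teacher model
        server_type="vllm",
        server_gpus=8,
        num_random_seeds=4,                      # sample several solutions per problem
    )

In the tutorial, downstream jobs then filter these generations, fine-tune the 14B model on them, and re-run the evals to measure the improvement.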