Inference with gpt-oss-120b using stateful Python code execution
In this tutorial, you will learn how to run inference with the gpt-oss-120b model using its built-in stateful Python code execution.
We will first reproduce the evaluation results on the AIME24 and AIME25 benchmarks (reaching 100% with majority voting!) and then extend the setup to arbitrary synthetic data generation, with or without Python tool use.
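Before diving in, it helps to see what "stateful" code execution means: the model's Python snippets run against a single interpreter session, so variables defined in one tool call remain visible to the next. Below is a minimal illustrative sketch of that idea (it is an assumption for explanation, not the actual tool implementation), using `exec` with a persistent namespace:

```python
import contextlib
import io


class StatefulPythonExecutor:
    """Toy stateful executor: one shared namespace across all run() calls."""

    def __init__(self):
        self.namespace = {}  # persists between snippets, giving "statefulness"

    def run(self, code: str) -> str:
        # Capture anything the snippet prints and return it as the tool output.
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.namespace)
        return buf.getvalue()


ex = StatefulPythonExecutor()
ex.run("x = 21")               # first call defines x...
print(ex.run("print(x * 2)"))  # ...and a later call can still read it: prints 42
```

A real deployment would sandbox this execution; the sketch only demonstrates the persistence of state between calls.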
---------------------------------------- aime24 ----------------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-16] | 30          | 13306      | 1645        | 96.46%           | 0.42%
majority@16       | 30          | 13306      | 1645        | 100.00%          | 0.00%
pass@16           | 30          | 13306      | 1645        | 100.00%          | 0.00%
---------------------------------------- aime25 ----------------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-16] | 30          | 14463      | 1717        | 96.67%           | 0.83%
majority@16       | 30          | 14463      | 1717        | 100.00%          | 0.00%
pass@16           | 30          | 14463      | 1717        | 100.00%          | 0.00%
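The three evaluation modes in the tables differ only in how the 16 sampled answers per problem are scored: pass@1[avg-of-16] averages per-sample accuracy, majority@16 scores the most common answer, and pass@16 counts a problem as solved if any sample is correct. A small sketch of these metrics for a single problem (function names are illustrative, not taken from the evaluation code):

```python
from collections import Counter


def avg_pass_at_1(answers: list[str], reference: str) -> float:
    """pass@1 averaged over k samples: fraction of samples matching the reference."""
    return sum(a == reference for a in answers) / len(answers)


def majority_at_k(answers: list[str], reference: str) -> bool:
    """majority@k: correct iff the most frequent sampled answer matches."""
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common == reference


def pass_at_k(answers: list[str], reference: str) -> bool:
    """pass@k: correct iff at least one sampled answer matches."""
    return reference in answers


# 16 samples would be used in practice; 4 keep the example short.
samples = ["204", "204", "197", "204"]
print(avg_pass_at_1(samples, "204"))  # 0.75
print(majority_at_k(samples, "204"))  # True
print(pass_at_k(samples, "204"))      # True
```

This also explains why majority@16 can exceed pass@1[avg-of-16]: occasional wrong samples lower the average, but voting filters them out as long as the correct answer remains the most frequent one.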