Inference with gpt-oss-120b using stateful Python code execution
In this tutorial, you will learn how to run inference with the gpt-oss-120b model using its built-in stateful Python code execution.
We will first reproduce the evaluation results on the AIME24 and AIME25 benchmarks (reaching 100% with majority voting!) and then extend the setup to arbitrary synthetic data generation, with or without Python tool use.
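Before diving in, it helps to see what "stateful" code execution means: the model's Python snippets run against a single interpreter session, so variables defined in one tool call remain visible to the next. Below is a minimal illustrative sketch of that idea (it is an assumption for explanation, not the actual tool implementation), using `exec` with a persistent namespace:

```python
import contextlib
import io


class StatefulPythonExecutor:
    """Toy stateful executor: one shared namespace across all run() calls."""

    def __init__(self):
        self.namespace = {}  # persists between snippets, giving "statefulness"

    def run(self, code: str) -> str:
        # Capture anything the snippet prints and return it as the tool output.
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, self.namespace)
        return buf.getvalue()


ex = StatefulPythonExecutor()
ex.run("x = 21")               # first call defines x...
print(ex.run("print(x * 2)"))  # ...and a later call can still read it: prints 42
```

A real deployment would sandbox this execution; the sketch only demonstrates the persistence of state between calls.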
---------------------------------------- aime24 ----------------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-16] | 30          | 13306      | 1645        | 96.46%           | 0.42%
majority@16       | 30          | 13306      | 1645        | 100.00%          | 0.00%
pass@16           | 30          | 13306      | 1645        | 100.00%          | 0.00%
---------------------------------------- aime25 ----------------------------------------
evaluation_mode   | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer
pass@1[avg-of-16] | 30          | 14463      | 1717        | 96.67%           | 0.83%
majority@16       | 30          | 14463      | 1717        | 100.00%          | 0.00%
pass@16           | 30          | 14463      | 1717        | 100.00%          | 0.00%
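The three evaluation modes in the tables differ only in how the 16 sampled answers per problem are scored: pass@1[avg-of-16] averages per-sample accuracy, majority@16 scores the most common answer, and pass@16 counts a problem as solved if any sample is correct. A small sketch of these metrics for a single problem (function names are illustrative, not taken from the evaluation code):

```python
from collections import Counter


def avg_pass_at_1(answers: list[str], reference: str) -> float:
    """pass@1 averaged over k samples: fraction of samples matching the reference."""
    return sum(a == reference for a in answers) / len(answers)


def majority_at_k(answers: list[str], reference: str) -> bool:
    """majority@k: correct iff the most frequent sampled answer matches."""
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common == reference


def pass_at_k(answers: list[str], reference: str) -> bool:
    """pass@k: correct iff at least one sampled answer matches."""
    return reference in answers


# 16 samples would be used in practice; 4 keep the example short.
samples = ["204", "204", "197", "204"]
print(avg_pass_at_1(samples, "204"))  # 0.75
print(majority_at_k(samples, "204"))  # True
print(pass_at_k(samples, "204"))      # True
```

This also explains why majority@16 can exceed pass@1[avg-of-16]: occasional wrong samples lower the average, but voting filters them out as long as the correct answer remains the most frequent one.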