Parallel Thinking¶
Parallel thinking encompasses methods that scale inference-time compute via parallel sampling. We currently support two such methods:

- GenSelect is a generative Best-of-N method we introduced in the OpenReasoning paper, followed by a more focused paper -- GenSelect: A Generative Approach to Best-of-N. The method uses an LLM to reason over the N candidate solutions and select the best one, leveraging LLMs' comparative strengths while scaling efficiently across parallel sampling budgets.
- GenSynthesis takes the input candidate solutions and outputs a new solution, with the goal of improving over the input solutions.
Usage¶
We support parallel thinking via the generation pipeline. Pass in the following parameters for the different parallel thinking modes:

- For GenSelect, `++parallel_thinking.mode=genselect`
- For GenSynthesis, `++parallel_thinking.mode=gensynthesis`
We support both offline and online parallel thinking (both are illustrated below):

- Offline mode: The candidate solutions/trajectories have already been generated and can be specified via `++parallel_thinking.generation_dir=<PATH_TO_GENERATED_DIR>`.
- Online mode: The candidate solutions are generated as part of the generation job.
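For instance, appending the following fragments to an eval command toggles between the two modes; the `generation_dir` path here is a hypothetical placeholder, not a real output location:

```bash
# Online GenSelect: candidates are generated within the same job
++parallel_thinking.mode=genselect

# Offline GenSelect: reuse candidates from an earlier generation run
# (the directory below is a hypothetical example path)
++parallel_thinking.mode=genselect \
++parallel_thinking.generation_dir=/experiments/generations/aime25
```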
Note

The parallel thinking pipeline uses the same inference parameters as the generate pipeline. We allow overriding two key inference config parameters:

- `temperature` via `++parallel_thinking.temperature=<VALUE>`
- `tokens_to_generate` via `++parallel_thinking.tokens_to_generate=<VALUE>`
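As an illustration, the following fragment (the values are arbitrary) gives the parallel thinking step its own sampling settings, independent of the `++inference.*` parameters used for candidate generation:

```bash
# Illustrative overrides for the parallel thinking step only;
# candidate generation keeps the regular ++inference.* settings
++parallel_thinking.temperature=0.6 \
++parallel_thinking.tokens_to_generate=8192
```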
Common Parameters¶

- `window_size`: Number of solutions processed in a single parallel thinking input (set to 8 by default). Consider your model's context window size when setting this value (or allow for soft failure via `++server.enable_soft_fail=True`).
- `solution_key`: The key from the generation output used to identify the solution content (default: `generation`).
Offline Parallel Thinking Parameters¶

These parameters only need to be passed when running offline parallel thinking.

- `generation_dir`: The directory where the offline generated solutions are stored. We assume the solutions are in `output-rs*.jsonl` files.
- `num_initial_solutions`: Number of solutions from the offline generated solutions that are used for parallel thinking.
To specify any of the above variables, say `window_size=16`, pass `++parallel_thinking.window_size=16` to the generate/eval pipelines.
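Putting these together, a hypothetical offline GenSelect run could look like the sketch below; the model, benchmark, and all paths are placeholders, and `generation_dir` must contain the `output-rs*.jsonl` files described above:

```bash
# Offline GenSelect over 16 pre-generated solutions, processed in one window
# (all paths and the model below are illustrative placeholders)
ns eval \
    --benchmarks aime25 \
    --cluster local \
    --model Qwen/Qwen3-8B \
    --server_gpus 2 \
    --server_type vllm \
    --output_dir /experiments/qwen3_8b/genselect_offline \
    ++parallel_thinking.mode=genselect \
    ++parallel_thinking.window_size=16 \
    ++parallel_thinking.num_initial_solutions=16 \
    ++parallel_thinking.generation_dir=/experiments/qwen3_8b/generations/aime25 \
    ++server.enable_soft_fail=True
```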
Sample Examples¶
Online Parallel Thinking (via GenSynthesis)¶
In this example, we show how to use GenSynthesis for `aime25` with Qwen/Qwen3-8B.
```bash
ns eval \
    --benchmarks aime25 \
    --cluster local \
    --model Qwen/Qwen3-8B \
    --server_gpus 2 \
    --server_type vllm \
    --output_dir /experiments/qwen3_8b/gensynthesis \
    ++inference.tokens_to_generate=16384 \
    ++parallel_thinking.mode=gensynthesis \
    ++server.enable_soft_fail=True \
    ++server.context_limit_retry_strategy=reduce_generation
```
The evaluation pipeline first generates `window_size` solutions (8 by default) and then runs GenSynthesis with these solutions in the prompt to synthesize a new solution.
Note that the same model is used for both solution generation and synthesis, which we refer to as Self-GenSynthesis.
Tip

Parallel thinking inputs can consume a lot of tokens, especially for large `window_size` values.
To avoid running into context length issues, we recommend running these pipelines with `++server.enable_soft_fail=True`, as in the above command.
To retry generation with a reduced prompt or generation budget, we recommend trying the supported context reduction strategies.
In the above example, we use `++server.enable_soft_fail=True ++server.context_limit_retry_strategy=reduce_generation`, which reduces the generation budget when the context limit is exceeded.
Offline Parallel Thinking (via GenSelect)¶
Offline parallel thinking splits candidate generation and processing (selection/synthesis) into two separate steps. Two use cases are currently supported only with offline parallel thinking:

- Using a different model for generation and processing
- Using a processed version of the generated outputs as input to parallel thinking

In the following example, we use Qwen/Qwen3-8B to perform GenSelect over solutions generated by Qwen/Qwen3-4B for `livecodebench`.
```python
from nemo_skills.pipeline.cli import eval, wrap_arguments

# Generate initial solutions
eval(
    ctx=wrap_arguments(
        "++inference.tokens_to_generate=16384 "
        "++inference.temperature=0.6 "
    ),
    cluster="local",
    benchmarks="livecodebench:8",
    output_dir="/workspace/qwen3_4b_evals/",
    server_type="vllm",
    server_gpus=1,
    model="Qwen/Qwen3-4B",
    expname="initial-soln-qwen3-4b-livecodebench",
)

# Run parallel thinking on initial solutions
# Using GenSelect with Qwen3-8B
eval(
    ctx=wrap_arguments(
        "++parallel_thinking.tokens_to_generate=16384 "
        "++parallel_thinking.temperature=0.6 "
        "++parallel_thinking.mode=genselect "
        "++parallel_thinking.solution_key=completion "
        "++parallel_thinking.generation_dir=/workspace/qwen3_4b_evals/eval-results/livecodebench "
    ),
    cluster="local",
    benchmarks="livecodebench:8",
    output_dir="/workspace/qwen3_4b_evals/genselect_qwen3_8b",
    server_type="vllm",
    server_gpus=2,
    model="Qwen/Qwen3-8B",
    run_after="initial-soln-qwen3-4b-livecodebench",
    expname="parallel-thinking-qwen3-8b-livecodebench",
)
```
There are three things we want to highlight in the above example:

- We run the GenSelect step a total of 8 times (`livecodebench:8`) over the same set of solutions.
- The pre-generated solutions are specified via `++parallel_thinking.generation_dir=/workspace/qwen3_4b_evals/eval-results/livecodebench`.
- Instead of the usual `generation` key for identifying the solution content, we use `++parallel_thinking.solution_key=completion`.
The `completion` key in `livecodebench` outputs contains just the extracted code from the generated solutions. For coding tasks, we empirically find that representing the candidate solution with just the extracted code performs better than including the surrounding text.
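If you are unsure which keys your pre-generated files contain before choosing a `solution_key`, you can inspect a single line of one `output-rs*.jsonl` file; the path below follows the example above, and using `jq` here is just one convenient option:

```bash
# Print the top-level keys of the first generated sample, e.g. to confirm
# that a "completion" field exists before setting solution_key=completion
head -n 1 /workspace/qwen3_4b_evals/eval-results/livecodebench/output-rs0.jsonl | jq 'keys'
```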