EAGLE Accept Kernels
- size_t trt_edgellm::kernel::getEagleAcceptWorkspaceSize(int32_t batchSize, int32_t numTokens)
Calculate the workspace size required for the Eagle accept algorithm.
- Parameters:
batchSize – Number of batches to process
numTokens – Number of tokens per batch
- Returns:
Required workspace size in bytes
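A minimal host-side sketch of typical usage, assuming the library header has been included; the helper name and the omitted error handling are illustrative, not part of the trt_edgellm API:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
// #include "trt_edgellm/kernel/eagleAccept.h"  // assumed header; the real path may differ

// Query the required size, then allocate a device scratch buffer of exactly that
// size for the upcoming eagleAccept call. eagleAccept performs no allocations of
// its own; the caller owns this buffer.
void* allocateEagleAcceptWorkspace(int32_t batchSize, int32_t numTokens, size_t& workspaceSize)
{
    workspaceSize = trt_edgellm::kernel::getEagleAcceptWorkspaceSize(batchSize, numTokens);

    void* workspace = nullptr;
    cudaMalloc(&workspace, workspaceSize);  // error checking omitted for brevity
    return workspace;
}
```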
- void trt_edgellm::kernel::eagleAccept(
      rt::Tensor const &logits,
      rt::Tensor const &tokenIds,
      rt::Tensor const &attentionMask,
      rt::Tensor &acceptedTokenIds,
      rt::Tensor &acceptedLogitsIndices,
      rt::Tensor &acceptLength,
      rt::OptionalInputTensor const &vocabMappingTable,
      void *workspace,
      size_t workspaceSize,
      cudaStream_t stream)
Eagle accept kernel for speculative decoding tree verification.
This kernel implements the Eagle accept algorithm, which:
- Takes logits of shape [batch_size, num_tokens, vocab_size]
- Takes a draft tree represented as token IDs [batch_size, num_tokens] and an attention mask [batch_size, num_tokens, num_tokens] (a small example layout is shown below this list)
- Verifies the tree by selecting top-1 tokens and checking attention relationships with depth awareness
- Returns the accepted token IDs and their corresponding tree indices as 2D tensors of shape [batch_size, max_depth]
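For intuition, here is a tiny illustrative draft tree for a single batch entry with num_tokens = 4. The token values are made up, and whether the mask's self-entry (diagonal) is set is an assumption used only to make the depth counting concrete:

```cpp
#include <cstdint>

// Illustrative draft tree for one batch entry, num_tokens = 4 (token values are made up).
//
//        t0            depth 0 (always accepted)
//       /  \
//      t1   t2         depth 1
//      |
//      t3              depth 2
//
// tokenIds[num_tokens]: the draft tokens in tree order.
int32_t tokenIds[4] = {15, 120, 87, 42};

// attentionMask[num_tokens][num_tokens] (INT8, boolean): row i marks which tokens
// token i attends to. Assuming the self-entry is set, a token at depth d has d + 1
// ones in its row, i.e. it attends to d *other* tokens, matching the description above.
int8_t attentionMask[4][4] = {
    {1, 0, 0, 0},  // t0: root, depth 0
    {1, 1, 0, 0},  // t1: attends to t0, depth 1
    {1, 0, 1, 0},  // t2: attends to t0, depth 1
    {1, 1, 0, 1},  // t3: attends to t0 and t1, depth 2
};
```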
Algorithm (a sequential CPU sketch of this loop follows the list):
- Token 0 is always selected (depth 0).
- For each subsequent token, pick the top-1 token from the logits.
- If a vocab mapping table is provided, map the selected token from the reduced vocabulary to the full vocabulary.
- Check whether the selected token exists at the correct depth in the tree and attends to the previously accepted token.
- Tree depth is computed from the attention mask: a token at depth d attends to d other tokens.
- Continue until no valid attention relationship is found or the maximum depth is reached.
- Batches are processed concurrently using parallel GPU blocks.
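A sequential CPU sketch of the loop above, under the assumption that the top-1 token is taken from the logits of the previously accepted position and that per-token depths have already been derived from the attention mask. Function and variable names are illustrative; this is not the device implementation:

```cpp
#include <cstdint>
#include <vector>

// Reference acceptance loop for a single batch entry. argmaxToken is assumed to
// already hold the top-1 token of each position's logits (stage 1 of the kernel).
int32_t eagleAcceptReference(
    std::vector<int32_t> const& argmaxToken,   // [num_tokens] top-1 token per position
    std::vector<int32_t> const& tokenIds,      // [num_tokens] draft tree tokens
    std::vector<int8_t> const& attentionMask,  // [num_tokens * num_tokens] boolean
    std::vector<int32_t> const& tokenDepth,    // [num_tokens] depth of each tree node
    int32_t const* vocabMappingTable,          // reduced->full vocab map, nullptr if unused
    std::vector<int32_t>& acceptedTokenIds,    // out [max_depth]
    std::vector<int32_t>& acceptedIndices)     // out [max_depth]
{
    int32_t const numTokens = static_cast<int32_t>(tokenIds.size());
    int32_t const maxDepth = static_cast<int32_t>(acceptedTokenIds.size());

    // Token 0 (the tree root) is always accepted.
    acceptedTokenIds[0] = tokenIds[0];
    acceptedIndices[0] = 0;
    int32_t acceptLength = 1;
    int32_t prevIndex = 0;

    for (int32_t depth = 1; depth < maxDepth; ++depth)
    {
        // Top-1 prediction at the previously accepted position, mapped back to
        // the full vocabulary when a mapping table is supplied.
        int32_t predicted = argmaxToken[prevIndex];
        if (vocabMappingTable != nullptr)
        {
            predicted = vocabMappingTable[predicted];
        }

        // Look for a tree node at this depth that carries the predicted token
        // and attends to the previously accepted node.
        int32_t match = -1;
        for (int32_t i = 0; i < numTokens; ++i)
        {
            if (tokenDepth[i] == depth && tokenIds[i] == predicted
                && attentionMask[i * numTokens + prevIndex] != 0)
            {
                match = i;
                break;
            }
        }
        if (match < 0)
        {
            break;  // no valid continuation at this depth: stop accepting
        }

        acceptedTokenIds[acceptLength] = tokenIds[match];
        acceptedIndices[acceptLength] = match;
        ++acceptLength;
        prevIndex = match;
    }
    return acceptLength;  // the real kernel writes this to acceptLength[b]
}
```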
Optimizations:
- Two-stage approach: precompute top-1 tokens separately to reduce shared memory usage
- Concurrent batch processing: each batch runs in its own GPU block
- Parallel argmax reduction using CUB to find the top-1 tokens in stage 1 (a standalone sketch of this pattern follows the list)
- Parallel token search across threads within each block
- Depth-aware token selection to respect the tree's structural layers
- Minimal shared memory allocation (only token depths, not full vocab logits)
- Uses the provided workspace to avoid dynamic allocation
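As an illustration of the stage-1 pattern, here is a minimal CUDA sketch of a per-block argmax over one row of logits using cub::BlockReduce. The kernel name, launch geometry, and data layout are assumptions rather than trt_edgellm's actual stage-1 kernel:

```cuda
#include <cub/cub.cuh>
#include <cfloat>
#include <cstdint>

struct ArgMax
{
    float value;
    int32_t index;
};

struct ArgMaxOp
{
    __device__ ArgMax operator()(ArgMax const& a, ArgMax const& b) const
    {
        return (b.value > a.value) ? b : a;
    }
};

// One thread block reduces one row of logits (one [batch, token] pair) to its top-1 token.
template <int32_t kBlockThreads>
__global__ void blockArgMaxKernel(float const* logits, int32_t vocabSize, int32_t* topTokenOut)
{
    using BlockReduce = cub::BlockReduce<ArgMax, kBlockThreads>;
    __shared__ typename BlockReduce::TempStorage tempStorage;  // CUB temp storage (the stage-1 shared memory)

    float const* row = logits + static_cast<int64_t>(blockIdx.x) * vocabSize;

    // Each thread scans a strided slice of the vocabulary.
    ArgMax local{-FLT_MAX, -1};
    for (int32_t v = threadIdx.x; v < vocabSize; v += kBlockThreads)
    {
        if (row[v] > local.value)
        {
            local = ArgMax{row[v], v};
        }
    }

    // Block-wide reduction; the result is valid on thread 0 only.
    ArgMax best = BlockReduce(tempStorage).Reduce(local, ArgMaxOp{});
    if (threadIdx.x == 0)
    {
        topTokenOut[blockIdx.x] = best.index;
    }
}
```

Launched with one block per (batch, token) pair, this fills a [batch_size * num_tokens] array of top-1 token indices that a stage-2 pass can then consume.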
Note: All tensor parameters must be allocated on the GPU device.
Note: The workspace must be at least getEagleAcceptWorkspaceSize(batchSize, numTokens) bytes.
Note: Shared memory usage: stage 1 uses CUB temp storage (~1 KB); stage 2 uses numTokens * sizeof(int32_t) plus a small overhead.
Note: vocabMappingTable should be provided when the base model uses a reduced vocabulary.
- Parameters:
logits – Input logits tensor with shape [batch_size, num_tokens, vocab_size] (FP32, GPU)
tokenIds – Draft tree token IDs with shape [batch_size, num_tokens] (INT32, GPU)
attentionMask – Tree attention mask with shape [batch_size, num_tokens, num_tokens] (INT8, boolean, GPU)
acceptedTokenIds – Output accepted token IDs with shape [batch_size, max_depth] (INT32, GPU)
acceptedLogitsIndices – Output corresponding logits indices with shape [batch_size, max_depth] (INT32, GPU)
acceptLength – Output tensor with accept lengths for each batch with shape [batch_size] (INT32, GPU)
vocabMappingTable – Optional vocab mapping table for reduced vocabulary (INT32, GPU, 1D). Use std::nullopt if not needed.
workspace – Workspace buffer for temporary allocations
workspaceSize – Size of workspace buffer in bytes
stream – CUDA stream for execution
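Putting the pieces together, a hedged end-to-end call-site sketch. The rt::Tensor arguments are assumed to be created elsewhere with the shapes and dtypes listed above, no reduced vocabulary is used (so std::nullopt is passed for the mapping table), and the header path in the comment is a guess:

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <optional>
// #include "trt_edgellm/kernel/eagleAccept.h"  // assumed header; the real path may differ

void runEagleAccept(
    rt::Tensor const& logits,           // [batch_size, num_tokens, vocab_size], FP32, GPU
    rt::Tensor const& tokenIds,         // [batch_size, num_tokens], INT32, GPU
    rt::Tensor const& attentionMask,    // [batch_size, num_tokens, num_tokens], INT8, GPU
    rt::Tensor& acceptedTokenIds,       // [batch_size, max_depth], INT32, GPU
    rt::Tensor& acceptedLogitsIndices,  // [batch_size, max_depth], INT32, GPU
    rt::Tensor& acceptLength,           // [batch_size], INT32, GPU
    int32_t batchSize, int32_t numTokens, cudaStream_t stream)
{
    // Size and allocate the scratch buffer; it can be reused across calls that
    // share the same (batchSize, numTokens).
    size_t const workspaceSize
        = trt_edgellm::kernel::getEagleAcceptWorkspaceSize(batchSize, numTokens);
    void* workspace = nullptr;
    cudaMallocAsync(&workspace, workspaceSize, stream);  // error checking omitted

    trt_edgellm::kernel::eagleAccept(
        logits, tokenIds, attentionMask,
        acceptedTokenIds, acceptedLogitsIndices, acceptLength,
        std::nullopt, workspace, workspaceSize, stream);

    cudaFreeAsync(workspace, stream);
}
```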