EAGLE Accept Kernels
- size_t trt_edgellm::kernel::getEagleAcceptWorkspaceSize(int32_t batchSize, int32_t numTokens)
Calculate the workspace size required for the Eagle accept algorithm.
- Parameters:
batchSize – Number of batches to process
numTokens – Number of tokens per batch
- Returns:
Required workspace size in bytes
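A minimal host-side sketch of typical usage, assuming the library header has been included; the helper name and the omitted error handling are illustrative, not part of the trt_edgellm API:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
// #include "trt_edgellm/kernel/eagleAccept.h"  // assumed header; the real path may differ

// Query the required size, then allocate a device scratch buffer of exactly that
// size for the upcoming eagleAccept call. eagleAccept performs no allocations of
// its own; the caller owns this buffer.
void* allocateEagleAcceptWorkspace(int32_t batchSize, int32_t numTokens, size_t& workspaceSize)
{
    workspaceSize = trt_edgellm::kernel::getEagleAcceptWorkspaceSize(batchSize, numTokens);

    void* workspace = nullptr;
    cudaMalloc(&workspace, workspaceSize);  // error checking omitted for brevity
    return workspace;
}
```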
- void trt_edgellm::kernel::eagleAccept(
      rt::Tensor const &logits,
      rt::Tensor const &tokenIds,
      rt::Tensor const &attentionMask,
      rt::Tensor &acceptedTokenIds,
      rt::Tensor &acceptedLogitsIndices,
      rt::Tensor &acceptLength,
      rt::OptionalInputTensor const &vocabMappingTable,
      void *workspace,
      size_t workspaceSize,
      cudaStream_t stream)
Eagle accept kernel for speculative decoding tree verification.
This kernel implements the Eagle accept algorithm, which:
- Takes logits of shape [batch_size, num_tokens, vocab_size]
- Takes a draft tree represented as token IDs [batch_size, num_tokens] and an attention mask [batch_size, num_tokens, num_tokens] (a small example layout is shown below this list)
- Verifies the tree by selecting top-1 tokens and checking attention relationships with depth awareness
- Returns the accepted token IDs and their corresponding tree indices as 2D tensors of shape [batch_size, max_depth]
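For intuition, here is a tiny illustrative draft tree for a single batch entry with num_tokens = 4. The token values are made up, and whether the mask's self-entry (diagonal) is set is an assumption used only to make the depth counting concrete:

```cpp
#include <cstdint>

// Illustrative draft tree for one batch entry, num_tokens = 4 (token values are made up).
//
//        t0            depth 0 (always accepted)
//       /  \
//      t1   t2         depth 1
//      |
//      t3              depth 2
//
// tokenIds[num_tokens]: the draft tokens in tree order.
int32_t tokenIds[4] = {15, 120, 87, 42};

// attentionMask[num_tokens][num_tokens] (INT8, boolean): row i marks which tokens
// token i attends to. Assuming the self-entry is set, a token at depth d has d + 1
// ones in its row, i.e. it attends to d *other* tokens, matching the description above.
int8_t attentionMask[4][4] = {
    {1, 0, 0, 0},  // t0: root, depth 0
    {1, 1, 0, 0},  // t1: attends to t0, depth 1
    {1, 0, 1, 0},  // t2: attends to t0, depth 1
    {1, 1, 0, 1},  // t3: attends to t0 and t1, depth 2
};
```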
Algorithm (a sequential CPU sketch of this loop follows the list):
- Token 0 is always selected (depth 0).
- For each subsequent token, pick the top-1 token from the logits.
- If a vocab mapping table is provided, map the selected token from the reduced vocabulary to the full vocabulary.
- Check whether the selected token exists at the correct depth in the tree and attends to the previously accepted token.
- Tree depth is computed from the attention mask: a token at depth d attends to d other tokens.
- Continue until no valid attention relationship is found or the maximum depth is reached.
- Batches are processed concurrently using parallel GPU blocks.
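A sequential CPU sketch of the loop above, under the assumption that the top-1 token is taken from the logits of the previously accepted position and that per-token depths have already been derived from the attention mask. Function and variable names are illustrative; this is not the device implementation:

```cpp
#include <cstdint>
#include <vector>

// Reference acceptance loop for a single batch entry. argmaxToken is assumed to
// already hold the top-1 token of each position's logits (stage 1 of the kernel).
int32_t eagleAcceptReference(
    std::vector<int32_t> const& argmaxToken,   // [num_tokens] top-1 token per position
    std::vector<int32_t> const& tokenIds,      // [num_tokens] draft tree tokens
    std::vector<int8_t> const& attentionMask,  // [num_tokens * num_tokens] boolean
    std::vector<int32_t> const& tokenDepth,    // [num_tokens] depth of each tree node
    int32_t const* vocabMappingTable,          // reduced->full vocab map, nullptr if unused
    std::vector<int32_t>& acceptedTokenIds,    // out [max_depth]
    std::vector<int32_t>& acceptedIndices)     // out [max_depth]
{
    int32_t const numTokens = static_cast<int32_t>(tokenIds.size());
    int32_t const maxDepth = static_cast<int32_t>(acceptedTokenIds.size());

    // Token 0 (the tree root) is always accepted.
    acceptedTokenIds[0] = tokenIds[0];
    acceptedIndices[0] = 0;
    int32_t acceptLength = 1;
    int32_t prevIndex = 0;

    for (int32_t depth = 1; depth < maxDepth; ++depth)
    {
        // Top-1 prediction at the previously accepted position, mapped back to
        // the full vocabulary when a mapping table is supplied.
        int32_t predicted = argmaxToken[prevIndex];
        if (vocabMappingTable != nullptr)
        {
            predicted = vocabMappingTable[predicted];
        }

        // Look for a tree node at this depth that carries the predicted token
        // and attends to the previously accepted node.
        int32_t match = -1;
        for (int32_t i = 0; i < numTokens; ++i)
        {
            if (tokenDepth[i] == depth && tokenIds[i] == predicted
                && attentionMask[i * numTokens + prevIndex] != 0)
            {
                match = i;
                break;
            }
        }
        if (match < 0)
        {
            break;  // no valid continuation at this depth: stop accepting
        }

        acceptedTokenIds[acceptLength] = tokenIds[match];
        acceptedIndices[acceptLength] = match;
        ++acceptLength;
        prevIndex = match;
    }
    return acceptLength;  // the real kernel writes this to acceptLength[b]
}
```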
Optimizations:
- Two-stage approach: precompute top-1 tokens separately to reduce shared memory usage
- Concurrent batch processing: each batch runs in its own GPU block
- Parallel argmax reduction using CUB to find the top-1 tokens in stage 1 (a standalone sketch of this pattern follows the list)
- Parallel token search across threads within each block
- Depth-aware token selection to respect the tree's structural layers
- Minimal shared memory allocation (only token depths, not full vocab logits)
- Uses the provided workspace to avoid dynamic allocation
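As an illustration of the stage-1 pattern, here is a minimal CUDA sketch of a per-block argmax over one row of logits using cub::BlockReduce. The kernel name, launch geometry, and data layout are assumptions rather than trt_edgellm's actual stage-1 kernel:

```cuda
#include <cub/cub.cuh>
#include <cfloat>
#include <cstdint>

struct ArgMax
{
    float value;
    int32_t index;
};

struct ArgMaxOp
{
    __device__ ArgMax operator()(ArgMax const& a, ArgMax const& b) const
    {
        return (b.value > a.value) ? b : a;
    }
};

// One thread block reduces one row of logits (one [batch, token] pair) to its top-1 token.
template <int32_t kBlockThreads>
__global__ void blockArgMaxKernel(float const* logits, int32_t vocabSize, int32_t* topTokenOut)
{
    using BlockReduce = cub::BlockReduce<ArgMax, kBlockThreads>;
    __shared__ typename BlockReduce::TempStorage tempStorage;  // CUB temp storage (the stage-1 shared memory)

    float const* row = logits + static_cast<int64_t>(blockIdx.x) * vocabSize;

    // Each thread scans a strided slice of the vocabulary.
    ArgMax local{-FLT_MAX, -1};
    for (int32_t v = threadIdx.x; v < vocabSize; v += kBlockThreads)
    {
        if (row[v] > local.value)
        {
            local = ArgMax{row[v], v};
        }
    }

    // Block-wide reduction; the result is valid on thread 0 only.
    ArgMax best = BlockReduce(tempStorage).Reduce(local, ArgMaxOp{});
    if (threadIdx.x == 0)
    {
        topTokenOut[blockIdx.x] = best.index;
    }
}
```

Launched with one block per (batch, token) pair, this fills a [batch_size * num_tokens] array of top-1 token indices that a stage-2 pass can then consume.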
Note: All tensor parameters must be allocated on the GPU device.
Note: The workspace must be at least getEagleAcceptWorkspaceSize(batchSize, numTokens) bytes.
Note: Shared memory usage: stage 1 uses CUB temp storage (~1 KB); stage 2 uses numTokens * sizeof(int32_t) plus a small overhead.
Note: vocabMappingTable should be provided when the base model uses a reduced vocabulary.
- Parameters:
logits – Input logits tensor with shape [batch_size, num_tokens, vocab_size] (FP32, GPU)
tokenIds – Draft tree token IDs with shape [batch_size, num_tokens] (INT32, GPU)
attentionMask – Tree attention mask with shape [batch_size, num_tokens, num_tokens] (INT8, boolean, GPU)
acceptedTokenIds – Output accepted token IDs with shape [batch_size, max_depth] (INT32, GPU)
acceptedLogitsIndices – Output corresponding logits indices with shape [batch_size, max_depth] (INT32, GPU)
acceptLength – Output tensor with accept lengths for each batch with shape [batch_size] (INT32, GPU)
vocabMappingTable – Optional vocab mapping table for reduced vocabulary (INT32, GPU, 1D). Use std::nullopt if not needed.
workspace – Workspace buffer for temporary allocations
workspaceSize – Size of workspace buffer in bytes
stream – CUDA stream for execution
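Putting the pieces together, a hedged end-to-end call-site sketch. The rt::Tensor arguments are assumed to be created elsewhere with the shapes and dtypes listed above, no reduced vocabulary is used (so std::nullopt is passed for the mapping table), and the header path in the comment is a guess:

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <optional>
// #include "trt_edgellm/kernel/eagleAccept.h"  // assumed header; the real path may differ

void runEagleAccept(
    rt::Tensor const& logits,           // [batch_size, num_tokens, vocab_size], FP32, GPU
    rt::Tensor const& tokenIds,         // [batch_size, num_tokens], INT32, GPU
    rt::Tensor const& attentionMask,    // [batch_size, num_tokens, num_tokens], INT8, GPU
    rt::Tensor& acceptedTokenIds,       // [batch_size, max_depth], INT32, GPU
    rt::Tensor& acceptedLogitsIndices,  // [batch_size, max_depth], INT32, GPU
    rt::Tensor& acceptLength,           // [batch_size], INT32, GPU
    int32_t batchSize, int32_t numTokens, cudaStream_t stream)
{
    // Size and allocate the scratch buffer; it can be reused across calls that
    // share the same (batchSize, numTokens).
    size_t const workspaceSize
        = trt_edgellm::kernel::getEagleAcceptWorkspaceSize(batchSize, numTokens);
    void* workspace = nullptr;
    cudaMallocAsync(&workspace, workspaceSize, stream);  // error checking omitted

    trt_edgellm::kernel::eagleAccept(
        logits, tokenIds, attentionMask,
        acceptedTokenIds, acceptedLogitsIndices, acceptLength,
        std::nullopt, workspace, workspaceSize, stream);

    cudaFreeAsync(workspace, stream);
}
```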