EAGLE Accept Kernels

size_t trt_edgellm::kernel::getEagleAcceptWorkspaceSize(
    int32_t batchSize,
    int32_t numTokens
)

Calculates the workspace size required by the Eagle accept algorithm.

Parameters:
  • batchSize – Number of batches to process

  • numTokens – Number of tokens per batch

Returns:

Required workspace size in bytes
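
A minimal allocation sketch (illustrative; error checking omitted):

// Allocate the workspace up front so eagleAccept never allocates internally.
size_t wsBytes = trt_edgellm::kernel::getEagleAcceptWorkspaceSize(batchSize, numTokens);
void* workspace = nullptr;
cudaMalloc(&workspace, wsBytes); // must be GPU device memory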

void trt_edgellm::kernel::eagleAccept(
    rt::Tensor const &logits,
    rt::Tensor const &tokenIds,
    rt::Tensor const &attentionMask,
    rt::Tensor &acceptedTokenIds,
    rt::Tensor &acceptedLogitsIndices,
    rt::Tensor &acceptLength,
    rt::OptionalInputTensor const &vocabMappingTable,
    void *workspace,
    size_t workspaceSize,
    cudaStream_t stream
)

Eagle accept kernel for speculative decoding tree verification.

This kernel implements the Eagle accept algorithm, which:

  1. Takes logits of shape [batch_size, num_tokens, vocab_size]

  2. Takes a draft tree represented as token_ids of shape [batch_size, num_tokens] and an attention mask of shape [batch_size, num_tokens, num_tokens]

  3. Verifies the tree by selecting top-1 tokens and checking attention relationships with depth awareness

  4. Returns accepted token IDs and their corresponding tree indices as 2D tensors [batch_size, max_depth]

Algorithm (a simplified sketch follows the list):

  • Token 0 is always selected (depth 0)

  • For each subsequent token, pick top-1 from logits

  • If vocab mapping table is provided, map selected tokens from reduced vocab to full vocab

  • Check whether the selected token exists at the correct depth in the tree and attends to the previously accepted token

  • Tree depth is computed from the attention mask: a token at depth d attends to d other tokens

  • Continue until there is no valid continuation or the maximum depth is reached

  • Batches are processed concurrently using parallel GPU blocks
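
As a rough illustration, the steps above can be sketched on the host for a single batch entry (plain C++; tokenDepth, topToken, prev, and maxDepth are hypothetical names, and topToken[t] is assumed to hold the argmax of the logits at tree position t after any vocab mapping):

// Depth of each tree position, derived from the attention mask:
// a token at depth d attends to d other tokens.
std::vector<int32_t> tokenDepth(numTokens, 0);
for (int32_t t = 0; t < numTokens; ++t)
    for (int32_t j = 0; j < numTokens; ++j)
        if (j != t && attentionMask[t * numTokens + j])
            ++tokenDepth[t];

std::vector<int32_t> accepted = {tokenIds[0]}; // token 0 is always selected
int32_t prev = 0;                              // tree index of last accepted token
for (int32_t depth = 1; depth < maxDepth; ++depth)
{
    int32_t next = -1;
    for (int32_t t = 1; t < numTokens; ++t)
    {
        // A candidate must sit at this depth, carry the top-1 prediction made
        // at the previously accepted position, and attend to that position.
        if (tokenDepth[t] == depth && tokenIds[t] == topToken[prev]
            && attentionMask[t * numTokens + prev])
        {
            next = t;
            break;
        }
    }
    if (next < 0)
        break; // no valid continuation: stop accepting
    accepted.push_back(tokenIds[next]);
    prev = next;
}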

Optimizations:

  • Two-stage approach: precompute top-1 tokens separately to reduce shared memory usage

  • Concurrent batch processing: each batch runs in its own GPU block

  • Parallel argmax reduction using CUB to find top-1 tokens in stage 1 (see the sketch after this list)

  • Parallel token search across threads within each block

  • Depth-aware token selection to respect tree structure layers

  • Minimal shared memory allocation (only token depths, not full vocab logits)

  • Uses the provided workspace to avoid dynamic memory allocation
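
Stage 1's parallel argmax can look roughly like the following (an illustrative CUDA/CUB sketch, not the shipped kernel; topOneSketch and the fixed 256-thread block are assumptions):

#include <cfloat>
#include <cub/cub.cuh>

__global__ void topOneSketch(float const* logits, int32_t* topTokens, int32_t vocabSize)
{
    using Pair = cub::KeyValuePair<int, float>;
    using BlockReduce = cub::BlockReduce<Pair, 256>;
    __shared__ typename BlockReduce::TempStorage temp; // the ~1 KB noted below

    // blockIdx.x indexes one (batch, token) row; threads stride over the vocab.
    float const* row = logits + static_cast<size_t>(blockIdx.x) * vocabSize;
    Pair best{0, -FLT_MAX};
    for (int v = threadIdx.x; v < vocabSize; v += blockDim.x)
        if (row[v] > best.value)
            best = Pair{v, row[v]};

    // cub::ArgMax keeps the pair with the larger value (lower index on ties).
    Pair rowBest = BlockReduce(temp).Reduce(best, cub::ArgMax());
    if (threadIdx.x == 0)
        topTokens[blockIdx.x] = rowBest.key;
}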

Note

All tensor parameters must be allocated in GPU device memory

Note

Workspace must be at least getEagleAcceptWorkspaceSize(batchSize, numTokens) bytes

Note

Shared memory usage: Stage 1 uses CUB temporary storage (~1 KB); Stage 2 uses numTokens * sizeof(int32_t) bytes plus a small overhead

Note

vocabMappingTable should be provided when the base model uses a reduced vocabulary
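
Conceptually the mapping is a simple gather; in a sketch (reducedId is a hypothetical argmax index over the reduced-vocabulary logits):

// Translate a top-1 index over the reduced vocabulary into a full-vocab
// token id before comparing it against the draft tree's token ids.
int32_t fullVocabId = vocabMappingTable[reducedId];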

Parameters:
  • logits – Input logits tensor with shape [batch_size, num_tokens, vocab_size] (FP32, GPU)

  • tokenIds – Draft tree token IDs with shape [batch_size, num_tokens] (INT32, GPU)

  • attentionMask – Tree attention mask with shape [batch_size, num_tokens, num_tokens] (INT8, boolean, GPU)

  • acceptedTokenIds – Output accepted token IDs with shape [batch_size, max_depth] (INT32, GPU)

  • acceptedLogitsIndices – Output corresponding logits indices with shape [batch_size, max_depth] (INT32, GPU)

  • acceptLength – Output tensor with accept lengths for each batch with shape [batch_size] (INT32, GPU)

  • vocabMappingTable – Optional vocab mapping table for reduced vocabulary (INT32, GPU, 1D). Use std::nullopt if not needed.

  • workspace – Workspace buffer for temporary allocations

  • workspaceSize – Size of workspace buffer in bytes

  • stream – CUDA stream for execution
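
For reference, a hedged call-site sketch (assumes the rt::Tensor objects are already constructed with the shapes and dtypes listed above, and that batchSize, numTokens, and stream are in scope):

size_t wsBytes = trt_edgellm::kernel::getEagleAcceptWorkspaceSize(batchSize, numTokens);
void* workspace = nullptr;
cudaMalloc(&workspace, wsBytes);

trt_edgellm::kernel::eagleAccept(
    logits, tokenIds, attentionMask,
    acceptedTokenIds, acceptedLogitsIndices, acceptLength,
    std::nullopt, // full vocabulary in use: no mapping table needed
    workspace, wsBytes, stream);

cudaStreamSynchronize(stream); // ensure the kernel has finished before freeing
cudaFree(workspace);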