EAGLE Draft Engine Runner#
-
class EagleDraftEngineRunner#
Eagle Draft Engine Runner class for speculative decoding.
Public Functions
- EagleDraftEngineRunner(
- std::filesystem::path const &enginePath,
- std::filesystem::path const &configPath,
- cudaStream_t stream
)#
Construct an Eagle Draft Engine Runner.
- Parameters:
enginePath – Path to the TensorRT engine file
configPath – Path to the configuration JSON file
stream – CUDA stream for initialization
-
~EagleDraftEngineRunner()#
Destructor.
-
rt::Tensor &getRopeCosSinCacheTensor()#
Get internal RoPE cosine/sine cache tensor for the eagle draft engine.
- Returns:
Reference to the RoPE cosine/sine cache tensor
-
rt::LinearKVCache &getLinearKVCache()#
Get internal linear KV cache for the eagle draft engine.
- Returns:
Reference to the linear KV cache
-
EagleDraftEngineRunnerConfig getDraftEngineConfig() const#
Get the draft engine configuration.
- Returns:
The draft engine configuration structure
- bool executeEaglePrefillStep(
- rt::Tensor const &inputIds,
- rt::Tensor const &baseModelHiddenStates,
- rt::Tensor const &draftModelHiddenStates,
- rt::Tensor const &contextLengths,
- rt::OptionalInputTensor multimodalEmbeddings,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- rt::Tensor const &baseRopeCosSinCache,
- cudaStream_t stream
)#
API entry to execute the prefill step for the eagle draft engine.
By definition, EAGLE operates at the feature level with the formulation f_{n+1} = F_proj(f_n, token_{n+1}): the feature for step n+1 is predicted from the feature of step n and the token sampled at step n+1. The API takes hidden states from the base model and token_ids [1 ~ N] as input, and outputs the logits and (draft) hidden states for the “last entry”, to be used in the following draft proposal step. Multi-batch is supported: each batch can have a different actual sequence length (with padding).
- Parameters:
inputIds – [GPU, Int32] Input token_ids for the draft model with shape [batch_size, N_padded], denoting the token_ids [1 ~ N]
baseModelHiddenStates – [GPU, Float16] Hidden states input from the base model with shape [batch_size, N_padded, base-Hidden-Dim], denoting the hidden states corresponding to token_ids [1 ~ N-1]
draftModelHiddenStates – [GPU, Float16] The input [batch_size, N_padded, draft-Hidden-input-dim] is unused in the prefill step, but it is required by the engine execution. The input shall be set to all zeros to ensure correctness
contextLengths – [CPU, Int32] The actual sequence length for each batch with shape [batch_size] (including the +1 token from base prefill)
multimodalEmbeddings – [GPU] Optional. The multimodal embeddings
outputLogits – [GPU, Float16] The output logits with shape [batch_size, draft-Vocab-Size]
outputHiddenStates – [GPU] The output hidden states with shape [batch_size, draft-Hidden-Dim]
stream – The CUDA stream to execute the prefill step
- Returns:
True if execution was successful, false otherwise
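As a host-side illustration of the inputIds/contextLengths layout described above, the sketch below pads ragged per-sequence token ids to a common N_padded and records each sequence's real length. The `PaddedBatch` struct, `padForPrefill` helper, and pad-token choice are hypothetical, not part of the runner API.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct PaddedBatch {
    std::vector<int32_t> inputIds;       // row-major [batch_size, nPadded]
    std::vector<int32_t> contextLengths; // actual length per batch entry
    int32_t nPadded = 0;
};

// Pad ragged token sequences into a rectangular [batch_size, nPadded] layout,
// mirroring the inputIds / contextLengths inputs of executeEaglePrefillStep.
// (Illustrative helper only; the real inputs live in GPU rt::Tensor objects.)
PaddedBatch padForPrefill(std::vector<std::vector<int32_t>> const &sequences,
                          int32_t padToken = 0)
{
    PaddedBatch out;
    for (auto const &seq : sequences)
    {
        out.nPadded = std::max<int32_t>(out.nPadded, static_cast<int32_t>(seq.size()));
    }
    for (auto const &seq : sequences)
    {
        out.contextLengths.push_back(static_cast<int32_t>(seq.size()));
        out.inputIds.insert(out.inputIds.end(), seq.begin(), seq.end());
        // Fill the remainder of this row with the pad token.
        out.inputIds.insert(out.inputIds.end(),
                            static_cast<size_t>(out.nPadded) - seq.size(), padToken);
    }
    return out;
}
```

The padded buffer would then be uploaded to the GPU inputIds tensor, while contextLengths stays on the CPU, matching the placements documented above.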
- bool executeEagleDraftProposalStep(
- rt::Tensor const &draftTreeInputIds,
- rt::Tensor const &baseModelHiddenStates,
- rt::Tensor const &draftModelHiddenStates,
- rt::Tensor const &draftTreeLength,
- rt::Tensor const &draftTreeMask,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- cudaStream_t stream
)#
API entry to execute the draft proposal step for the eagle draft engine.
The API takes a draft tree of input token_ids and hidden states from the draft model. draftTreeMask denotes the relationships between the draft tree nodes, and draftTreeLength denotes the “real” length of the draft tree. To use CUDA graphs efficiently and reduce implementation complexity, the input length is padded to accommodate the maximum draft tree size.
Note
The API automatically collects the “last” topK logits and hidden states, counting from the tail of the “real” draft tree. The caller shall specify the topK parameter through the output tensor dimensions. This API will NOT “commit” the KVCache during execution.
- Parameters:
draftTreeInputIds – [GPU, Int32] Input token_ids for the draft model with shape [1, padded-draft-Tree-Size]
baseModelHiddenStates – [GPU, Float16] The input [1, padded-draft-Tree-Size, base-Hidden-Dim] is unused in the draft proposal step, but it is required by the engine execution. The input shall be set to all zeros to ensure correctness
draftModelHiddenStates – [GPU, Float16] Hidden states input from the draft model with shape [1, padded-draft-Tree-Size, draft-Hidden-Dim], denoting the hidden states corresponding to token_ids [1 ~ draft-Tree-Size]
draftTreeLength – [GPU, Int32] The “real” length of the draft tree, with shape [1]
draftTreeMask – [GPU, Int32] The relationships between the draft tree nodes, with shape [1, padded-draft-Tree-Size, padded-draft-Tree-Size]
outputLogits – [GPU, Float16] The output logits with shape [topK, draft-Vocab-Size]
outputHiddenStates – [GPU] The output hidden states with shape [topK, draft-Hidden-Dim]
stream – The CUDA stream to execute the draft proposal step
- Returns:
True if execution was successful, false otherwise
- bool executeEagleAcceptDecodeTokenStep(
- rt::Tensor const &acceptedTokens,
- rt::Tensor const &baseModelHiddenStates,
- rt::Tensor const &draftModelHiddenStates,
- rt::Tensor const &acceptedTokenNums,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- cudaStream_t stream
)#
API entry for the eagle draft model to accept the “committed” token from the base model.
The functionality is similar to the prefill step, except that this API operates on the previously committed KVCache. Output logits and hidden states are collected from the last accepted token.
Note
This API will “commit” the KVCache for the accepted tokens.
- Parameters:
acceptedTokens – [GPU, Int32] The accepted tokens with shape [batch_size, N_accepted_padded] where N_accepted_padded is the maximum accepted length across all batches
baseModelHiddenStates – [GPU, Float16] Hidden states input from base model with shape [batch_size, N_accepted_padded, base-Hidden-Dim]
draftModelHiddenStates – [GPU, Float16] The input [batch_size, N_accepted_padded, draft-Hidden-Dim] is unused in the accept decode token step, but it is required by the engine execution. The input shall be set to all zeros to ensure correctness
acceptedTokenNums – [GPU, Int32] The actual number of accepted tokens for each batch with shape [batch_size], used to handle variable-length acceptance per sequence
outputLogits – [GPU, Float16] The output logits with shape [batch_size, draft-Vocab-Size]
outputHiddenStates – [GPU] The output hidden states with shape [batch_size, draft-Hidden-Dim]
stream – The CUDA stream to execute the accept decode token step
- Returns:
True if execution was successful, false otherwise
- bool captureEagleDraftProposalCudaGraph(
- rt::Tensor const &draftTreeInputIds,
- rt::Tensor const &baseModelHiddenStates,
- rt::Tensor const &draftModelHiddenStates,
- rt::Tensor const &draftTreeLength,
- rt::Tensor const &draftTreeMask,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- cudaStream_t stream
)#
API entry to capture the CUDA graph for the draft proposal step.
Capturing the graph once allows subsequent draft proposal steps to replay it with reduced kernel launch overhead; the same input and output tensors should be reused when the graph is replayed.
- Parameters:
draftTreeInputIds – [GPU, Int32] Input token_ids for the draft model with shape [1, padded-draft-Tree-Size]
baseModelHiddenStates – [GPU, Float16] The input [1, padded-draft-Tree-Size, base-Hidden-Dim] is unused in the draft proposal step, but it is required by the engine execution. The input shall be set to all zeros to ensure correctness
draftModelHiddenStates – [GPU, Float16] Hidden states input from the draft model with shape [1, padded-draft-Tree-Size, draft-Hidden-Dim], denoting the hidden states corresponding to token_ids [1 ~ draft-Tree-Size]
draftTreeLength – [GPU, Int32] The “real” length of the draft tree, with shape [1]
draftTreeMask – [GPU, Int32] The relationships between the draft tree nodes, with shape [1, padded-draft-Tree-Size, padded-draft-Tree-Size]
outputLogits – [GPU, Float16] The output logits with shape [topK, draft-Vocab-Size]
outputHiddenStates – [GPU] The output hidden states with shape [topK, draft-hidden-dim]
stream – The CUDA stream on which to capture the CUDA graph for the draft proposal step
- Returns:
True if the CUDA graph is captured successfully, false otherwise
- bool captureEagleAcceptDecodeTokenCudaGraph(
- rt::Tensor const &acceptedTokens,
- rt::Tensor const &baseModelHiddenStates,
- rt::Tensor const &draftModelHiddenStates,
- rt::Tensor const &acceptedTokenNums,
- rt::Tensor &outputLogits,
- rt::Tensor &outputHiddenStates,
- cudaStream_t stream
)#
API entry for capturing the CUDA graph for the accept decode token step.
The functionality is similar to the draft proposal step, except that this API operates on the previously committed KVCache. Output logits and hidden states are collected from the last accepted token.
- Parameters:
acceptedTokens – [GPU, Int32] The accepted tokens with shape [batch_size, N_accepted_padded] where N_accepted_padded is the maximum accepted length across all batches
baseModelHiddenStates – [GPU, Float16] Hidden states input from base model with shape [batch_size, N_accepted_padded, base-Hidden-Dim]
draftModelHiddenStates – [GPU, Float16] The input [batch_size, N_accepted_padded, draft-Hidden-Dim] is unused in the accept decode token step, but it is required by the engine execution. The input shall be set to all zeros to ensure correctness
acceptedTokenNums – [GPU, Int32] The actual number of accepted tokens for each batch with shape [batch_size], used to handle variable-length acceptance per sequence
outputLogits – [GPU, Float16] The output logits with shape [batch_size, draft-Vocab-Size]
outputHiddenStates – [GPU] The output hidden states with shape [batch_size, draft-Hidden-Dim]
stream – The CUDA stream on which to capture the CUDA graph for the accept decode token step
- Returns:
True if the CUDA graph is captured successfully, false otherwise
-
struct EagleDraftEngineRunnerConfig#
Configuration structure for the Eagle Draft Engine Runner.
Public Members
-
RopeConfig ropeConfig = {}#
RoPE configuration.
-
int32_t numDecoderLayers = {}#
Number of decoder layers in the draft model.
-
int32_t numKVHeads = {}#
Number of key-value heads.
-
int32_t headDim = {}#
Dimension of each attention head.
-
int32_t rotaryDim = {}#
Dimension of rotary positional encoding.
-
int32_t maxSupportedBatchSize = {}#
Maximum supported batch size.
-
int32_t maxSupportedInputLength = {}#
Maximum supported input length.
-
int32_t maxKVCacheCapacity = {}#
Maximum KV cache capacity.
-
int32_t draftModelVocabSize = {}#
Vocabulary size of the draft model.
-
int32_t maxDraftTreeSize = {}#
Maximum size of the draft tree.
-
int32_t baseModelHiddenDim = {}#
Hidden dimension of the base model.
-
int32_t draftModelHiddenDim = {}#
Hidden dimension of the draft model.
-
bool isVlm = {false}#
Flag indicating if this is a vision-language model.