LLM Inference Spec Decode Runtime#

class LLMInferenceSpecDecodeRuntime#

LLM inference runtime with Eagle speculative decoding.

Manages inference pipeline using Eagle speculative decoding for improved throughput. Coordinates base model, draft model, and multimodal processing.

Public Functions

LLMInferenceSpecDecodeRuntime(
std::string const &engineDir,
std::string const &multimodalEngineDir,
EagleDraftingConfig const &draftingConfig,
cudaStream_t stream
)#

Construct speculative decode runtime.

Parameters:
  • engineDir – Directory containing engine files

  • multimodalEngineDir – Directory containing multimodal engine files

  • draftingConfig – Eagle drafting configuration

  • stream – CUDA stream for operations

~LLMInferenceSpecDecodeRuntime() = default#

Destructor.

bool captureDecodingCudaGraph(cudaStream_t stream)#

Capture CUDA graphs for Eagle decoding stages to optimize performance.

Captures graphs for draft proposal, draft accept token, base verification, and base vanilla decoding across all supported batch sizes.

Note

If capture fails for any stage, inference can still proceed without CUDA graphs, but at the cost of degraded decoding performance.

Parameters:
  • stream – CUDA stream

Returns:

True if all stage captures succeed, false otherwise

bool handleRequest(
LLMGenerationRequest const &request,
LLMGenerationResponse &response,
cudaStream_t stream
)#

Handle generation request.

Parameters:
  • request – Generation request with prompts and parameters

  • response – Output response with generated tokens and text

  • stream – CUDA stream

Returns:

True on success, false on failure

inline metrics::LLMPrefillMetrics const &getPrefillMetrics() const#

Get LLM prefill stage metrics.

inline metrics::EagleGenerationMetrics const &getEagleGenerationMetrics() const#

Get Eagle generation stage metrics.

inline metrics::MultimodalMetrics getMultimodalMetrics() const#

Get multimodal metrics (returns empty metrics if no multimodal runner is configured).

struct BatchResult#

Batch result data for a single sequence.

Encapsulates all data needed to track a batch’s execution results, whether it’s active or evicted. Groups related fields together for better cache locality and maintainability.

Public Members

std::vector<int32_t> tokenIds#

Generated token IDs.

std::vector<int32_t> rawBatchedInputIds#

Original input token IDs.

int32_t generateLength = {0}#

Number of tokens generated.

int32_t actualIterations = {0}#

Number of iterations executed.

int32_t effectivePrefillLength = {0}#

Effective prefill length (excluding reused KVCache length)

struct SpecDecodeInferenceContext#

Execution context for speculative decode runtime.

Holds execution information and intermediate metadata during inference. Supports multi-batch inference with independent sequence tracking.

Public Functions

void initialize(
int32_t batchSize,
int32_t maxGenLength,
rt::OptionalInputTensor const &multimodal,
rt::OptionalInputTensors const &deepstackFeatures,
cudaStream_t cudaStream
)#

Initialize the context with given parameters.

Parameters:
  • batchSize – Active batch size

  • maxGenLength – Maximum generation length

  • multimodal – Optional multimodal embeddings

  • deepstackFeatures – Deepstack features for Qwen3-VL (raw features before embedding)

  • cudaStream – CUDA stream for operations

Public Members

std::vector<std::string> systemPrompts#

System prompts for each sequence in batch.

std::vector<std::vector<int32_t>> rawBatchedInputIds#

Original token IDs before preprocessing (preprocessing adds padding and removes reused system-prompt IDs)

std::vector<std::vector<int32_t>> tokenIds#

Token IDs for each sequence: [batch_size][seq_length].

std::vector<int32_t> currentGenerateLengths#

Current generation length for each sequence: [batch_size].

std::vector<int32_t> effectivePrefillLengths#

Effective prefill length (excluding reused KVCache length) [batch_size].

std::vector<int8_t> finishedStates#

Finished state for each sequence: [batch_size] (0=not finished, 1=finished)

std::unordered_map<int32_t, BatchResult> completedBatches#

Results of completed batches (unified storage)

std::vector<int32_t> batchIndexMapping#

Maps current batch index to original index.

rt::OptionalInputTensor multimodalEmbeddings#

Optional multimodal embeddings.

rt::OptionalInputTensors deepstackFeatures#

Deepstack features for Qwen3-VL (raw features before embedding)

int32_t generationRound#

Current generation round (shared across all batches)

int32_t maxGenerateLength#

Maximum generation length.

int32_t activeBatchSize#

Current active batch size.

cudaStream_t stream#

CUDA stream.

struct EagleDraftingConfig#

Drafting configuration for Eagle speculative decoding.

Configuration parameters that control the drafting and verification stages of Eagle speculative decoding.

Public Members

int32_t draftingTopK#

Number of tokens to select from each predecessor node for the next draft-tree level.

int32_t draftingStep#

Number of drafting steps with draft model.

int32_t verifyTreeSize#

Number of tokens for base model verification.