LLM Inference Spec Decode Runtime#
-
class LLMInferenceSpecDecodeRuntime#
LLM inference runtime with Eagle speculative decoding.
Manages the inference pipeline using Eagle speculative decoding for improved throughput. Coordinates the base model, draft model, and multimodal processing.
Public Functions
LLMInferenceSpecDecodeRuntime(
    std::string const &engineDir,
    std::string const &multimodalEngineDir,
    EagleDraftingConfig const &draftingConfig,
    cudaStream_t stream
)#
Construct speculative decode runtime.
- Parameters:
engineDir – Directory containing engine files
multimodalEngineDir – Directory containing multimodal engine files
draftingConfig – Eagle drafting configuration
stream – CUDA stream for operations
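A minimal construction sketch based on the signature above. The directory paths are placeholders, and the way `EagleDraftingConfig` is populated is an assumption, since its fields are not detailed in this section.

```cpp
#include <cuda_runtime.h>

cudaStream_t stream;
cudaStreamCreate(&stream);

EagleDraftingConfig draftingConfig;  // populate per your engine build (fields not shown here)

LLMInferenceSpecDecodeRuntime runtime(
    "/path/to/engine",             // engineDir: base and draft engine files
    "/path/to/multimodal_engine",  // multimodalEngineDir: multimodal engine files
    draftingConfig,
    stream);
```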
-
~LLMInferenceSpecDecodeRuntime() = default#
Destructor.
-
bool captureDecodingCudaGraph(cudaStream_t stream)#
Capture CUDA graphs for Eagle decoding stages to optimize performance.
Captures graphs for draft proposal, draft accept token, base verification, and base vanilla decoding across all supported batch sizes.
Note
If capture fails for any stage, inference can still proceed without CUDA graphs, at the cost of degraded performance.
- Parameters:
stream – CUDA stream
- Returns:
True if all stage captures succeed, false otherwise
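Because a failed capture only degrades performance rather than breaking inference, the return value need not be treated as fatal. A sketch of that pattern (assuming a `runtime` and `stream` constructed as above):

```cpp
#include <cstdio>

// Graph capture is an optimization: on failure, the decoding stages
// simply run eagerly instead of replaying captured CUDA graphs.
if (!runtime.captureDecodingCudaGraph(stream)) {
    std::fprintf(stderr,
                 "CUDA graph capture failed; continuing without graphs\n");
}
```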
-
bool handleRequest(
    LLMGenerationRequest const &request,
    LLMGenerationResponse &response,
    cudaStream_t stream
)#
Handle generation request.
- Parameters:
request – Generation request with prompts and parameters
response – Output response with generated tokens and text
stream – CUDA stream
- Returns:
True on success, false on failure
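A sketch of one request/response round trip. The fields of `LLMGenerationRequest` and `LLMGenerationResponse` are not spelled out in this section, so their population and inspection are elided.

```cpp
LLMGenerationRequest request;    // fill with prompts and sampling parameters
LLMGenerationResponse response;  // receives generated tokens and text

if (runtime.handleRequest(request, response, stream)) {
    // Success: inspect response; per-stage timings are available via
    // getPrefillMetrics() and getEagleGenerationMetrics().
} else {
    // Failure: the contents of response are unspecified.
}
```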
-
inline metrics::LLMPrefillMetrics const &getPrefillMetrics() const#
Get LLM prefill stage metrics.
-
inline metrics::EagleGenerationMetrics const &getEagleGenerationMetrics() const#
Get Eagle generation stage metrics.
-
inline metrics::MultimodalMetrics getMultimodalMetrics() const#
Get multimodal metrics (returns empty metrics if no multimodal runner is present).
-
struct BatchResult#
Batch result data for a single sequence.
Encapsulates all data needed to track a batch’s execution results, whether it’s active or evicted. Groups related fields together for better cache locality and maintainability.
Public Members
-
std::vector<int32_t> tokenIds#
Generated token IDs.
-
std::vector<int32_t> rawBatchedInputIds#
Original input token IDs.
-
int32_t generateLength = {0}#
Number of tokens generated.
-
int32_t actualIterations = {0}#
Number of iterations executed.
-
int32_t effectivePrefillLength = {0}#
Effective prefill length (excluding reused KVCache length)
-
struct SpecDecodeInferenceContext#
Execution context for speculative decode runtime.
Holds execution information and intermediate metadata during inference. Supports multi-batch inference with independent sequence tracking.
Public Functions
void initialize(
    int32_t batchSize,
    int32_t maxGenLength,
    rt::OptionalInputTensor const &multimodal,
    rt::OptionalInputTensors const &deepstackFeatures,
    cudaStream_t cudaStream
)#
Initialize the context with given parameters.
- Parameters:
batchSize – Active batch size
maxGenLength – Maximum generation length
multimodal – Optional multimodal embeddings
deepstackFeatures – Deepstack features for Qwen3-VL (raw features before embedding)
cudaStream – CUDA stream for operations
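A sketch of initializing the context for a text-only batch. Default-constructed empties for the optional inputs are an assumption here, since the semantics of `rt::OptionalInputTensor` are not detailed in this section.

```cpp
SpecDecodeInferenceContext ctx;

rt::OptionalInputTensor multimodal;          // empty: no image embeddings
rt::OptionalInputTensors deepstackFeatures;  // empty: not a Qwen3-VL run

// Initialize for 2 concurrent sequences, up to 256 generated tokens each.
ctx.initialize(/*batchSize=*/2, /*maxGenLength=*/256,
               multimodal, deepstackFeatures, stream);
```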
Public Members
-
std::vector<std::string> systemPrompts#
System prompts for each sequence in batch.
-
std::vector<std::vector<int32_t>> rawBatchedInputIds#
Original token IDs before preprocessing (preprocessing adds padding and removes reused system-prompt IDs)
-
std::vector<std::vector<int32_t>> tokenIds#
Token IDs for each sequence: [batch_size][seq_length].
-
std::vector<int32_t> currentGenerateLengths#
Current generation length for each sequence: [batch_size].
-
std::vector<int32_t> effectivePrefillLengths#
Effective prefill length (excluding reused KVCache length) [batch_size].
-
std::vector<int8_t> finishedStates#
Finished state for each sequence: [batch_size] (0=not finished, 1=finished)
-
std::unordered_map<int32_t, BatchResult> completedBatches#
Results of completed batches (unified storage)
-
std::vector<int32_t> batchIndexMapping#
Maps current batch index to original index.
-
rt::OptionalInputTensors deepstackFeatures#
Deepstack features for Qwen3-VL (raw features before embedding)
-
int32_t generationRound#
Current generation round (shared across all batches)
-
int32_t maxGenerateLength#
Maximum generation length.
-
int32_t activeBatchSize#
Current active batch size.
-
cudaStream_t stream#
CUDA stream.
-
struct EagleDraftingConfig#
Drafting configuration for Eagle speculative decoding.
Configuration parameters to drive Eagle spec-decoding.