LLM Inference Runtime
- class LLMInferenceRuntime
Unified LLM inference runtime with optional speculative decoding.
Manages the inference pipeline for vanilla and speculative decoding modes (EAGLE, MTP, etc.). When constructed without a drafting config, it operates as a pure vanilla decoding runtime with zero draft-model memory overhead. Coordinates the base model, an optional draft model, and multimodal processing (vision + audio).
Public Functions
- LLMInferenceRuntime(
- std::string const &engineDir,
- std::string const &multimodalEngineDir,
- std::unordered_map<std::string, std::string> const &loraWeightsMap,
- SpecDecodeDraftingConfig const &draftingConfig,
- cudaStream_t stream
Construct runtime with speculative decoding.
- Parameters:
engineDir – Directory containing engine files
multimodalEngineDir – Directory containing multimodal engine files
loraWeightsMap – Map of LoRA weight names to file paths
draftingConfig – Speculative decoding drafting configuration
stream – CUDA stream for operations
- Throws:
std::runtime_error – if directories do not contain expected data, or runner initialization fails
- LLMInferenceRuntime(
- std::string const &engineDir,
- std::string const &multimodalEngineDir,
- std::unordered_map<std::string, std::string> const &loraWeightsMap,
- cudaStream_t stream
Construct runtime for vanilla-only decoding (no draft model).
- Parameters:
engineDir – Directory containing engine files
multimodalEngineDir – Directory containing multimodal engine files
loraWeightsMap – Map of LoRA weight names to file paths
stream – CUDA stream for operations
- Throws:
std::runtime_error – if directories do not contain expected data, or runner initialization fails
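The two constructors differ only in whether a `draftingConfig` is passed. A minimal sketch of a helper that selects the overload from an optional config is shown here. `CtorMockRuntime`, `MockDraftingConfig`, `MockStream`, `LoraMap`, and `makeRuntime` are hypothetical names introduced so the example compiles without the library or CUDA headers; with the real headers, the template arguments would be `LLMInferenceRuntime`, `SpecDecodeDraftingConfig`, and `cudaStream_t`. The mock assumes that `hasDraftModel()` reflects which constructor was used.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>

using LoraMap = std::unordered_map<std::string, std::string>;

// Hypothetical stand-ins; not part of the library.
struct MockDraftingConfig {};
using MockStream = void*;

// Mirrors only the two documented constructor shapes.
struct CtorMockRuntime {
    bool draftLoaded;
    CtorMockRuntime(std::string const&, std::string const&, LoraMap const&,
                    MockDraftingConfig const&, MockStream)
        : draftLoaded(true) {}
    CtorMockRuntime(std::string const&, std::string const&, LoraMap const&, MockStream)
        : draftLoaded(false) {}
    bool hasDraftModel() const noexcept { return draftLoaded; }
};

// Picks the speculative-decoding overload only when a drafting config is given.
// With the real class, either constructor may throw std::runtime_error.
template <typename Runtime, typename DraftingConfig, typename Stream>
std::unique_ptr<Runtime> makeRuntime(std::string const& engineDir,
                                     std::string const& multimodalEngineDir,
                                     LoraMap const& loraWeightsMap,
                                     std::optional<DraftingConfig> const& draftingConfig,
                                     Stream stream) {
    // Sketch-level sanity check only; not a documented precondition.
    assert(!engineDir.empty());
    if (draftingConfig.has_value()) {
        return std::make_unique<Runtime>(engineDir, multimodalEngineDir,
                                         loraWeightsMap, *draftingConfig, stream);
    }
    return std::make_unique<Runtime>(engineDir, multimodalEngineDir,
                                     loraWeightsMap, stream);
}
```

Because `std::nullopt` cannot drive template deduction, callers of this sketch spell out the template arguments explicitly.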
- ~LLMInferenceRuntime() noexcept = default
Destructor.
- bool captureDecodingCUDAGraph(cudaStream_t stream)
Capture CUDA graphs for decoding stages to optimize performance.
When a draft model is present, captures graphs for draft proposal, draft accept token, base verification, and base vanilla decoding. Without a draft model, captures only vanilla decoding graphs.
Note
If capture fails for any stage, inference can still proceed without CUDA graph capture, at the cost of reduced performance.
- Parameters:
stream – CUDA stream
- Throws:
std::runtime_error – if a tensor reshape operation fails
- Returns:
True if all stage captures succeed, false otherwise
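Since a `false` return is not fatal, a caller can record the outcome and continue. The following is a minimal sketch of that pattern; `CaptureMockRuntime`, `GraphStatus`, and `prepareDecodingGraphs` are hypothetical names, and the stream is a plain `void*` so the example builds without CUDA headers.

```cpp
#include <cassert>
#include <iostream>

// Hypothetical stand-in exposing only the documented method shape.
struct CaptureMockRuntime {
    bool allStagesCapture;  // what the mock's capture call will report
    bool captureDecodingCUDAGraph(void* /*stream*/) { return allStagesCapture; }
};

enum class GraphStatus { kAllStagesCaptured, kIncomplete };

// Attempts capture once and reports the outcome instead of aborting.
// A std::runtime_error (documented for reshape failures) is left to propagate.
template <typename Runtime, typename Stream>
GraphStatus prepareDecodingGraphs(Runtime& runtime, Stream stream) {
    if (runtime.captureDecodingCUDAGraph(stream)) {
        return GraphStatus::kAllStagesCaptured;
    }
    std::cerr << "CUDA graph capture incomplete; decoding continues with reduced performance\n";
    return GraphStatus::kIncomplete;
}
```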
- bool handleRequest(
- LLMGenerationRequest const &request,
- LLMGenerationResponse &response,
- cudaStream_t stream,
- bool outputThinkerEmbeddings = false
Handle generation request.
- Parameters:
request – Generation request with prompts and parameters
response – Output response with generated tokens and text
stream – CUDA stream
- Throws:
std::runtime_error – if an LLM or CUDA operation fails
- Returns:
True on success, false on failure
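`handleRequest` reports failure through two channels: a `false` return value and a `std::runtime_error`. A caller that wants a single result can fold both into one outcome, as in this minimal sketch. `MockRequest`, `MockResponse`, `RequestMockRuntime`, `RequestOutcome`, and `runRequest` are hypothetical; the real request and response types have fields not shown here.

```cpp
#include <cassert>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the request and response types.
struct MockRequest { std::string prompt; };
struct MockResponse { std::string text; };

// Mock whose behavior is chosen by the test; mirrors the documented signature.
struct RequestMockRuntime {
    enum class Mode { kSucceed, kReturnFalse, kThrow } mode;
    bool handleRequest(MockRequest const& request, MockResponse& response,
                       void* /*stream*/, bool /*outputThinkerEmbeddings*/ = false) {
        if (mode == Mode::kThrow) throw std::runtime_error("simulated CUDA failure");
        if (mode == Mode::kReturnFalse) return false;
        response.text = "echo: " + request.prompt;
        return true;
    }
};

enum class RequestOutcome { kOk, kFailed, kRuntimeError };

// Folds the boolean result and the documented exception into one outcome.
template <typename Runtime, typename Request, typename Response, typename Stream>
RequestOutcome runRequest(Runtime& runtime, Request const& request,
                          Response& response, Stream stream) {
    try {
        return runtime.handleRequest(request, response, stream)
                   ? RequestOutcome::kOk
                   : RequestOutcome::kFailed;
    } catch (std::runtime_error const& error) {
        std::cerr << "handleRequest threw: " << error.what() << '\n';
        return RequestOutcome::kRuntimeError;
    }
}
```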
- bool genAndSaveSystemPromptKVCache(
- std::string const &prompt,
- std::string const &loraWeightsName,
- cudaStream_t stream
Generate and save the system prompt KV cache (public API matching the standard runtime signature).
- Parameters:
prompt – The system prompt for which to generate the KV cache
loraWeightsName – The name of the LoRA weights
stream – The CUDA stream used for the generation
- Throws:
std::runtime_error – if a CUDA operation fails
- Returns:
True if the KV cache is generated and saved successfully, false otherwise
- void setActionNoiseSeed(int32_t seed) noexcept
Set the random seed used when initializing the action diffusion noise trajectory.
- Parameters:
seed – Random seed value; has no effect if no action runner is loaded
- inline metrics::LLMPrefillMetrics const &getPrefillMetrics(
Get LLM prefill stage metrics.
- inline metrics::SpecDecodeGenerationMetrics const &getSpecDecodeGenerationMetrics(
Get speculative decoding generation stage metrics (only meaningful when a draft model is present).
- inline char const *getSpeculativeDecodingStrategyName(
- inline metrics::LLMGenerationMetrics const &getGenerationMetrics(
Get vanilla generation stage metrics (only meaningful on the vanilla path, with no draft model).
- inline metrics::MultimodalMetrics getMultimodalMetrics(
Get multimodal metrics (returns empty metrics if there is no multimodal runner).
- inline rt::Tensor const &getEmbeddingTable() const
Get the embedding table (for the Talker streaming pipeline).
- inline rt::Tensor const *getBaseModelHiddenStates(
- int32_t layerIdx
Get a base model hidden-states buffer for the requested layer index.
Buffers are owned by the runtime and reused across requests. Layer 0 corresponds to the post-multimodal input embeddings (backed up before the decode loop reshapes them); other layer indices correspond to engine-output hidden states (e.g. acceptHiddenLayer for the Qwen3-Omni Talker, or future MTP layers).
Lifetime contract:
Buffers are sized to {maxRuntimeBatchSize, maxSupportedInputLength, hiddenSize}.
Contents are cleared (overwritten) at the start of each handleRequest() call and remain valid until the next handleRequest() begins. For the most recent request, the buffer is reshaped to {activeBatchSize, prefillLength, hiddenSize}; use getBaseModelPrefillLength() to query the valid prefill length.
The caller is responsible for consuming the data within that window.
- Parameters:
layerIdx – Layer index. 0 = input embeddings (post-multimodal); other indices are model-specific (e.g. acceptHiddenLayer for Qwen3-Omni Talker).
- Returns:
Pointer to the buffer, or nullptr if no buffer is registered for that layer.
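The lifetime contract above implies two checks before reading a buffer: the returned pointer may be `nullptr`, and the prefill length may be 0. A minimal sketch of a guard that performs both checks and then hands the tensor to a caller-supplied function follows. `MockTensor`, `HiddenStatesMockRuntime`, and `consumeHiddenStates` are hypothetical names; no accessor of `rt::Tensor` is assumed, so the consumer decides how to read the data.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct MockTensor {};  // hypothetical stand-in for rt::Tensor

// Mock exposing only the two documented getters used below.
struct HiddenStatesMockRuntime {
    std::map<std::int32_t, MockTensor> buffers;  // layers with a registered buffer
    std::int32_t prefillLength = 0;

    MockTensor const* getBaseModelHiddenStates(std::int32_t layerIdx) const {
        auto const it = buffers.find(layerIdx);
        return it == buffers.end() ? nullptr : &it->second;
    }
    std::int32_t getBaseModelPrefillLength() const noexcept { return prefillLength; }
};

// Calls consume(tensor, prefillLength) only when a buffer exists and holds
// valid prefill tokens. Must run before the next handleRequest() begins.
template <typename Runtime, typename Consumer>
bool consumeHiddenStates(Runtime const& runtime, std::int32_t layerIdx,
                         Consumer&& consume) {
    auto const* hidden = runtime.getBaseModelHiddenStates(layerIdx);
    if (hidden == nullptr) {
        return false;  // no buffer registered for this layer
    }
    std::int32_t const prefillLength = runtime.getBaseModelPrefillLength();
    assert(prefillLength >= 0);
    if (prefillLength == 0) {
        return false;  // no hidden-states output was requested
    }
    consume(*hidden, prefillLength);
    return true;
}
```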
- inline int32_t getBaseModelPrefillLength() const noexcept
Number of valid prefill tokens in the hidden-states buffers from the most recent handleRequest() call. Returns 0 if no hidden-states output was requested.
- inline std::vector<std::vector<int32_t>> const &getBaseModelInputTokenIds(
Per-batch input token IDs from the most recent handleRequest() call. Cleared at the start of each handleRequest(); valid until the next one begins.
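Because the returned reference is only valid until the next `handleRequest()` begins, code that needs the token IDs later has to copy them first. A minimal sketch of such a snapshot follows; `TokenBatches`, `TokenMockRuntime`, `beginNextRequest`, and `snapshotInputTokenIds` are hypothetical names, and the mock only imitates the documented clearing behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using TokenBatches = std::vector<std::vector<std::int32_t>>;

// Hypothetical mock: stores token IDs and clears them like a new request would.
struct TokenMockRuntime {
    TokenBatches inputTokenIds;
    TokenBatches const& getBaseModelInputTokenIds() const { return inputTokenIds; }
    void beginNextRequest() { inputTokenIds.clear(); }
};

// Returns an owned copy, so the result outlives the runtime's internal storage.
template <typename Runtime>
TokenBatches snapshotInputTokenIds(Runtime const& runtime) {
    return runtime.getBaseModelInputTokenIds();  // copy-constructs from the const reference
}
```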
- inline bool hasDraftModel() const noexcept
Check whether a draft model is loaded and speculative decoding is available.