LLM Inference Runtime

class LLMInferenceRuntime

Unified LLM inference runtime with optional speculative decoding.

Manages the inference pipeline for vanilla and speculative decoding modes (EAGLE, MTP, etc.). When constructed without a drafting config, it operates as a pure vanilla decoding runtime with zero draft-model memory overhead. Coordinates the base model, the optional draft model, and multimodal processing (vision and audio).

Public Functions

LLMInferenceRuntime(
std::string const &engineDir,
std::string const &multimodalEngineDir,
std::unordered_map<std::string, std::string> const &loraWeightsMap,
SpecDecodeDraftingConfig const &draftingConfig,
cudaStream_t stream
)

Construct runtime with speculative decoding.

Parameters:
  • engineDir – Directory containing engine files

  • multimodalEngineDir – Directory containing multimodal engine files

  • loraWeightsMap – Map of LoRA weight names to file paths

  • draftingConfig – Speculative decoding drafting configuration

  • stream – CUDA stream for operations

Throws:

std::runtime_error – if directories do not contain expected data, or runner initialization fails

LLMInferenceRuntime(
std::string const &engineDir,
std::string const &multimodalEngineDir,
std::unordered_map<std::string, std::string> const &loraWeightsMap,
cudaStream_t stream
)

Construct runtime for vanilla-only decoding (no draft model).

Parameters:
  • engineDir – Directory containing engine files

  • multimodalEngineDir – Directory containing multimodal engine files

  • loraWeightsMap – Map of LoRA weight names to file paths

  • stream – CUDA stream for operations

Throws:

std::runtime_error – if directories do not contain expected data, or runner initialization fails
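The choice between the two constructors can be made at a single call site. The sketch below is a hypothetical helper, not part of the documented API: `makeRuntime`, `LoraWeightsMap`, and the template parameters are names introduced here for illustration. It is templated so that it compiles without the real headers; with the real types, `Runtime` would presumably be `LLMInferenceRuntime`, `DraftingConfig` would be `SpecDecodeDraftingConfig`, and `Stream` would be `cudaStream_t`.

```cpp
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>

// Alias for the LoRA map type used by both constructors (name is an assumption).
using LoraWeightsMap = std::unordered_map<std::string, std::string>;

// Hypothetical helper: picks the speculative or the vanilla constructor
// depending on whether a drafting config was supplied.
template <typename Runtime, typename DraftingConfig, typename Stream>
std::unique_ptr<Runtime> makeRuntime(std::string const& engineDir,
                                     std::string const& multimodalEngineDir,
                                     LoraWeightsMap const& loraWeightsMap,
                                     std::optional<DraftingConfig> const& draftingConfig,
                                     Stream stream)
{
    if (draftingConfig.has_value())
    {
        // Five-argument overload: base model plus draft model.
        return std::make_unique<Runtime>(
            engineDir, multimodalEngineDir, loraWeightsMap, *draftingConfig, stream);
    }
    // Four-argument overload: vanilla decoding, no draft model is loaded.
    return std::make_unique<Runtime>(engineDir, multimodalEngineDir, loraWeightsMap, stream);
}
```

Both constructors are documented to throw std::runtime_error, so a caller of such a helper would still need to handle that exception.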

~LLMInferenceRuntime() noexcept = default

Destructor.

bool captureDecodingCUDAGraph(cudaStream_t stream)

Capture CUDA graphs for decoding stages to optimize performance.

When a draft model is present, captures graphs for draft proposal, draft accept token, base verification, and base vanilla decoding. Without a draft model, captures only vanilla decoding graphs.

Note

If capture fails for any stage, inference can still proceed without CUDA graphs, at the cost of reduced performance.

Parameters:

stream – CUDA stream

Throws:

std::runtime_error – if a tensor reshape operation fails

Returns:

True if all stage captures succeed, false otherwise
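Because a failed capture only costs performance, a caller may treat a false return as a warning rather than an error. The following is a minimal sketch under that reading; `tryCaptureDecodingGraphs` is a hypothetical helper, and the template parameters stand in for `LLMInferenceRuntime` and `cudaStream_t` so the snippet compiles on its own.

```cpp
#include <iostream>

// Hypothetical helper: attempts CUDA graph capture once at startup and
// reports whether decoding will run with captured graphs.
template <typename Runtime, typename Stream>
bool tryCaptureDecodingGraphs(Runtime& runtime, Stream stream)
{
    // A false return is not fatal: inference can still proceed without
    // captured graphs, only more slowly.
    bool const captured = runtime.captureDecodingCUDAGraph(stream);
    if (!captured)
    {
        std::cerr << "CUDA graph capture incomplete; continuing without it\n";
    }
    // std::runtime_error (tensor reshape failure) is deliberately not caught
    // here, so the caller decides how to handle it.
    return captured;
}
```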

bool handleRequest(
LLMGenerationRequest const &request,
LLMGenerationResponse &response,
cudaStream_t stream,
bool outputThinkerEmbeddings = false
)

Handle generation request.

Parameters:
  • request – Generation request with prompts and parameters

  • response – Output response with generated tokens and text

  • stream – CUDA stream

Throws:

std::runtime_error – if an LLM or CUDA operation fails

Returns:

True on success, false on failure
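handleRequest() reports failure through two channels: a false return value and a thrown std::runtime_error. As an illustration only, a caller could fold both into one result. `runRequest` below is a hypothetical wrapper, and the template parameters stand in for the real runtime, request, response, and stream types.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical wrapper: runs one request and folds both failure channels
// (a false return and a thrown std::runtime_error) into an error message.
// Returns std::nullopt on success.
template <typename Runtime, typename Request, typename Response, typename Stream>
std::optional<std::string> runRequest(Runtime& runtime,
                                      Request const& request,
                                      Response& response,
                                      Stream stream)
{
    try
    {
        // outputThinkerEmbeddings is left at its default (false).
        if (!runtime.handleRequest(request, response, stream))
        {
            return std::string{"handleRequest returned false"};
        }
    }
    catch (std::runtime_error const& e)
    {
        return std::string{"handleRequest threw: "} + e.what();
    }
    return std::nullopt;
}
```

Whether the runtime remains usable after an exception is not stated here, so a production caller may prefer to let the exception propagate.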

bool genAndSaveSystemPromptKVCache(
std::string const &prompt,
std::string const &loraWeightsName,
cudaStream_t stream
)

Generate and save the system prompt KV cache. This public API matches the standard runtime signature.

Parameters:
  • prompt – The system prompt for which the KV cache is generated

  • loraWeightsName – The name of the LoRA weights

  • stream – The CUDA stream used for the generation

Throws:

std::runtime_error – if a CUDA operation fails

Returns:

True if the KV cache is generated and saved successfully, false otherwise
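A typical use is to pre-generate caches at startup. The sketch below assumes the call may be repeated for several (prompt, LoRA weights name) pairs; whether the runtime keeps more than one cached prompt at a time is not stated here. `warmSystemPromptCaches` and `PromptEntry` are hypothetical names, and the template parameters stand in for `LLMInferenceRuntime` and `cudaStream_t`.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// (system prompt, LoRA weights name); the alias name is an assumption.
using PromptEntry = std::pair<std::string, std::string>;

// Hypothetical helper: generates and saves the system prompt KV cache for
// each entry and returns how many calls reported success.
template <typename Runtime, typename Stream>
std::size_t warmSystemPromptCaches(Runtime& runtime,
                                   std::vector<PromptEntry> const& entries,
                                   Stream stream)
{
    std::size_t saved = 0;
    for (PromptEntry const& entry : entries)
    {
        std::string const& prompt = entry.first;
        std::string const& loraWeightsName = entry.second;
        // A false return is counted as a miss; std::runtime_error propagates.
        if (runtime.genAndSaveSystemPromptKVCache(prompt, loraWeightsName, stream))
        {
            ++saved;
        }
    }
    return saved;
}
```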

void setActionNoiseSeed(int32_t seed) noexcept

Set the random seed used when initializing the action diffusion noise trajectory.

Parameters:

seed – Random seed value; has no effect if no action runner is loaded

inline metrics::LLMPrefillMetrics const &getPrefillMetrics(
) const noexcept

Get LLM prefill stage metrics.

inline metrics::SpecDecodeGenerationMetrics const &getSpecDecodeGenerationMetrics(
) const noexcept

Get speculative decoding generation stage metrics (only meaningful when a draft model is present).

inline char const *getSpeculativeDecodingStrategyName(
) const noexcept

Get the speculative decoding strategy name.

inline metrics::LLMGenerationMetrics const &getGenerationMetrics(
) const noexcept

Get vanilla generation stage metrics (only meaningful when no draft model / vanilla path)

inline metrics::MultimodalMetrics getMultimodalMetrics(
) const noexcept

Get multimodal metrics (returns empty metrics if there is no multimodal runner).

inline rt::Tensor const &getEmbeddingTable() const

Get the embedding table (for the Talker streaming pipeline).

inline rt::Tensor const *getBaseModelHiddenStates(
int32_t layerIdx
) const noexcept

Get a base model hidden-states buffer for the requested layer index.

Buffers are owned by the runtime and reused across requests. Layer 0 corresponds to the post-multimodal input embeddings (backed up before the decode loop reshapes them); other layer indices correspond to engine-output hidden states (e.g. acceptHiddenLayer for the Qwen3-Omni Talker, or future MTP layers).

Lifetime contract:

  • Buffers are sized to {maxRuntimeBatchSize, maxSupportedInputLength, hiddenSize}.

  • Contents are cleared (overwritten) at the start of each handleRequest() call and remain valid until the next handleRequest() begins. The buffer is reshaped to {activeBatchSize, prefillLength, hiddenSize} for the most recent request; use getBaseModelPrefillLength() to query the valid prefill length.

  • The caller is responsible for consuming the data within that window.

Parameters:

layerIdx – Layer index. 0 = input embeddings (post-multimodal); other indices are model-specific (e.g. acceptHiddenLayer for Qwen3-Omni Talker).

Returns:

Pointer to the buffer, or nullptr if no buffer is registered for that layer.
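Since the accessor returns nullptr for an unregistered layer, callers need a null check before dereferencing. The sketch below shows one way to centralize that check; `requireHiddenStates` is a hypothetical helper, and `Runtime` stands in for `LLMInferenceRuntime` so the snippet compiles without the real headers.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: fetches the hidden-states buffer for a layer and
// fails loudly when none is registered, instead of handing back nullptr.
template <typename Runtime>
auto const& requireHiddenStates(Runtime const& runtime, std::int32_t layerIdx)
{
    auto const* buffer = runtime.getBaseModelHiddenStates(layerIdx);
    if (buffer == nullptr)
    {
        throw std::out_of_range(
            "no hidden-states buffer registered for layer " + std::to_string(layerIdx));
    }
    // The reference aliases runtime-owned memory: consume or copy the data
    // before the next handleRequest() call overwrites it.
    return *buffer;
}
```

Returning a reference keeps the helper cheap, but it does not extend the buffer's validity window; the lifetime contract above still applies.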

inline int32_t getBaseModelPrefillLength() const noexcept

Number of valid prefill tokens in the hidden-states buffers from the most recent handleRequest() call. Returns 0 if no hidden-states output was requested.

inline std::vector<std::vector<int32_t>> const &getBaseModelInputTokenIds(
) const noexcept

Per-batch input token IDs from the most recent handleRequest() call. Cleared at the start of each handleRequest(); valid until the next one begins.
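The returned reference is only valid until the next handleRequest() begins, so a caller that needs the IDs later has to copy them. A minimal sketch, with `snapshotInputTokenIds` and `TokenIdBatch` as hypothetical names and `Runtime` standing in for `LLMInferenceRuntime`:

```cpp
#include <cstdint>
#include <vector>

// One vector of token IDs per batch entry (alias name is an assumption).
using TokenIdBatch = std::vector<std::vector<std::int32_t>>;

// Hypothetical helper: deep-copies the per-batch input token IDs so they
// outlive the next handleRequest() call, which clears the runtime's copy.
template <typename Runtime>
TokenIdBatch snapshotInputTokenIds(Runtime const& runtime)
{
    // Returning by value copy-constructs from the const reference.
    return runtime.getBaseModelInputTokenIds();
}
```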

inline bool hasDraftModel() const noexcept

Check whether a draft model is loaded and speculative decoding is available.
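hasDraftModel() can be used to decide which generation metrics to read. The sketch below makes a simplifying assumption that is not guaranteed by this reference: that a runtime with a draft model reports through the speculative metrics and one without it reports through the vanilla metrics. `reportGenerationMetrics` and the callback parameters are hypothetical, and `Runtime` stands in for `LLMInferenceRuntime`.

```cpp
// Hypothetical helper: dispatches to the metrics accessor that is expected
// to be meaningful for this runtime and passes the result to a callback.
template <typename Runtime, typename OnSpecDecode, typename OnVanilla>
void reportGenerationMetrics(Runtime const& runtime,
                             OnSpecDecode onSpecDecode,
                             OnVanilla onVanilla)
{
    if (runtime.hasDraftModel())
    {
        // Assumption: requests on this runtime took the speculative path.
        onSpecDecode(runtime.getSpecDecodeGenerationMetrics());
    }
    else
    {
        onVanilla(runtime.getGenerationMetrics());
    }
}
```

Using callbacks avoids assuming anything about the fields of the two metrics types, which are not described here.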