LLM Inference Runtime

class LLMInferenceRuntime

LLM Inference Runtime for handling generation requests.

Public Functions

LLMInferenceRuntime(std::string const &engineDir, std::string const &multimodalEngineDir, std::unordered_map<std::string, std::string> const &loraWeightsMap, cudaStream_t stream)

Construct an LLM Inference Runtime.

Parameters:

engineDir – Directory containing the LLM engine
multimodalEngineDir – Directory containing the multimodal engine
loraWeightsMap – Map from each LoRA weights name to its path
stream – CUDA stream for initialization

~LLMInferenceRuntime() = default: Destructor.

bool handleRequest(LLMGenerationRequest const &request, LLMGenerationResponse &response, cudaStream_t stream)

Handle an LLM generation request.

Parameters:

request – The generation request containing prompt and generation parameters
response – The generation response to be filled with output
stream – CUDA stream for execution

Returns:

True if the request was handled successfully, false otherwise
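The bool-plus-out-parameter convention above means a caller should not read `response` unless the call returned true. The following is a minimal sketch of one way to enforce that, not the library's own pattern. `FakeRequest`, `FakeResponse`, `FakeStream`, `FakeRuntime`, and `tryHandle` are all hypothetical stand-ins introduced so the example compiles without the real headers; in particular, the `prompt` and `text` members are assumptions, since the layouts of `LLMGenerationRequest` and `LLMGenerationResponse` are not described in this section.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-ins; the real request/response members are not documented here.
struct FakeRequest { std::string prompt; };
struct FakeResponse { std::string text; };
using FakeStream = void*; // placeholder for cudaStream_t

// Stand-in runtime that mirrors only the shape of the documented handleRequest signature.
struct FakeRuntime {
    bool handleRequest(FakeRequest const& request, FakeResponse& response, FakeStream /*stream*/) {
        if (request.prompt.empty()) {
            return false; // simulate a request that could not be handled
        }
        response.text = "echo: " + request.prompt;
        return true;
    }
};

// Wraps the bool return and out-parameter in std::optional, so a response
// is only reachable when handleRequest reported success.
template <typename Response, typename Runtime, typename Request, typename Stream>
std::optional<Response> tryHandle(Runtime& runtime, Request const& request, Stream stream) {
    Response response{};
    if (!runtime.handleRequest(request, response, stream)) {
        return std::nullopt;
    }
    return response;
}
```

Because `tryHandle` is a template, the same wrapper would presumably work with the real `LLMInferenceRuntime` types, for example `tryHandle<LLMGenerationResponse>(runtime, request, stream)`, though that has not been checked against the actual headers.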

bool captureDecodingCUDAGraph(cudaStream_t stream)

Capture CUDA graph for the decoding step to optimize performance.

Parameters: stream – CUDA stream for graph capture
Returns: True if graph was captured successfully, false otherwise

bool genAndSaveSystemPromptKVCache(std::string const &prompt, std::string const &loraWeightsName, cudaStream_t stream)

Run the prefill step to generate the KV cache for the system prompt and save it for later reuse.

Parameters:

prompt – The system prompt for which to generate the KV cache
loraWeightsName – The name of the LoRA weights
stream – The CUDA stream used for the generation

Returns:

True if the KV cache is generated and saved successfully, false otherwise
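Both `captureDecodingCUDAGraph` and `genAndSaveSystemPromptKVCache` report failure through their return value, so a start-up routine can record each outcome separately. The sketch below assumes that both steps are optional optimizations whose failure need not abort start-up, and that they may be called in this order; neither assumption is confirmed by this section. `StubStream`, `StubRuntime`, `WarmupResult`, and `warmUp` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

using StubStream = void*; // placeholder for cudaStream_t

// Stand-in runtime mirroring the two documented signatures; its behavior is invented.
struct StubRuntime {
    bool graphOk{true};
    std::vector<std::string> cachedPrompts;

    bool captureDecodingCUDAGraph(StubStream /*stream*/) { return graphOk; }

    bool genAndSaveSystemPromptKVCache(std::string const& prompt,
                                       std::string const& loraWeightsName,
                                       StubStream /*stream*/) {
        if (prompt.empty()) {
            return false; // simulate a failed prefill
        }
        cachedPrompts.push_back(loraWeightsName + "|" + prompt);
        return true;
    }
};

// Outcome of each warm-up step, kept separate so callers can log or react per step.
struct WarmupResult {
    bool graphCaptured;
    bool promptCached;
};

// Attempts both warm-up steps and records each result instead of stopping at the first failure.
template <typename Runtime, typename Stream>
WarmupResult warmUp(Runtime& runtime, std::string const& systemPrompt,
                    std::string const& loraWeightsName, Stream stream) {
    WarmupResult result{};
    result.graphCaptured = runtime.captureDecodingCUDAGraph(stream);
    result.promptCached = runtime.genAndSaveSystemPromptKVCache(systemPrompt, loraWeightsName, stream);
    return result;
}
```

Keeping the two flags separate lets an application decide, for instance, to continue serving without a captured graph while still logging the failure.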

inline metrics::LLMPrefillMetrics const &getPrefillMetrics() const

Get LLM prefill stage metrics.

Returns: Reference to prefill metrics

inline metrics::LLMGenerationMetrics const &getGenerationMetrics() const

Get LLM generation stage metrics.

Returns: Reference to generation metrics

inline metrics::MultimodalMetrics getMultimodalMetrics() const

Get multimodal metrics.

Returns: Multimodal metrics by value, or empty metrics if no multimodal runner is available

struct SystemPromptKVCache

Structure to hold cached system prompt and its KV cache.

Public Members

std::string systemPrompt: The system prompt text.

std::vector<tokenizer::Rank> tokenizedPrompt: Tokenized version of the system prompt.

rt::Tensor kvCacheContent: Cached KV cache content for the system prompt.
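One plausible use of the `tokenizedPrompt` member is to check whether an incoming request begins with the cached system prompt, in which case prefill could skip those tokens and reuse `kvCacheContent`. This section does not say how the runtime actually matches cache entries, so the sketch below is an illustration under that assumption. `Rank` (assumed here to be an integer type), `FakeTensor`, `SystemPromptKVCacheSketch`, and `reusablePrefixLength` are hypothetical stand-ins.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Rank = std::int32_t;             // stand-in for tokenizer::Rank (assumed integral)
using FakeTensor = std::vector<float>; // stand-in for rt::Tensor

// Mirrors the three documented members with stand-in types.
struct SystemPromptKVCacheSketch {
    std::string systemPrompt;
    std::vector<Rank> tokenizedPrompt;
    FakeTensor kvCacheContent;
};

// Returns the number of leading request tokens covered by the cached system prompt:
// the full cached length when the request starts with it, otherwise 0.
std::size_t reusablePrefixLength(SystemPromptKVCacheSketch const& cache,
                                 std::vector<Rank> const& requestTokens) {
    auto const& cached = cache.tokenizedPrompt;
    if (cached.empty() || cached.size() > requestTokens.size()) {
        return 0;
    }
    bool const matches = std::equal(cached.begin(), cached.end(), requestTokens.begin());
    return matches ? cached.size() : 0;
}
```

A nonzero result would indicate how many tokens prefill could skip. This sketch treats a partial match as no match, since the cached KV content covers the whole system prompt.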