Multimodal Runner#
-
class MultimodalRunner#
Base class for multimodal vision-language model runners.
Provides interface for vision encoder processing in VLMs. Subclasses implement specific VLM architectures (Qwen-VL, InternVL, etc.).
Subclassed by trt_edgellm::rt::InternViTRunner, trt_edgellm::rt::Phi4MMViTRunner, trt_edgellm::rt::QwenViTRunner
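Typical usage, sketched below, creates a runner through the factory, preprocesses a request, runs inference, and reads back the vision embeddings. The header path, fully qualified namespaces, and the setup of the request, tokenizer, and RoPE tensor are assumptions for illustration; only the MultimodalRunner calls are taken from this reference.

    // Sketch of the typical runner lifecycle (header path and setup code are assumed).
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <memory>
    #include <string>
    #include <vector>

    // #include "trt_edgellm/rt/multimodalRunner.h"  // assumed header location

    namespace rt = trt_edgellm::rt;

    bool encodeImages(std::string const &engineDir,
                      rt::LLMGenerationRequest const &request,
                      trt_edgellm::tokenizer::Tokenizer *tokenizer,
                      rt::Tensor &ropeRotaryCosSinDevice)
    {
        cudaStream_t stream{};
        cudaStreamCreate(&stream);

        // Factory detects the VLM architecture from the engine directory.
        std::unique_ptr<rt::MultimodalRunner> runner = rt::MultimodalRunner::create(
            engineDir, /*llmMaxBatchSize=*/1, /*llmMaxPositionEmbeddings=*/32768, stream);

        bool ok = (runner != nullptr);

        std::vector<std::vector<int32_t>> batchedInputIds;  // filled by preprocess()
        ok = ok && runner->preprocess(request, batchedInputIds, tokenizer,
                                      ropeRotaryCosSinDevice, stream);
        ok = ok && runner->infer(stream);

        if (ok)
        {
            // Vision embeddings to be consumed by the LLM prefill.
            rt::Tensor &visionEmbeddings = runner->getOutputEmbedding();
            (void)visionEmbeddings;
        }

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
        return ok;
    }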
Public Functions
-
MultimodalRunner() = default#
Default constructor.
-
MultimodalRunner(std::string const &engineDir, cudaStream_t stream)#
Construct multimodal runner.
- Parameters:
engineDir – Directory containing engine files
stream – CUDA stream for operations
-
virtual ~MultimodalRunner() = default#
Virtual destructor.
-
virtual bool preprocess(
    rt::LLMGenerationRequest const &request,
    std::vector<std::vector<int32_t>> &batchedInputIds,
    tokenizer::Tokenizer *tokenizer,
    rt::Tensor &ropeRotaryCosSinDevice,
    cudaStream_t stream
)#
Preprocess request with images and text.
- Parameters:
request – Generation request with prompts and images
batchedInputIds – Output batched input token IDs
tokenizer – Tokenizer instance
ropeRotaryCosSinDevice – RoPE cache tensor
stream – CUDA stream
- Returns:
True on success, false on failure
-
virtual bool preprocessSystemPrompt(
    std::string const &systemPrompt,
    tokenizer::Tokenizer *tokenizer,
    rt::Tensor &ropeRotaryCosSinDevice,
    cudaStream_t stream
)#
Preprocess the system prompt on its own, used for KV cache saving: tokenizes the system prompt and generates its ND-RoPE parameters.
- Parameters:
systemPrompt – System prompt text
tokenizer – Tokenizer instance
ropeRotaryCosSinDevice – RoPE cache tensor
stream – CUDA stream
- Returns:
True on success, false on failure
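For example, when pre-building a reusable KV cache for a fixed system prompt, the call might look like the following sketch; the prompt text is illustrative, and runner, tokenizer, ropeRotaryCosSinDevice, and stream are assumed to be set up as in the earlier lifecycle sketch.

    // Sketch: tokenize a fixed system prompt and generate its ND-RoPE parameters
    // up front so the corresponding KV cache can be saved and reused later.
    std::string const systemPrompt = "You are a helpful assistant.";  // illustrative text

    if (!runner->preprocessSystemPrompt(systemPrompt, tokenizer,
                                        ropeRotaryCosSinDevice, stream))
    {
        // Tokenization or RoPE-parameter generation failed.
    }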
-
virtual bool infer(cudaStream_t stream) = 0#
Run multimodal inference.
- Parameters:
stream – CUDA stream
- Returns:
True on success, false on failure
-
virtual rt::Tensor &getOutputEmbedding()#
Get output embeddings from vision encoder.
- Returns:
Reference to output embedding tensor
-
virtual rt::OptionalInputTensors getExtraVisualFeatures()#
Get extra visual features.
- Returns:
Optional input tensors vector (e.g. deepstack features for Qwen3-VL)
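Architectures that produce auxiliary visual features (such as the Qwen3-VL deepstack features) expose them through this accessor; others may return an empty result. A brief sketch, assuming OptionalInputTensors can be queried like a standard container:

    // Sketch: collect any extra visual features and hand them to the LLM inputs.
    // The container-style interface (empty()) on OptionalInputTensors is assumed.
    rt::OptionalInputTensors extraFeatures = runner->getExtraVisualFeatures();
    if (!extraFeatures.empty())
    {
        // Forward these tensors to the LLM runner alongside getOutputEmbedding().
    }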
-
virtual bool validateAndFillConfig(std::string const &engineDir) = 0#
Validate and fill configuration from file.
- Parameters:
engineDir – Path to engine directory
- Returns:
True on success, false on failure
-
virtual bool allocateBuffer(cudaStream_t stream) = 0#
Allocate device buffers.
- Parameters:
stream – CUDA stream
- Returns:
True on success, false on failure
-
inline virtual multimodal::ModelType getModelType() const#
Get model type.
- Returns:
Model type enum
-
inline metrics::MultimodalMetrics const &getMultimodalMetrics()#
Get multimodal processing metrics.
- Returns:
Multimodal metrics
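After infer() completes, the accumulated multimodal timings can be read back; since the individual fields of metrics::MultimodalMetrics are not listed here, the sketch only obtains the reference:

    // Sketch: read back multimodal processing metrics after inference.
    // The specific metric fields are not documented here and are left unreferenced.
    metrics::MultimodalMetrics const &mmMetrics = runner->getMultimodalMetrics();
    (void)mmMetrics;  // e.g. log or aggregate the recorded timings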
Public Static Functions
-
static std::unique_ptr<MultimodalRunner> create(
    std::string const &multimodalEngineDir,
    int32_t llmMaxBatchSize,
    int64_t llmMaxPositionEmbeddings,
    cudaStream_t stream
)#
Create appropriate multimodal runner instance.
Factory method that detects model type and creates corresponding runner.
- Parameters:
multimodalEngineDir – Directory containing multimodal engine files
llmMaxBatchSize – Maximum batch size from LLM engine
llmMaxPositionEmbeddings – Maximum position embeddings from LLM engine
stream – CUDA stream for operations
- Returns:
Unique pointer to created runner
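A minimal factory sketch, assuming the batch-size and position-embedding limits were read from the companion LLM engine's configuration; the literal values, the engine path, and the null-return check on failure are assumptions.

    // Sketch: create a runner whose limits match the LLM engine (values assumed).
    int32_t const llmMaxBatchSize = 4;               // taken from the LLM engine config
    int64_t const llmMaxPositionEmbeddings = 131072; // taken from the LLM engine config

    auto mmRunner = trt_edgellm::rt::MultimodalRunner::create(
        "/path/to/multimodal_engine_dir", llmMaxBatchSize, llmMaxPositionEmbeddings, stream);

    if (!mmRunner)
    {
        // Creation failed, e.g. the model type could not be detected (assumed behavior).
    }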