Multimodal Runner#

class MultimodalRunner#

Base class for multimodal vision-language model runners.

Provides the interface for vision-encoder processing in vision-language models. Subclasses implement specific VLM architectures (Qwen-VL, InternVL, etc.).

Subclassed by trt_edgellm::rt::InternViTRunner, trt_edgellm::rt::Phi4MMViTRunner, trt_edgellm::rt::QwenViTRunner
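A new VLM backend derives from MultimodalRunner and overrides the pure-virtual methods listed below. A minimal skeleton, based on the signatures in this reference (the class name and method bodies are illustrative, not part of the library):

```cpp
// Illustrative skeleton of a custom backend; real implementations
// (e.g. QwenViTRunner) hold engine handles, buffers, and config state.
class MyViTRunner : public MultimodalRunner
{
public:
    bool preprocess(rt::LLMGenerationRequest const &request,
                    std::vector<std::vector<int32_t>> &batchedInputIds,
                    tokenizer::Tokenizer *tokenizer,
                    rt::Tensor &ropeRotaryCosSinDevice,
                    cudaStream_t stream) override
    {
        // Tokenize prompts, prepare image inputs, fill batchedInputIds
        // and the RoPE cache; return false on any failure.
        return true;
    }

    bool infer(cudaStream_t stream) override
    {
        // Enqueue the vision-encoder engine on the given stream.
        return true;
    }

    bool validateAndFillConfig(std::string const &engineDir) override
    {
        // Read and validate the configuration found in engineDir.
        return true;
    }

    bool allocateBuffer(cudaStream_t stream) override
    {
        // Allocate device buffers sized from the validated config.
        return true;
    }
};
```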

Public Functions

MultimodalRunner() = default#

Default constructor.

MultimodalRunner(std::string const &engineDir, cudaStream_t stream)#

Construct multimodal runner.

Parameters:
  • engineDir – Directory containing engine files

  • stream – CUDA stream for operations

virtual ~MultimodalRunner() = default#

Virtual destructor.

virtual bool preprocess(
rt::LLMGenerationRequest const &request,
std::vector<std::vector<int32_t>> &batchedInputIds,
tokenizer::Tokenizer *tokenizer,
rt::Tensor &ropeRotaryCosSinDevice,
cudaStream_t stream
) = 0#

Preprocess request with images and text.

Parameters:
  • request – Generation request with prompts and images

  • batchedInputIds – Output batched input token IDs

  • tokenizer – Tokenizer instance

  • ropeRotaryCosSinDevice – RoPE cache tensor

  • stream – CUDA stream

Returns:

True on success, false on failure

virtual bool preprocessSystemPrompt(
std::string const &systemPrompt,
tokenizer::Tokenizer *tokenizer,
rt::Tensor &ropeRotaryCosSinDevice,
cudaStream_t stream
)#

Tokenizes the system prompt and generates its ND-RoPE parameters. Used for KV-cache saving, where the system prompt must be processed independently of a full generation request.

Parameters:
  • systemPrompt – System prompt text

  • tokenizer – Tokenizer instance

  • ropeRotaryCosSinDevice – RoPE cache tensor

  • stream – CUDA stream

Returns:

True on success, false on failure

virtual bool infer(cudaStream_t stream) = 0#

Run multimodal inference.

Parameters:

stream – CUDA stream

Returns:

True on success, false on failure

virtual rt::Tensor &getOutputEmbedding()#

Get output embeddings from vision encoder.

Returns:

Reference to output embedding tensor

virtual rt::OptionalInputTensors getExtraVisualFeatures()#

Get extra visual features.

Returns:

Optional input tensors vector (e.g. deepstack features for Qwen3-VL)

virtual bool validateAndFillConfig(std::string const &engineDir) = 0#

Validate and fill configuration from file.

Parameters:

engineDir – Path to engine directory

Returns:

True on success, false on failure

virtual bool allocateBuffer(cudaStream_t stream) = 0#

Allocate device buffers.

Parameters:

stream – CUDA stream

Returns:

True on success, false on failure

inline virtual multimodal::ModelType getModelType() const#

Get model type.

Returns:

Model type enum

inline metrics::MultimodalMetrics const &getMultimodalMetrics() const#

Get multimodal processing metrics.

Returns:

Multimodal metrics

Public Static Functions

static std::unique_ptr<MultimodalRunner> create(
std::string const &multimodalEngineDir,
int32_t llmMaxBatchSize,
int64_t llmMaxPositionEmbeddings,
cudaStream_t stream
)#

Create appropriate multimodal runner instance.

Factory method that detects model type and creates corresponding runner.

Parameters:
  • multimodalEngineDir – Directory containing multimodal engine files

  • llmMaxBatchSize – Maximum batch size from LLM engine

  • llmMaxPositionEmbeddings – Maximum position embeddings from LLM engine

  • stream – CUDA stream for operations

Returns:

Unique pointer to created runner
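A typical call sequence built from the signatures above. This is a hedged sketch: the engine path, batch and position limits, and the request/tokenizer/RoPE-tensor setup (`request`, `tokenizer`, `ropeRotaryCosSinDevice`) are assumed to be prepared by the caller and depend on the application; error handling is shown only via the documented bool returns.

```cpp
// Sketch: create a runner via the factory, then run the vision pipeline.
cudaStream_t stream;
cudaStreamCreate(&stream);

auto runner = MultimodalRunner::create(
    "/path/to/multimodal_engine",      // multimodalEngineDir (illustrative)
    /*llmMaxBatchSize=*/1,
    /*llmMaxPositionEmbeddings=*/32768,
    stream);

std::vector<std::vector<int32_t>> batchedInputIds;
if (!runner->preprocess(request, batchedInputIds, &tokenizer,
                        ropeRotaryCosSinDevice, stream)
    || !runner->infer(stream))
{
    // Both calls report failure via their bool return value.
}

// Vision embeddings to be consumed by the LLM engine.
rt::Tensor &visionEmbeddings = runner->getOutputEmbedding();
```

The factory detects the model type from the engine directory, so the caller never names a concrete subclass (InternViTRunner, QwenViTRunner, etc.) directly.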