Phi4mm ViT Runner#

class Phi4MMViTRunner : public trt_edgellm::rt::MultimodalRunner#

Runner for the Phi-4MM vision encoder.

This class handles:

  • Image preprocessing (HWC uint8 → normalized FP16 HWC on GPU)

  • Tiling to per-block CHW layout for the TRT visual engine

  • Running the visual engine to produce raw 256-per-block visual tokens

  • Batched HD postprocess to assemble sub/global grids with newline and GN tokens

  • Text preprocessing to expand image placeholders into a contiguous id range

Public Functions

Phi4MMViTRunner(std::string const &engineDir, cudaStream_t stream)#

Constructor for Phi4MMViTRunner.

Parameters:
  • engineDir[in] Directory containing the TensorRT engine files

  • stream[in] CUDA stream for execution

~Phi4MMViTRunner() = default#
virtual bool preprocess(
rt::LLMGenerationRequest const &request,
std::vector<std::vector<int32_t>> &batchedInputIds,
tokenizer::Tokenizer *tokenizer,
rt::Tensor &ropeRotaryCosSinDevice,
cudaStream_t stream
) override#

Preprocess multimodal input including images and text.

Parameters:
  • request[in] LLM generation request containing images and text

  • batchedInputIds[inout] Batched input token IDs after preprocessing

  • tokenizer[in] Tokenizer for text processing

  • ropeRotaryCosSinDevice[inout] RoPE rotary position encoding cache (unused for Phi-4MM)

  • stream[in] CUDA stream for execution

Returns:

True if preprocessing succeeded, false otherwise

virtual bool infer(cudaStream_t stream) override#

Run inference on the vision encoder and perform HD postprocess.

Parameters:
  • stream[in] CUDA stream for execution

Returns:

True if inference succeeded, false otherwise

virtual bool validateAndFillConfig(
std::string const &configPath
) override#

Validate and load configuration from JSON file.

Parameters:
  • configPath[in] Path to configuration file

Returns:

True if configuration is valid and loaded successfully, false otherwise
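The JSON schema itself is not documented here; the key names in the fragment below are guesses derived from the `Phi4MMViTConfig` field names (and the values are the documented defaults where one exists), so treat it as a shape sketch rather than a real config file:

```json
{
  "numChannels": 3,
  "imageTokenId": 200010,
  "imageMean": [0.5, 0.5, 0.5],
  "imageStd": [0.5, 0.5, 0.5],
  "blockDownsampleRatio": 28
}
```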

virtual bool allocateBuffer(cudaStream_t stream) override#

Allocate buffers for inference and postprocess.

Parameters:
  • stream[in] CUDA stream for execution

Returns:

True if allocation succeeded, false otherwise

struct Phi4MMViTConfig#

Configuration for Phi4MMViT vision encoder.

This configuration aggregates vision-tower-derived dimensions (number of blocks, channels, output hidden size), tokenizer-related settings for image token expansion, and the image normalization parameters used by the CUDA preprocess kernels.

Public Members

int32_t maxNumBlocks = {0}#

Maximum number of image blocks supported by engine.

int32_t minNumBlocks = {0}#

Minimum number of image blocks supported by engine.

int32_t numChannels = {3}#

Image channels (RGB = 3).

int32_t outHiddenSize = {0}#

Visual output hidden size (projection dimension).

int32_t imageTokenId = {200010}#

Placeholder token id in text to be expanded into image tokens.

int32_t vocabSize = {0}#

Base vocabulary size; image ids start from vocabSize.

std::array<float, 3> imageMean = {{0.5F, 0.5F, 0.5F}}#

Mean per channel used in normalize: (val/255 - mean)/std.

std::array<float, 3> imageStd = {{0.5F, 0.5F, 0.5F}}#

Std per channel used in normalize.

int32_t minImageTokensPerImage = {0}#

Minimum visual tokens per image (for resize/grid selection).

int32_t maxImageTokensPerImage = {0}#

Maximum visual tokens per image (for resize/grid selection).

int32_t blockImageSizeH = {0}#

Block image height (crop size).

int32_t blockImageSizeW = {0}#

Block image width (crop size).

int32_t blockDownsampleRatio = {28}#

Block downsample ratio.

int32_t tokensPerSide = {0}#

Number of visual tokens per side of a block.