Image Util Kernels#

struct Phi4MMIndex#

Device-side index and size metadata for Phi-4MM HD packing. Fields:

  • hBlocks/wBlocks [numImages]: per-image grid sizes (hb = H/blockImageSizeH, wb = W/blockImageSizeW)

  • srcGlbStart [numImages]: starting raw-token offset for the tokensPerSide x tokensPerSide global grid of image i

  • srcSubStart [numImages]: starting raw-token offset for sub-grid tokens of image i

  • dstOutStart [numImages]: starting packed-token offset in dst for image i

  • subOutLen [numImages]: sub-segment token count per image (includes one newline token per row)

  • numImages: batch size

  • hidden: embedding length

  • totalOutTokens: total tokens to be written across all images

Public Members

int32_t const *hBlocks#
int32_t const *wBlocks#
int64_t const *srcGlbStart#
int64_t const *srcSubStart#
int64_t const *dstOutStart#
int64_t const *subOutLen#
int32_t numImages#
int32_t hidden#
int64_t totalOutTokens#
struct Phi4MMGN#

Grid Newline (GN) and separator embeddings for Phi-4MM.

  • subGN [hidden] FP16: newline token vector inserted at the end of each sub-grid row

  • glbGN [hidden] FP16: single separator token placed between sub and global segments

Public Members

half const *subGN#
half const *glbGN#
void trt_edgellm::kernel::normalizeImage(
rt::Tensor const &originalImage,
rt::Tensor const &mean,
rt::Tensor const &std,
rt::Tensor &normalizedImage,
cudaStream_t stream
)#

The kernel normalizes image data and converts it to half precision.

Inputs:

  • originalImage [GPU, UInt8]: [batch, height, width, channels]

  • mean [GPU, Float]: [channels]

  • std [GPU, Float]: [channels]

  • stream: CUDA stream for execution

Outputs:

  • normalizedImage [GPU, Half]: [batch, height, width, channels]

Throws:

std::runtime_error – if image has invalid shape, data type or location

void trt_edgellm::kernel::transposeToPatchQwenViT(
rt::Tensor const &originalImage,
rt::Tensor &inputPatches,
int64_t const inputOffset,
int64_t const temporalPatchSize,
int64_t const patchSize,
int64_t const mergeSize,
cudaStream_t stream
)#

The kernel transposes image data to the patch format for the Qwen2-VL and Qwen2.5-VL ViT. The transpose corresponds to the image-to-patch rearrangement in huggingface/transformers.

Inputs:

  • originalImage [GPU, Half]: Current image [T, height, width, channels]

  • inputOffset: Offset into the input patches, denoting the start index of the current image

  • temporalPatchSize: Temporal patch size for the vision transformer

  • patchSize: Patch size for the vision transformer

  • mergeSize: Merge size for the vision transformer

  • stream: CUDA stream for execution

Outputs:

  • inputPatches [GPU, Half]: Total ViT input tensor of all images [totalSeqLength, inputDim], where curSeqLength = gridT * gridH * gridW * mergeSize * mergeSize, totalSeqLength = sum(curSeqLength) over all images, and inputDim = channels * temporalPatchSize * patchSize * patchSize

Throws:

std::runtime_error – if image has invalid shape, data type or location
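The shape bookkeeping above can be sketched as follows. The grid definitions (gridT = T / temporalPatchSize, and gridH/gridW as height/width divided by patchSize * mergeSize) are assumptions chosen to be consistent with the stated curSeqLength formula, not quoted from the library.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the Qwen ViT patch-size arithmetic from the docs above.
// The grid definitions are assumptions consistent with the stated
// curSeqLength and inputDim formulas.
struct QwenPatchDims { int64_t curSeqLength; int64_t inputDim; };

QwenPatchDims qwenPatchDims(int64_t T, int64_t H, int64_t W, int64_t channels,
                            int64_t temporalPatchSize, int64_t patchSize,
                            int64_t mergeSize) {
    int64_t gridT = T / temporalPatchSize;
    int64_t gridH = H / (patchSize * mergeSize);
    int64_t gridW = W / (patchSize * mergeSize);
    return {gridT * gridH * gridW * mergeSize * mergeSize,
            channels * temporalPatchSize * patchSize * patchSize};
}
```

For example, a 2x56x56 RGB input with temporalPatchSize = 2, patchSize = 14, mergeSize = 2 yields 16 patch tokens of dimension 1176.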

void trt_edgellm::kernel::transposeToPatchInternVLPhi4MM(
rt::Tensor const &originalImage,
rt::Tensor &inputPatches,
int64_t const inputOffset,
cudaStream_t stream
)#

The kernel transposes image data to the patch format for the InternVL and Phi-4MM ViTs.

Inputs:

  • originalImage [GPU, Half]: Current image [1, height, width, channels]

  • inputOffset: Offset into the input patches, denoting the start index of the current image

  • stream: CUDA stream for execution

Outputs:

  • inputPatches [GPU, Half]: Total ViT input tensor of all images [totalNumBlocks, channels, blockSizeH, blockSizeW], where curNumBlocks = blockH * blockW and totalNumBlocks = sum(curNumBlocks) over all images

Throws:

std::runtime_error – if image has invalid shape, data type or location
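The block bookkeeping can be sketched similarly, assuming blockH = height / blockSizeH and blockW = width / blockSizeW (consistent with the hBlocks/wBlocks definition in Phi4MMIndex, though not stated verbatim here).

```cpp
#include <cassert>
#include <cstdint>

// Sketch: number of output blocks contributed by one image, assuming
// blockH/blockW are the image size divided by the block size.
int64_t numBlocksForImage(int64_t height, int64_t width,
                          int64_t blockSizeH, int64_t blockSizeW) {
    int64_t blockH = height / blockSizeH;
    int64_t blockW = width / blockSizeW;
    return blockH * blockW;  // curNumBlocks for this image
}
```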

void trt_edgellm::kernel::phi4mmPostprocessVisionTokens(
rt::Tensor const &srcEmbedding,
rt::Tensor &dstEmbedding,
Phi4MMIndex const &indices,
Phi4MMGN const &gn,
int64_t totalOutTokens,
cudaStream_t stream
)#

Constructs the Phi-4MM HD image token sequence for a batch by gathering from raw ViT tokens and inserting Grid Newline (GN) separators.

Inputs:

  • src: [numViTTokens, hidden] FP16. Raw ViT tokens for all images (global + sub), concatenated across images.

  • dst: [totalOutTokens, hidden] FP16. Output buffer for the packed HD sequence.

  • idx: Phi4MMIndex (device indices and sizes)

  • gn: Phi4MMGN (newline and separator embeddings)

Output layout per image (contiguous in dst):

  1. Sub segment: rows = tokensPerSide*hb, cols = tokensPerSide*wb, strideOut = cols+1; the last column is subGN (newline). Non-newline positions gather from src via (srcSubStart + blockId*256 + patchId).

  2. One glb_GN token (glbGN).

  3. Global segment: 16x16 grid with strideOut = 17; the last column is subGN; other positions gather from srcGlbStart.

Launch config:

  • gridDim.x = idx.totalOutTokens, blockDim.x = 128

  • Each CUDA block writes one output token vector; threads cooperate to copy idx.hidden elements.

Throws:

std::runtime_error – invalid tensor shape, location or data type
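The sub-segment gather rule in step 1 above can be sketched as a host-side address calculation. It assumes tokensPerSide = 16 and row-major ordering of blocks and of patches within a block, which is what the blockId*256 + patchId formula suggests.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: map an output (row, col) inside the sub segment to either the
// subGN newline or a raw src token offset. Assumes tokensPerSide = 16
// and row-major block/patch ordering. Returns -1 for a newline (subGN)
// position.
int64_t subSegmentSrcOffset(int64_t row, int64_t col, int64_t wb,
                            int64_t srcSubStart) {
    int64_t cols = 16 * wb;
    if (col == cols) return -1;  // last column of the row: subGN newline
    int64_t blockId = (row / 16) * wb + (col / 16);
    int64_t patchId = (row % 16) * 16 + (col % 16);
    return srcSubStart + blockId * 256 + patchId;
}
```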

void trt_edgellm::kernel::initRotaryPosEmbQwenViT(
rt::Tensor &rotaryPosEmb,
std::vector<int64_t> const &gridTHW,
int64_t const mergeSize,
int64_t const startIdx,
float const rotaryBaseFrequency,
float const scale,
cudaStream_t stream
)#

The kernel initializes the rotary position embeddings for the Qwen2.5-VL ViT.

Inputs:

  • gridTHW: Image grid dimensions [T, H, W] (Temporal, Height, Width)

  • mergeSize: Merge size for the vision transformer

  • startIdx: Start index for the current image

  • rotaryBaseFrequency: Rotary base frequency

  • scale: Scale for the rotary position embeddings

  • stream: CUDA stream for execution

Outputs:

  • rotaryPosEmb [GPU, Float]: Rotary position embeddings tensor [totalSeqLength, vitPosEmbDim]

Throws:

std::runtime_error – if image has invalid shape, data type or location
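For reference, the standard rotary frequency computation looks like the sketch below. This assumes the kernel follows the usual formulation inv_freq[i] = base^(-2i/dim); exactly where the scale parameter enters is not specified above, so it is left out here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: standard rotary inverse frequencies. Assumes the conventional
// formulation; the role of the `scale` parameter in the actual kernel
// is not specified in the docs above.
std::vector<float> rotaryInvFreq(int dim, float baseFrequency) {
    std::vector<float> invFreq(dim / 2);
    for (int i = 0; i < dim / 2; ++i)
        invFreq[i] = std::pow(baseFrequency, -2.0f * static_cast<float>(i) / dim);
    return invFreq;
}
```

The rotation angle for a token at position p and frequency index i is then p * invFreq[i].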

void trt_edgellm::kernel::initFastPosEmbedQwenViT(
rt::Tensor &fastPosEmbedIdx,
rt::Tensor &fastPosEmbedWeight,
std::vector<int64_t> const &gridTHW,
int64_t const mergeSize,
int64_t const numGridPerSide,
int64_t const startIdx,
cudaStream_t stream
)#

The kernel initializes the fast position embeddings for the Qwen3-VL ViT.

Inputs:

  • gridTHW: Image grid dimensions [T, H, W] (only H and W are used)

  • mergeSize: Merge size for the vision transformer

  • numGridPerSide: Number of grids per side for the vision transformer

  • startIdx: Start index for the image

  • stream: CUDA stream for execution

Outputs:

  • fastPosEmbedIdx [GPU, Int64]: Fast position embedding index tensor [4, totalSeqLength]

  • fastPosEmbedWeight [GPU, Half]: Fast position embedding weight tensor [4, totalSeqLength]

Throws:

std::runtime_error – if image has invalid shape, data type or location
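The [4, totalSeqLength] shapes suggest bilinear interpolation over the numGridPerSide x numGridPerSide position grid: four neighbor indices plus four weights per output token. That reading is an assumption based only on the tensor shapes; a sketch of such a neighbor/weight computation:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch: four neighbor indices and bilinear weights for sampling an
// n x n grid at fractional coordinates (y, x). That the kernel's
// [4, totalSeqLength] outputs hold such neighbors/weights is an
// assumption suggested by the shapes, not stated in the docs.
struct BilinearSample {
    std::array<int64_t, 4> idx;
    std::array<float, 4> weight;
};

BilinearSample bilinearAt(float y, float x, int64_t n) {
    int64_t y0 = static_cast<int64_t>(std::floor(y));
    int64_t x0 = static_cast<int64_t>(std::floor(x));
    int64_t y1 = std::min(y0 + 1, n - 1);  // clamp to the grid edge
    int64_t x1 = std::min(x0 + 1, n - 1);
    float fy = y - static_cast<float>(y0);
    float fx = x - static_cast<float>(x0);
    return {{y0 * n + x0, y0 * n + x1, y1 * n + x0, y1 * n + x1},
            {(1 - fy) * (1 - fx), (1 - fy) * fx, fy * (1 - fx), fy * fx}};
}
```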