Image Util Kernels#

struct Phi4MMIndex#

Device-side index and size metadata for Phi-4MM HD packing. Fields:

  • hBlocks/wBlocks [numImages]: per-image grid sizes (hb = H/blockImageSizeH, wb = W/blockImageSizeW)

  • srcGlbStart [numImages]: starting raw-token offset for the tokensPerSide x tokensPerSide global grid of image i

  • srcSubStart [numImages]: starting raw-token offset for sub-grid tokens of image i

  • dstOutStart [numImages]: starting packed-token offset in dst for image i

  • subOutLen [numImages]: sub-segment token count per image (includes one newline token per row)

  • numImages: batch size

  • hidden: embedding length

  • totalOutTokens: total tokens to be written across all images

Public Members

int32_t const *hBlocks#
int32_t const *wBlocks#
int64_t const *srcGlbStart#
int64_t const *srcSubStart#
int64_t const *dstOutStart#
int64_t const *subOutLen#
int32_t numImages#
int32_t hidden#
int64_t totalOutTokens#
struct Phi4MMGN#

Grid Newline (GN) and separator embeddings for Phi-4MM.

  • subGN [hidden] FP16: newline token vector inserted at the end of each sub-grid row

  • glbGN [hidden] FP16: single separator token placed between sub and global segments

Public Members

half const *subGN#
half const *glbGN#
void trt_edgellm::kernel::normalizeImage(
rt::Tensor const &originalImage,
rt::Tensor const &mean,
rt::Tensor const &std,
rt::Tensor &normalizedImage,
cudaStream_t stream
)#

The kernel normalizes image data and converts it to half precision.

Inputs:

  • originalImage [GPU, UInt8]: [batch, height, width, channels]

  • mean [GPU, Float]: [channels]

  • std [GPU, Float]: [channels]

  • stream: CUDA stream for execution

Outputs:

  • normalizedImage [GPU, Half]: [batch, height, width, channels]

Throws:

std::runtime_error – if image has invalid shape, data type or location

void trt_edgellm::kernel::transposeToPatchQwenViT(
rt::Tensor const &originalImage,
rt::Tensor &inputPatches,
int64_t const inputOffset,
int64_t const temporalPatchSize,
int64_t const patchSize,
int64_t const mergeSize,
cudaStream_t stream
)#

The kernel transposes image data to the patch format for the Qwen2-VL and Qwen2.5-VL ViT. The transpose corresponds to the image-to-patch rearrangement in huggingface/transformers.

Inputs:

  • originalImage [GPU, Half]: Current image [T, height, width, channels]

  • inputOffset: Offset into the input patches, denoting the start index of the current image

  • temporalPatchSize: Temporal patch size for the vision transformer

  • patchSize: Patch size for the vision transformer

  • mergeSize: Merge size for the vision transformer

  • stream: CUDA stream for execution

Outputs:

  • inputPatches [GPU, Half]: Total ViT input tensor of all images [totalSeqLength, inputDim], where curSeqLength = gridT * gridH * gridW * mergeSize * mergeSize, totalSeqLength = sum(curSeqLength) over all images, and inputDim = channels * temporalPatchSize * patchSize * patchSize

Throws:

std::runtime_error – if image has invalid shape, data type or location
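The shape bookkeeping above can be sketched as follows. The grid definitions (gridT = T / temporalPatchSize, and gridH/gridW as height/width divided by patchSize * mergeSize) are assumptions chosen to be consistent with the stated curSeqLength formula, not quoted from the library.

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the Qwen ViT patch-size arithmetic from the docs above.
// The grid definitions are assumptions consistent with the stated
// curSeqLength and inputDim formulas.
struct QwenPatchDims { int64_t curSeqLength; int64_t inputDim; };

QwenPatchDims qwenPatchDims(int64_t T, int64_t H, int64_t W, int64_t channels,
                            int64_t temporalPatchSize, int64_t patchSize,
                            int64_t mergeSize) {
    int64_t gridT = T / temporalPatchSize;
    int64_t gridH = H / (patchSize * mergeSize);
    int64_t gridW = W / (patchSize * mergeSize);
    return {gridT * gridH * gridW * mergeSize * mergeSize,
            channels * temporalPatchSize * patchSize * patchSize};
}
```

For example, a 2x56x56 RGB input with temporalPatchSize = 2, patchSize = 14, mergeSize = 2 yields 16 patch tokens of dimension 1176.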

void trt_edgellm::kernel::transposeToPatchInternVLPhi4MM(
rt::Tensor const &originalImage,
rt::Tensor &inputPatches,
int64_t const inputOffset,
cudaStream_t stream
)#

The kernel transposes image data to the patch format for the InternVL and Phi-4MM ViTs.

Inputs:

  • originalImage [GPU, Half]: Current image [1, height, width, channels]

  • inputOffset: Offset into the input patches, denoting the start index of the current image

  • stream: CUDA stream for execution

Outputs:

  • inputPatches [GPU, Half]: Total ViT input tensor of all images [totalNumBlocks, channels, blockSizeH, blockSizeW], where curNumBlocks = blockH * blockW and totalNumBlocks = sum(curNumBlocks) over all images

Throws:

std::runtime_error – if image has invalid shape, data type or location
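The block bookkeeping can be sketched similarly, assuming blockH = height / blockSizeH and blockW = width / blockSizeW (consistent with the hBlocks/wBlocks definition in Phi4MMIndex, though not stated verbatim here).

```cpp
#include <cassert>
#include <cstdint>

// Sketch: number of output blocks contributed by one image, assuming
// blockH/blockW are the image size divided by the block size.
int64_t numBlocksForImage(int64_t height, int64_t width,
                          int64_t blockSizeH, int64_t blockSizeW) {
    int64_t blockH = height / blockSizeH;
    int64_t blockW = width / blockSizeW;
    return blockH * blockW;  // curNumBlocks for this image
}
```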

void trt_edgellm::kernel::phi4mmPostprocessVisionTokens(
rt::Tensor const &srcEmbedding,
rt::Tensor &dstEmbedding,
Phi4MMIndex const &indices,
Phi4MMGN const &gn,
int64_t totalOutTokens,
cudaStream_t stream
)#

Constructs the Phi-4MM HD image token sequence for a batch by gathering from raw ViT tokens and inserting Grid Newline (GN) separators.

Inputs:

  • src: [numViTTokens, hidden] FP16. Raw ViT tokens for all images (global + sub), concatenated across images.

  • dst: [totalOutTokens, hidden] FP16. Output buffer for the packed HD sequence.

  • idx: Phi4MMIndex (device indices and sizes)

  • gn: Phi4MMGN (newline and separator embeddings)

Output layout per image (contiguous in dst):

  1. Sub segment: rows = tokensPerSide*hb, cols = tokensPerSide*wb, strideOut = cols+1; the last column is subGN (newline). Non-newline positions gather from src via (srcSubStart + blockId*256 + patchId).

  2. One glb_GN token (glbGN).

  3. Global segment: 16x16 grid with strideOut = 17; the last column is subGN; other positions gather from srcGlbStart.

Launch config:

  • gridDim.x = idx.totalOutTokens, blockDim.x = 128

  • Each CUDA block writes one output token vector; threads cooperate to copy idx.hidden elements.

Throws:

std::runtime_error – invalid tensor shape, location or data type
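The sub-segment gather rule in step 1 above can be sketched as a host-side address calculation. It assumes tokensPerSide = 16 and row-major ordering of blocks and of patches within a block, which is what the blockId*256 + patchId formula suggests.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: map an output (row, col) inside the sub segment to either the
// subGN newline or a raw src token offset. Assumes tokensPerSide = 16
// and row-major block/patch ordering. Returns -1 for a newline (subGN)
// position.
int64_t subSegmentSrcOffset(int64_t row, int64_t col, int64_t wb,
                            int64_t srcSubStart) {
    int64_t cols = 16 * wb;
    if (col == cols) return -1;  // last column of the row: subGN newline
    int64_t blockId = (row / 16) * wb + (col / 16);
    int64_t patchId = (row % 16) * 16 + (col % 16);
    return srcSubStart + blockId * 256 + patchId;
}
```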

void trt_edgellm::kernel::initRotaryPosEmbQwenViT(
rt::Tensor &rotaryPosEmb,
std::vector<int64_t> const &gridTHW,
int64_t const mergeSize,
int64_t const startIdx,
float const rotaryBaseFrequency,
float const scale,
cudaStream_t stream
)#

The kernel initializes the rotary position embeddings for the Qwen2.5-VL ViT.

Inputs:

  • gridTHW: Image grid dimensions [T, H, W] (Temporal, Height, Width)

  • mergeSize: Merge size for the vision transformer

  • startIdx: Start index for the current image

  • rotaryBaseFrequency: Rotary base frequency

  • scale: Scale for the rotary position embeddings

  • stream: CUDA stream for execution

Outputs:

  • rotaryPosEmb [GPU, Float]: Rotary position embeddings tensor [totalSeqLength, vitPosEmbDim]

Throws:

std::runtime_error – if image has invalid shape, data type or location
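For reference, the standard rotary frequency computation looks like the sketch below. This assumes the kernel follows the usual formulation inv_freq[i] = base^(-2i/dim); exactly where the scale parameter enters is not specified above, so it is left out here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sketch: standard rotary inverse frequencies. Assumes the conventional
// formulation; the role of the `scale` parameter in the actual kernel
// is not specified in the docs above.
std::vector<float> rotaryInvFreq(int dim, float baseFrequency) {
    std::vector<float> invFreq(dim / 2);
    for (int i = 0; i < dim / 2; ++i)
        invFreq[i] = std::pow(baseFrequency, -2.0f * static_cast<float>(i) / dim);
    return invFreq;
}
```

The rotation angle for a token at position p and frequency index i is then p * invFreq[i].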

void trt_edgellm::kernel::initFastPosEmbedQwenViT(
rt::Tensor &fastPosEmbedIdx,
rt::Tensor &fastPosEmbedWeight,
std::vector<int64_t> const &gridTHW,
int64_t const mergeSize,
int64_t const numGridPerSide,
int64_t const startIdx,
cudaStream_t stream
)#

The kernel initializes the fast position embeddings for the Qwen3-VL ViT.

Inputs:

  • gridTHW: Image grid dimensions [T, H, W] (only H and W are used)

  • mergeSize: Merge size for the vision transformer

  • numGridPerSide: Number of grids per side for the vision transformer

  • startIdx: Start index for the image

  • stream: CUDA stream for execution

Outputs:

  • fastPosEmbedIdx [GPU, Int64]: Fast position embedding index tensor [4, totalSeqLength]

  • fastPosEmbedWeight [GPU, Half]: Fast position embedding weight tensor [4, totalSeqLength]

Throws:

std::runtime_error – if image has invalid shape, data type or location
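The [4, totalSeqLength] shapes suggest bilinear interpolation over the numGridPerSide x numGridPerSide position grid: four neighbor indices plus four weights per output token. That reading is an assumption based only on the tensor shapes; a sketch of such a neighbor/weight computation:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch: four neighbor indices and bilinear weights for sampling an
// n x n grid at fractional coordinates (y, x). That the kernel's
// [4, totalSeqLength] outputs hold such neighbors/weights is an
// assumption suggested by the shapes, not stated in the docs.
struct BilinearSample {
    std::array<int64_t, 4> idx;
    std::array<float, 4> weight;
};

BilinearSample bilinearAt(float y, float x, int64_t n) {
    int64_t y0 = static_cast<int64_t>(std::floor(y));
    int64_t x0 = static_cast<int64_t>(std::floor(x));
    int64_t y1 = std::min(y0 + 1, n - 1);  // clamp to the grid edge
    int64_t x1 = std::min(x0 + 1, n - 1);
    float fy = y - static_cast<float>(y0);
    float fx = x - static_cast<float>(x0);
    return {{y0 * n + x0, y0 * n + x1, y1 * n + x0, y1 * n + x1},
            {(1 - fy) * (1 - fx), (1 - fy) * fx, fy * (1 - fx), fy * fx}};
}
```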