Image Util Kernels#

struct Phi4MMIndex#

Phi4MMIndex Device-side index and size metadata for Phi-4MM HD packing. Fields:

hBlocks/wBlocks [numImages]: per-image grid sizes (hb = H/blockImageSizeH, wb = W/blockImageSizeW)
srcGlbStart [numImages]: starting raw-token offset for the tokensPerSide x tokensPerSide global grid of image i
srcSubStart [numImages]: starting raw-token offset for sub-grid tokens of image i
dstOutStart [numImages]: starting packed-token offset in dst for image i
subOutLen [numImages]: sub segment token count per image (includes one newline per row)
numImages: batch size
hidden: embedding length
totalOutTokens: total tokens to be written across all images

Public Members

int32_t const *hBlocks#

int32_t const *wBlocks#

int64_t const *srcGlbStart#

int64_t const *srcSubStart#

int64_t const *dstOutStart#

int64_t const *subOutLen#

int32_t numImages#

int32_t hidden#

int64_t totalOutTokens#

struct Phi4MMGN#

Phi4MMGN Grid Newline (GN) and separator embeddings.

subGN [hidden] FP16: newline token vector inserted at the end of each sub-grid row
glbGN [hidden] FP16: single separator token placed between sub and global segments

Public Members

half const *subGN#

half const *glbGN#

void trt_edgellm::kernel::normalizeImage( rt::Tensor const &originalImage, rt::Tensor const &mean, rt::Tensor const &std, rt::Tensor &normalizedImage, cudaStream_t stream )#

void trt_edgellm::kernel::transposeToPatchQwenViT( rt::Tensor const &originalImage, rt::Tensor &inputPatches, int64_t const inputOffset, int64_t const temporalPatchSize, int64_t const patchSize, int64_t const mergeSize, cudaStream_t stream )#

void trt_edgellm::kernel::transposeToPatchInternVLPhi4MM( rt::Tensor const &originalImage, rt::Tensor &inputPatches, int64_t const inputOffset, cudaStream_t stream )#

void trt_edgellm::kernel::phi4mmPostprocessVisionTokens( rt::Tensor const &srcEmbedding, rt::Tensor &dstEmbedding, Phi4MMIndex const &indices, Phi4MMGN const &gn, int64_t totalOutTokens, cudaStream_t stream )#

phi4mmPostprocessVisionTokens Purpose: Construct the Phi-4MM HD image token sequence for a batch by gathering from raw ViT tokens and inserting Grid Newline (GN) separators.

Inputs:

src: [numViTTokens, hidden] FP16 Raw ViT tokens for all images (global + sub), concatenated across images.
dst: [totalOutTokens, hidden] FP16 Output buffer for the packed HD sequence.
idx: Phi4MMIndex (device indices and sizes)
gn: Phi4MMGN (newline and separator embeddings) Output layout per image (contiguous in dst): 1) Sub segment: rows = tokensPerSide*hb, cols = tokensPerSide*wb, strideOut = cols+1; last col is subGN (newline). Non-newline positions gather from src via (srcSubStart + blockId*256 + patchId). 2) One glb_GN token (glbGN). 3) Global segment: 16x16 grid with strideOut = 17; last col is subGN; others gather from srcGlbStart.

Launch config:

gridDim.x = idx.totalOutTokens, blockDim.x = 128
Each CUDA block writes one output token vector; threads cooperate to copy idx.hidden elements.

void trt_edgellm::kernel::initAttentionMaskQwenViT( rt::Tensor const &cuSeqlens, rt::Tensor &attentionMask, cudaStream_t stream )#

void trt_edgellm::kernel::initRotaryPosEmbQwenViT( rt::Tensor &rotaryPosEmb, std::vector<int64_t> const &gridTHW, int64_t const mergeSize, int64_t const startIdx, float const rotaryBaseFrequency, float const scale, cudaStream_t stream )#

void trt_edgellm::kernel::initFastPosEmbedQwenViT( rt::Tensor &fastPosEmbedIdx, rt::Tensor &fastPosEmbedWeight, std::vector<int64_t> const &gridTHW, int64_t const mergeSize, int64_t const numGridPerSide, int64_t const startIdx, cudaStream_t stream )#