Image Util Kernels#
-
struct Phi4MMIndex#
Phi4MMIndex Device-side index and size metadata for Phi-4MM HD packing. Fields:
hBlocks/wBlocks [numImages]: per-image grid sizes (hb = H/blockImageSizeH, wb = W/blockImageSizeW)
srcGlbStart [numImages]: starting raw-token offset for the tokensPerSide x tokensPerSide global grid of image i
srcSubStart [numImages]: starting raw-token offset for sub-grid tokens of image i
dstOutStart [numImages]: starting packed-token offset in dst for image i
subOutLen [numImages]: sub segment token count per image (includes one newline per row)
numImages: batch size
hidden: embedding length
totalOutTokens: total tokens to be written across all images
-
struct Phi4MMGN#
Phi4MMGN Grid Newline (GN) and separator embeddings.
subGN [hidden] FP16: newline token vector inserted at the end of each sub-grid row
glbGN [hidden] FP16: single separator token placed between sub and global segments
- void trt_edgellm::kernel::normalizeImage(
- rt::Tensor const &originalImage,
- rt::Tensor const &mean,
- rt::Tensor const &std,
- rt::Tensor &normalizedImage,
- cudaStream_t stream
- void trt_edgellm::kernel::transposeToPatchQwenViT(
- rt::Tensor const &originalImage,
- rt::Tensor &inputPatches,
- int64_t const inputOffset,
- int64_t const temporalPatchSize,
- int64_t const patchSize,
- int64_t const mergeSize,
- cudaStream_t stream
- void trt_edgellm::kernel::transposeToPatchInternVLPhi4MM(
- rt::Tensor const &originalImage,
- rt::Tensor &inputPatches,
- int64_t const inputOffset,
- cudaStream_t stream
- void trt_edgellm::kernel::phi4mmPostprocessVisionTokens(
- rt::Tensor const &srcEmbedding,
- rt::Tensor &dstEmbedding,
- Phi4MMIndex const &indices,
- Phi4MMGN const &gn,
- int64_t totalOutTokens,
- cudaStream_t stream
phi4mmPostprocessVisionTokens Purpose: Construct the Phi-4MM HD image token sequence for a batch by gathering from raw ViT tokens and inserting Grid Newline (GN) separators.
Inputs:
src: [numViTTokens, hidden] FP16 Raw ViT tokens for all images (global + sub), concatenated across images.
dst: [totalOutTokens, hidden] FP16 Output buffer for the packed HD sequence.
idx: Phi4MMIndex (device indices and sizes)
gn: Phi4MMGN (newline and separator embeddings) Output layout per image (contiguous in
dst): 1) Sub segment: rows = tokensPerSide*hb, cols = tokensPerSide*wb, strideOut = cols+1; last col is subGN (newline). Non-newline positions gather from src via (srcSubStart + blockId*256 + patchId). 2) One glb_GN token (glbGN). 3) Global segment: 16x16 grid with strideOut = 17; last col is subGN; others gather from srcGlbStart.
Launch config:
gridDim.x = idx.totalOutTokens, blockDim.x = 128
Each CUDA block writes one output token vector; threads cooperate to copy
idx.hiddenelements.
- void trt_edgellm::kernel::initAttentionMaskQwenViT( )#
- void trt_edgellm::kernel::initRotaryPosEmbQwenViT(
- rt::Tensor &rotaryPosEmb,
- std::vector<int64_t> const &gridTHW,
- int64_t const mergeSize,
- int64_t const startIdx,
- float const rotaryBaseFrequency,
- float const scale,
- cudaStream_t stream