Embedding Kernels

void trt_edgellm::kernel::embeddingLookup(
rt::Tensor const &inputIds,
rt::Tensor const &embeddingTable,
rt::Tensor &output,
cudaStream_t stream
)

Standard embedding lookup kernel.

Parameters:
  • inputIds[in] Input token IDs with shape [batchSize, seqLen]

  • embeddingTable[in] Embedding table with shape [vocabSize, hiddenSize]

  • output[out] Hidden states with shape [batchSize, seqLen, hiddenSize]

  • stream[in] CUDA stream for execution

Throws:

std::runtime_error – if tensor shapes or data types are invalid
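
The lookup described above is a row gather: each token ID selects one row of embeddingTable. The following CPU sketch illustrates those semantics only. The function name embeddingLookupReference and the use of flat std::vector buffers in place of rt::Tensor are hypothetical, and rejecting out-of-range IDs is an assumption, since the reference does not say how the kernel treats them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical CPU reference, not the library implementation.
// inputIds is flattened [batchSize * seqLen]; table is row-major [vocabSize, hiddenSize].
std::vector<float> embeddingLookupReference(std::vector<int32_t> const& inputIds,
    std::vector<float> const& table, int32_t vocabSize, int32_t hiddenSize)
{
    if (vocabSize <= 0 || hiddenSize <= 0)
    {
        throw std::runtime_error("vocabSize and hiddenSize must be positive");
    }
    auto const hidden = static_cast<std::size_t>(hiddenSize);
    if (table.size() != static_cast<std::size_t>(vocabSize) * hidden)
    {
        throw std::runtime_error("embedding table shape mismatch");
    }

    std::vector<float> output(inputIds.size() * hidden);
    for (std::size_t pos = 0; pos < inputIds.size(); ++pos)
    {
        int32_t const id = inputIds[pos];
        // Assumption: IDs outside [0, vocabSize) are treated as errors in this sketch.
        if (id < 0 || id >= vocabSize)
        {
            throw std::out_of_range("token ID outside [0, vocabSize)");
        }
        // Copy row `id` of the table into the output slot for this position.
        for (std::size_t h = 0; h < hidden; ++h)
        {
            output[pos * hidden + h] = table[static_cast<std::size_t>(id) * hidden + h];
        }
    }
    // Flattened output shape is [batchSize * seqLen, hiddenSize].
    assert(output.size() == inputIds.size() * hidden);
    return output;
}
```

A reference like this can be useful for checking GPU results on small inputs, for example by comparing the kernel output element by element after copying it back to the host.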

void trt_edgellm::kernel::embeddingLookupWithImageInsertion(
rt::Tensor const &inputIds,
rt::Tensor const &embeddingTable,
rt::Tensor const &imageEmbeds,
rt::Tensor &output,
cudaStream_t stream
)

Embedding lookup that inserts image embeddings, following the PromptTuningEmbedding logic.

Parameters:
  • inputIds[in] Input token IDs with shape [batchSize, seqLen]

  • embeddingTable[in] Embedding table with shape [vocabSize, hiddenSize]

  • imageEmbeds[in] Image embeddings with shape [imageTokenLen, hiddenSize]

  • output[out] Hidden states with shape [batchSize, seqLen, hiddenSize]

  • stream[in] CUDA stream for execution

Throws:

std::runtime_error – if tensor shapes or data types are invalid
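
The reference does not spell out the PromptTuningEmbedding logic. The sketch below assumes the usual prompt-tuning convention, which matches the legacy Qwen2.5-VL detection described for assembleDeepstackEmbedding: token IDs below vocabSize read from embeddingTable, and token IDs at or above vocabSize read row tokenId - vocabSize of imageEmbeds. The function name, the flat std::vector buffers, and the error handling are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical CPU reference, not the library implementation.
// table is row-major [vocabSize, hiddenSize]; imageEmbeds is row-major [imageTokenLen, hiddenSize].
std::vector<float> lookupWithImageInsertionReference(std::vector<int32_t> const& inputIds,
    std::vector<float> const& table, std::vector<float> const& imageEmbeds,
    int32_t vocabSize, int32_t hiddenSize)
{
    if (vocabSize <= 0 || hiddenSize <= 0)
    {
        throw std::runtime_error("vocabSize and hiddenSize must be positive");
    }
    auto const hidden = static_cast<std::size_t>(hiddenSize);
    if (table.size() != static_cast<std::size_t>(vocabSize) * hidden || imageEmbeds.size() % hidden != 0)
    {
        throw std::runtime_error("tensor shape mismatch");
    }
    auto const imageTokenLen = static_cast<int64_t>(imageEmbeds.size() / hidden);

    std::vector<float> output(inputIds.size() * hidden);
    for (std::size_t pos = 0; pos < inputIds.size(); ++pos)
    {
        int32_t const id = inputIds[pos];
        if (id < 0)
        {
            throw std::out_of_range("negative token ID");
        }
        // Assumed convention: IDs >= vocabSize address image rows, offset by vocabSize.
        bool const isImage = id >= vocabSize;
        int64_t const row = isImage ? static_cast<int64_t>(id) - vocabSize : id;
        if (isImage && row >= imageTokenLen)
        {
            throw std::out_of_range("image token index outside imageEmbeds");
        }
        std::vector<float> const& src = isImage ? imageEmbeds : table;
        for (std::size_t h = 0; h < hidden; ++h)
        {
            output[pos * hidden + h] = src[static_cast<std::size_t>(row) * hidden + h];
        }
    }
    assert(output.size() == inputIds.size() * hidden);
    return output;
}
```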

void trt_edgellm::kernel::assembleDeepstackEmbedding(
rt::Tensor const &inputIds,
rt::Tensor const &deepstackFeatures,
int32_t vocabSize,
rt::Tensor &deepstackEmbeds,
cudaStream_t stream,
int32_t imageTokenId = 0,
rt::OptionalInputTensor multimodalIndices = std::nullopt
)

Assemble deepstack embeddings by extracting image token embeddings from deepstack features.

This function scans the input token IDs and extracts embeddings from the provided deepstack features only at image-token positions. Image tokens are identified in one of two ways:

  • Legacy: token IDs >= vocabSize (Qwen2.5-VL where image tokens start at vocabSize)

  • Explicit: token ID == imageTokenId (Qwen3-Omni where image tokens are within vocab)

When multimodalIndices is provided, it is used to index into deepstackFeatures (required for Qwen3-Omni, where all image tokens share the same ID). Otherwise, the index falls back to tokenId - vocabSize.

Parameters:
  • inputIds[in] Input token IDs with shape [batchSize, seqLen]

  • deepstackFeatures[in] Deepstack image features with shape [numImageTokens, hiddenSize]

  • vocabSize[in] Vocabulary size (legacy threshold for image token detection)

  • imageTokenId[in] Explicit image token ID (0 = not set, use legacy >= vocabSize detection)

  • multimodalIndices[in] Pre-computed indices for image embeddings [batchSize, seqLen], or std::nullopt to use legacy tokenId - vocabSize indexing

  • deepstackEmbeds[out] Output embeddings with shape [batchSize, seqLen, hiddenSize]

  • stream[in] CUDA stream for execution

Throws:

std::runtime_error – if tensor shapes or data types are invalid
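
The two detection modes and the index fallback can be sketched on the CPU as follows. This is an illustration of the documented rules, not the kernel itself: the function name and the flat std::vector buffers are hypothetical, and zero-filling non-image positions is an assumption, because the reference does not state what deepstackEmbeds holds there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical CPU reference, not the library implementation.
// deepstackFeatures is row-major [numImageTokens, hiddenSize].
std::vector<float> assembleDeepstackReference(std::vector<int32_t> const& inputIds,
    std::vector<float> const& deepstackFeatures, int32_t hiddenSize, int32_t vocabSize,
    int32_t imageTokenId = 0,
    std::optional<std::vector<int32_t>> const& multimodalIndices = std::nullopt)
{
    if (hiddenSize <= 0)
    {
        throw std::runtime_error("hiddenSize must be positive");
    }
    auto const hidden = static_cast<std::size_t>(hiddenSize);
    if (deepstackFeatures.size() % hidden != 0
        || (multimodalIndices && multimodalIndices->size() != inputIds.size()))
    {
        throw std::runtime_error("tensor shape mismatch");
    }
    auto const numImageTokens = static_cast<int64_t>(deepstackFeatures.size() / hidden);

    // Assumption: positions that are not image tokens stay zero.
    std::vector<float> embeds(inputIds.size() * hidden, 0.0f);
    for (std::size_t pos = 0; pos < inputIds.size(); ++pos)
    {
        int32_t const id = inputIds[pos];
        // Explicit mode when imageTokenId is set (non-zero), legacy threshold otherwise.
        bool const isImage = (imageTokenId != 0) ? (id == imageTokenId) : (id >= vocabSize);
        if (!isImage)
        {
            continue;
        }
        // Pre-computed index if available, else the legacy tokenId - vocabSize offset.
        int64_t const row = multimodalIndices ? static_cast<int64_t>((*multimodalIndices)[pos])
                                              : static_cast<int64_t>(id) - vocabSize;
        if (row < 0 || row >= numImageTokens)
        {
            throw std::out_of_range("image feature index outside deepstackFeatures");
        }
        for (std::size_t h = 0; h < hidden; ++h)
        {
            embeds[pos * hidden + h] = deepstackFeatures[static_cast<std::size_t>(row) * hidden + h];
        }
    }
    assert(embeds.size() == inputIds.size() * hidden);
    return embeds;
}
```

In this sketch, explicit mode without multimodalIndices throws whenever imageTokenId is below vocabSize, because tokenId - vocabSize is then negative. That mirrors why the indices are described as required for Qwen3-Omni.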

void trt_edgellm::kernel::embeddingLookupMultimodal(
rt::Tensor const &inputIds,
rt::Tensor const &embeddingTable,
rt::OptionalInputTensor multimodalIndices,
std::optional<int32_t> imageTokenId,
rt::OptionalInputTensor imageEmbeds,
std::optional<int32_t> audioTokenId,
rt::OptionalInputTensor audioEmbeds,
rt::Tensor &output,
cudaStream_t stream
)

Embedding lookup with optional image and audio embeddings for multimodal models.

This kernel handles up to three types of tokens:

  • Normal text tokens (0 <= tokenId < vocabSize): lookup from embeddingTable

  • Image tokens (tokenId == imageTokenId): lookup from imageEmbeds using multimodalIndices (optional)

  • Audio tokens (tokenId == audioTokenId): lookup from audioEmbeds using multimodalIndices (optional)

multimodalIndices provides a pre-computed index into imageEmbeds or audioEmbeds for each position. For text tokens, the multimodalIndices value is not used. To indicate the presence of a modality, both its token ID and the corresponding embedding tensor must be provided.

Note

audioTokenId and imageTokenId are allowed to be smaller than vocabSize, as in the case of Qwen3.

Note

imageEmbeds and audioEmbeds should contain data in the order specified by multimodalIndices.

Note

When a modality is not needed, pass std::nullopt for both its tokenId and embeds

Note

multimodalIndices can be std::nullopt only when both imageEmbeds and audioEmbeds are std::nullopt

Parameters:
  • inputIds[in] Input token IDs with shape [batchSize, seqLen]

  • embeddingTable[in] Text embedding table with shape [vocabSize, hiddenSize]

  • multimodalIndices[in] Pre-computed indices for audio/image embeddings [batchSize, seqLen], can be std::nullopt if no image/audio inputs are provided

  • imageTokenId[in] Special token ID for image (e.g., 151655 in Qwen3), or std::nullopt if no image

  • imageEmbeds[in] Image embeddings with shape [totalImageTokens, hiddenSize], or std::nullopt if no image

  • audioTokenId[in] Special token ID for audio (e.g., 151675 in Qwen3), or std::nullopt if no audio

  • audioEmbeds[in] Audio embeddings with shape [totalAudioTokens, hiddenSize], or std::nullopt if no audio

  • output[out] Hidden states with shape [batchSize, seqLen, hiddenSize]

  • stream[in] CUDA stream for execution
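
The dispatch rules and the argument constraints from the notes above can be sketched on the CPU as follows. The function name, the Ids and Rows aliases, and the flat std::vector buffers are hypothetical stand-ins for rt::Tensor and rt::OptionalInputTensor. Checking the special token IDs before the text range is an inference from the note that those IDs may be smaller than vocabSize, and throwing on invalid arguments is an assumption, since no Throws section is documented for this function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

using Ids = std::vector<int32_t>;
using Rows = std::vector<float>; // row-major [numRows, hiddenSize]

// Hypothetical CPU reference, not the library implementation.
std::vector<float> multimodalLookupReference(Ids const& inputIds, Rows const& table, int32_t hiddenSize,
    std::optional<Ids> const& multimodalIndices, std::optional<int32_t> imageTokenId,
    std::optional<Rows> const& imageEmbeds, std::optional<int32_t> audioTokenId,
    std::optional<Rows> const& audioEmbeds)
{
    if (hiddenSize <= 0)
    {
        throw std::runtime_error("hiddenSize must be positive");
    }
    // A modality is present only when both its token ID and its embeddings are given.
    if (imageTokenId.has_value() != imageEmbeds.has_value()
        || audioTokenId.has_value() != audioEmbeds.has_value())
    {
        throw std::runtime_error("token ID and embeddings must be provided together");
    }
    // multimodalIndices may be absent only when no modality is present.
    if ((imageEmbeds || audioEmbeds) && !multimodalIndices)
    {
        throw std::runtime_error("multimodalIndices is required with image or audio embeddings");
    }
    if (multimodalIndices && multimodalIndices->size() != inputIds.size())
    {
        throw std::runtime_error("multimodalIndices must match inputIds in shape");
    }

    auto const hidden = static_cast<std::size_t>(hiddenSize);
    std::vector<float> output(inputIds.size() * hidden);

    // Copies one row of `src` into the output slot for `pos`, with bounds checking.
    auto copyRow = [&](Rows const& src, int64_t row, std::size_t pos) {
        if (row < 0 || (static_cast<std::size_t>(row) + 1) * hidden > src.size())
        {
            throw std::out_of_range("row index outside source tensor");
        }
        for (std::size_t h = 0; h < hidden; ++h)
        {
            output[pos * hidden + h] = src[static_cast<std::size_t>(row) * hidden + h];
        }
    };

    for (std::size_t pos = 0; pos < inputIds.size(); ++pos)
    {
        int32_t const id = inputIds[pos];
        // Special IDs are tested first because they may lie inside [0, vocabSize).
        if (imageTokenId && id == *imageTokenId)
        {
            copyRow(*imageEmbeds, (*multimodalIndices)[pos], pos);
        }
        else if (audioTokenId && id == *audioTokenId)
        {
            copyRow(*audioEmbeds, (*multimodalIndices)[pos], pos);
        }
        else
        {
            copyRow(table, id, pos); // normal text token
        }
    }
    return output;
}
```

For a text-only call, all five optional arguments would be std::nullopt, and the function reduces to a plain table lookup.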