Embedding Preprocessor

class EmbeddingPreprocessor

Reusable preprocessor that wraps all embedding-lookup kernel calls.

Given token IDs (and optional multimodal embeddings), it writes dense vectors into PipelineIO::inputsEmbeds (and into the deepstack slots when applicable). The class holds no state beyond its configuration references, so one instance can be shared across the prefill, decode, and system-prompt-cache paths.

Public Functions

EmbeddingPreprocessor(EmbeddingData const &embedding, LLMEngineConfig const &config)

Construct with the embedding table data and engine configuration.

Both references must outlive the preprocessor.
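
One way to satisfy this lifetime requirement is to keep the referents and the preprocessor in the same owner, with the referents declared first. The sketch below uses hypothetical stand-in types (EmbeddingPreprocessorSketch, RuntimeSketch, and the struct fields are illustrative assumptions, not the real definitions) and relies only on the C++ rule that members are destroyed in reverse declaration order.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; the real types carry more state than shown here.
struct EmbeddingData { std::vector<float> table; };
struct LLMEngineConfig { int32_t audioTokenId{-1}; int32_t numDeepstackFeatures{0}; };

// Mirrors the documented constructor shape: stores references, copies nothing.
class EmbeddingPreprocessorSketch {
public:
    EmbeddingPreprocessorSketch(EmbeddingData const& embedding, LLMEngineConfig const& config)
        : mEmbedding(embedding), mConfig(config) {}
    EmbeddingData const& embedding() const { return mEmbedding; }
    LLMEngineConfig const& config() const { return mConfig; }
private:
    EmbeddingData const& mEmbedding;
    LLMEngineConfig const& mConfig;
};

// Owner that keeps both referents alive for the preprocessor's lifetime.
// `preprocessor` is declared last, so it is destroyed first.
struct RuntimeSketch {
    RuntimeSketch() = default;
    // Copying would leave the copy's preprocessor pointing at the original's members.
    RuntimeSketch(RuntimeSketch const&) = delete;
    RuntimeSketch& operator=(RuntimeSketch const&) = delete;

    EmbeddingData embedding;
    LLMEngineConfig config;
    EmbeddingPreprocessorSketch preprocessor{embedding, config};
};
```

Because the preprocessor only holds references, a later change to the owner's config is visible through it without rebuilding the preprocessor.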

void embed(Tensor const &tokenIds, OptionalInputTensor visionEmbeds, OptionalInputTensor audioEmbeds, PipelineIO &io, cudaStream_t stream)

Embed token IDs into dense vectors, optionally inserting multimodal embeddings.

Dispatches to one of three kernel paths depending on the inputs:

Explicit-id multimodal path (kernel::embeddingLookupMultimodal) when audio is present, or when vision is present on an audio-capable model family. These families (Nemotron-Omni, Qwen3-Omni) keep <image> in-stream and are identified by mConfig.audioTokenId >= 0.
Legacy vision path (kernel::embeddingLookupWithImageInsertion) for vision-only families that remap image tokens as vocabSize + k (Qwen2.5-VL, InternVL; audioTokenId == -1).
Text-only path (kernel::embeddingLookup) for pure-text requests.
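
The selection rule above can be written as a small host-side predicate. This is a sketch of the documented conditions only: EmbedPath and selectEmbedPath are hypothetical names, and the real embed() presumably launches the corresponding kernel rather than returning a label.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical labels for the three documented kernel paths.
enum class EmbedPath { kMultimodal, kLegacyVision, kTextOnly };

// Host-side selection only; no kernel is launched in this sketch.
inline EmbedPath selectEmbedPath(bool hasVision, bool hasAudio, int32_t audioTokenId)
{
    // Per the description, audioTokenId >= 0 marks an audio-capable family.
    bool const audioCapable = audioTokenId >= 0;

    // Audio present, or vision on an audio-capable family: explicit-id path.
    if (hasAudio || (hasVision && audioCapable))
    {
        return EmbedPath::kMultimodal;
    }
    // Vision on a vision-only family (audioTokenId == -1): legacy path.
    if (hasVision)
    {
        return EmbedPath::kLegacyVision;
    }
    // No multimodal inputs at all.
    return EmbedPath::kTextOnly;
}
```

Checking the audio condition first means a vision-only request reaches the legacy path only when the model family is not audio-capable, which matches the order in which the three paths are described.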

Parameters:

tokenIds – GPU tensor of token IDs [batchSize, seqLen].
visionEmbeds – Optional vision (image) embeddings.
audioEmbeds – Optional audio embeddings.
io – Pipeline I/O; inputsEmbeds is written.
stream – CUDA stream for execution.

OptionalInputTensors assembleDeepstack(Tensor const &tokenIds, OptionalInputTensors const &features, PipelineIO &io, cudaStream_t stream)

Assemble deepstack features at image placeholder positions.

For each feature in features, calls kernel::assembleDeepstackEmbedding and writes into the corresponding slot in io.deepstackEmbeds.

Parameters:

tokenIds – GPU tensor of token IDs [batchSize, seqLen].
features – Vector of deepstack feature tensors from the vision runner.
io – Pipeline I/O; deepstackEmbeds[i] is written.
stream – CUDA stream for execution.

Returns:

Vector of const references suitable for engine binding.
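
The per-feature loop and the returned references can be illustrated on the host. Everything in this sketch is an assumption made for illustration: HostTensor, assembleOne, and assembleDeepstackSketch are hypothetical, and the placeholder encoding (token ID vocabSize + k selects feature row k, all other positions are zeroed) is borrowed from the legacy vision path described for embed(); the actual kernel::assembleDeepstackEmbedding may identify placeholders differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical host-side tensor, row-major [rows, hidden].
struct HostTensor {
    std::size_t rows{0};
    std::size_t hidden{0};
    std::vector<float> data;
};

using ConstTensorRefs = std::vector<std::reference_wrapper<HostTensor const>>;

// Stand-in for kernel::assembleDeepstackEmbedding.
// ASSUMPTION: token ID vocabSize + k is image placeholder k and selects
// feature row k; every other position is left as zeros.
inline void assembleOne(std::vector<int32_t> const& tokenIds, int32_t vocabSize,
    HostTensor const& feature, HostTensor& out)
{
    out.rows = tokenIds.size();
    out.hidden = feature.hidden;
    out.data.assign(out.rows * out.hidden, 0.0f);
    for (std::size_t pos = 0; pos < tokenIds.size(); ++pos)
    {
        int32_t const id = tokenIds[pos];
        if (id < vocabSize) { continue; } // ordinary text token
        auto const k = static_cast<std::size_t>(id - vocabSize);
        if (k >= feature.rows) { continue; } // out-of-range placeholder
        for (std::size_t h = 0; h < out.hidden; ++h)
        {
            out.data[pos * out.hidden + h] = feature.data[k * feature.hidden + h];
        }
    }
}

// Fills one output slot per feature and returns const references to the
// filled slots, in the same order as `features`.
inline ConstTensorRefs assembleDeepstackSketch(std::vector<int32_t> const& tokenIds,
    int32_t vocabSize, std::vector<HostTensor> const& features, std::vector<HostTensor>& slots)
{
    ConstTensorRefs bound;
    for (std::size_t i = 0; i < features.size() && i < slots.size(); ++i)
    {
        assembleOne(tokenIds, vocabSize, features[i], slots[i]);
        bound.push_back(std::cref(slots[i]));
    }
    return bound;
}
```

Returning references rather than copies keeps the slots as the single owner of the assembled data, so whatever is bound to the engine refers to the same storage that was just written.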

void prepareDeepstack(Tensor const &tokenIds, OptionalInputTensors const &features, PipelineIO &io, cudaStream_t stream)

Prepare deepstack slots for the current step.

Encapsulates the “config has deepstack, features present or missing?” policy so the runtime does not need to gate on numDeepstackFeatures:

no-op when mConfig.numDeepstackFeatures == 0 (non-VLM engine);
assembles real features via assembleDeepstack when features is non-empty;
zero-fills io.deepstackEmbeds[idx] otherwise (a text-only request on a VLM engine), so that the engine reads known-zero bytes.
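
The three-way policy above reduces to a decision on two inputs. The sketch below is a host-side illustration under stated assumptions: DeepstackAction, chooseDeepstackAction, and zeroFillSlots are hypothetical names, and the slots are modeled as plain host vectors, whereas the real method presumably zeroes device memory on the given CUDA stream.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical labels for the three documented outcomes.
enum class DeepstackAction { kNoOp, kAssemble, kZeroFill };

// numDeepstackFeatures comes from the engine config; numFeatures is the
// number of feature tensors supplied for the current step.
inline DeepstackAction chooseDeepstackAction(int numDeepstackFeatures, std::size_t numFeatures)
{
    if (numDeepstackFeatures == 0)
    {
        return DeepstackAction::kNoOp; // non-VLM engine: nothing to prepare
    }
    if (numFeatures > 0)
    {
        return DeepstackAction::kAssemble; // real features available
    }
    return DeepstackAction::kZeroFill; // text-only request on a VLM engine
}

// Host-side stand-in for zeroing every deepstack slot.
inline void zeroFillSlots(std::vector<std::vector<float>>& slots)
{
    for (auto& slot : slots)
    {
        std::fill(slot.begin(), slot.end(), 0.0f);
    }
}
```

Keeping this decision inside the preprocessor means the caller can invoke prepareDeepstack unconditionally on every step, regardless of engine type or request content.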

Parameters:

tokenIds – GPU tensor of token IDs [batchSize, seqLen].
features – Vector of deepstack feature tensors from the vision runner (may be empty).
io – Pipeline I/O; io.deepstackEmbeds[idx] is written or zeroed.
stream – CUDA stream for execution.