Embedding Preprocessor
- class EmbeddingPreprocessor
Reusable preprocessor that wraps all embedding-lookup kernel calls.
Given token IDs (and optional multimodal embeddings), it writes dense vectors into PipelineIO::inputsEmbeds (and into the deepstack slots when applicable). The class is intentionally stateless beyond its configuration references so that it can be shared across the prefill, decode, and system-prompt-cache paths.
Public Functions
- EmbeddingPreprocessor(EmbeddingData const &embedding, LLMEngineConfig const &config)
Construct with the embedding table data and engine configuration.
Both references must outlive the preprocessor.
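The lifetime requirement follows from the class keeping references rather than copies. The sketch below illustrates one way an owner could satisfy it. All types here (MockEmbeddingData, MockEngineConfig, MockPreprocessor, MockRuntime) are hypothetical stand-ins invented for illustration; they are not the library's types and carry only the fields the example needs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for EmbeddingData and LLMEngineConfig.
struct MockEmbeddingData { std::vector<float> table; };
struct MockEngineConfig { int audioTokenId = -1; int numDeepstackFeatures = 0; };

// Holds const references only, mirroring the documented constructor shape.
class MockPreprocessor {
public:
    MockPreprocessor(MockEmbeddingData const& embedding, MockEngineConfig const& config)
        : mEmbedding(embedding), mConfig(config) {}

    bool audioCapable() const { return mConfig.audioTokenId >= 0; }
    std::size_t tableSize() const { return mEmbedding.table.size(); }

private:
    MockEmbeddingData const& mEmbedding;  // not owned
    MockEngineConfig const& mConfig;      // not owned
};

// The owner declares the referenced objects before the preprocessor, so they
// are constructed first and destroyed last. Copying is disabled because a
// copied preprocessor would still refer to the original owner's members.
struct MockRuntime {
    MockRuntime() = default;
    MockRuntime(MockRuntime const&) = delete;
    MockRuntime& operator=(MockRuntime const&) = delete;

    MockEmbeddingData embedding;
    MockEngineConfig config;
    MockPreprocessor preprocessor{embedding, config};
};
```

Because the preprocessor reads through its references, a change to the owner's config is visible on the next call without rebuilding the preprocessor.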
- void embed(Tensor const &tokenIds, OptionalInputTensor visionEmbeds, OptionalInputTensor audioEmbeds, PipelineIO &io, cudaStream_t stream)
Embed token IDs into dense vectors, optionally inserting multimodal embeddings.
Dispatches to one of three kernel paths depending on the inputs:
- Explicit-id multimodal path (kernel::embeddingLookupMultimodal) when audio is present, or when vision is present on an audio-capable model family (Nemotron-Omni / Qwen3-Omni keep <image> in-stream; mConfig.audioTokenId >= 0 identifies these).
- Legacy vision path (kernel::embeddingLookupWithImageInsertion) for vision-only families that remap image tokens as vocabSize + k (Qwen2.5-VL, InternVL; audioTokenId == -1).
- Text-only path (kernel::embeddingLookup) for pure-text requests.
- Parameters:
tokenIds – GPU tensor of token IDs [batchSize, seqLen].
visionEmbeds – Optional vision (image) embeddings.
audioEmbeds – Optional audio embeddings.
io – Pipeline I/O; inputsEmbeds is written.
stream – CUDA stream for execution.
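The three-way dispatch described for embed can be written as a small pure function. This is a sketch of the selection rule only, not the library's code: EmbedPath and selectEmbedPath are hypothetical names, and the presence of the optional tensors is reduced to two booleans.

```cpp
// Hypothetical labels for the three kernel paths named in the description.
enum class EmbedPath {
    TextOnly,              // kernel::embeddingLookup
    LegacyVision,          // kernel::embeddingLookupWithImageInsertion
    ExplicitIdMultimodal   // kernel::embeddingLookupMultimodal
};

// Hypothetical helper: picks a path from the inputs, following the documented rule.
// audioTokenId >= 0 marks an audio-capable family; -1 marks a vision-only family.
inline EmbedPath selectEmbedPath(bool hasVision, bool hasAudio, int audioTokenId) {
    bool const audioCapable = audioTokenId >= 0;
    if (hasAudio || (hasVision && audioCapable)) {
        return EmbedPath::ExplicitIdMultimodal;
    }
    if (hasVision) {
        return EmbedPath::LegacyVision;
    }
    return EmbedPath::TextOnly;
}
```

Checking the multimodal condition first matters: a vision request on an audio-capable family must not fall through to the legacy path, since those families keep the image placeholder in-stream rather than remapping it.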
- OptionalInputTensors assembleDeepstack(Tensor const &tokenIds, OptionalInputTensors const &features, PipelineIO &io, cudaStream_t stream)
Assemble deepstack features at image placeholder positions.
For each feature in features, calls kernel::assembleDeepstackEmbedding and writes into the corresponding slot in io.deepstackEmbeds.
- Parameters:
tokenIds – GPU tensor of token IDs [batchSize, seqLen].
features – Vector of deepstack feature tensors from the vision runner.
io – Pipeline I/O; deepstackEmbeds[i] is written.
stream – CUDA stream for execution.
- Returns:
Vector of const references suitable for engine binding.
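To make "assemble at image placeholder positions" concrete, here is a CPU reference sketch for a single feature tensor. It rests on assumptions that this page does not confirm: that placeholders are encoded as vocabSize + k (the legacy remapping mentioned for embed), that k selects row k of the feature, and that non-placeholder positions are left as zeros. The function name and signature are hypothetical; the actual kernel::assembleDeepstackEmbedding may behave differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, for illustration only.
// tokenIds: flattened [batchSize * seqLen]
// feature:  row-major [numImageTokens, hiddenSize]
// returns:  row-major [batchSize * seqLen, hiddenSize]
inline std::vector<float> assembleDeepstackReference(
    std::vector<std::int32_t> const& tokenIds,
    std::vector<float> const& feature,
    std::int32_t vocabSize,
    std::size_t hiddenSize)
{
    std::vector<float> out(tokenIds.size() * hiddenSize, 0.0f);
    std::size_t const numRows = hiddenSize == 0 ? 0 : feature.size() / hiddenSize;

    for (std::size_t pos = 0; pos < tokenIds.size(); ++pos) {
        std::int32_t const id = tokenIds[pos];
        if (id < vocabSize) {
            continue;  // ordinary text token: row stays zero (assumption)
        }
        std::size_t const row = static_cast<std::size_t>(id - vocabSize);
        if (row >= numRows) {
            continue;  // placeholder without a matching feature row: leave zero
        }
        for (std::size_t h = 0; h < hiddenSize; ++h) {
            out[pos * hiddenSize + h] = feature[row * hiddenSize + h];
        }
    }
    return out;
}
```

With tokenIds {5, 10, 11, 3}, vocabSize 10, hiddenSize 2, and feature {1, 2, 3, 4}, positions 1 and 2 receive rows 0 and 1 and the other positions stay zero.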
- void prepareDeepstack(Tensor const &tokenIds, OptionalInputTensors const &features, PipelineIO &io, cudaStream_t stream)
Prepare deepstack slots for the current step.
Encapsulates the “config has deepstack, features present or missing?” policy so the runtime does not need to gate on numDeepstackFeatures:
- no-op when mConfig.numDeepstackFeatures == 0 (non-VLM engine);
- assembles real features via assembleDeepstack when features is non-empty;
- zero-fills io.deepstackEmbeds[idx] otherwise (text-only request on a VLM engine, so the engine reads known-zero bytes).
- Parameters:
tokenIds – GPU tensor of token IDs [batchSize, seqLen].
features – Vector of deepstack feature tensors from the vision runner (may be empty).
io – Pipeline I/O; io.deepstackEmbeds[idx] is written or zeroed.
stream – CUDA stream for execution.
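The prepareDeepstack policy reduces to a three-way decision on two quantities. The sketch below captures only that decision; DeepstackAction and chooseDeepstackAction are hypothetical names, and the actual assembly and zero-fill work on GPU tensors is omitted.

```cpp
#include <cstddef>

// Hypothetical labels for the three documented outcomes.
enum class DeepstackAction {
    NoOp,      // engine has no deepstack inputs (non-VLM engine)
    Assemble,  // real features supplied: delegate to assembleDeepstack
    ZeroFill   // VLM engine, text-only request: zero every deepstack slot
};

// Hypothetical helper mirroring the documented policy.
// numDeepstackFeatures: value from the engine configuration.
// numFeatures: number of feature tensors passed in for this step.
inline DeepstackAction chooseDeepstackAction(int numDeepstackFeatures, std::size_t numFeatures) {
    if (numDeepstackFeatures == 0) {
        return DeepstackAction::NoOp;
    }
    if (numFeatures > 0) {
        return DeepstackAction::Assemble;
    }
    return DeepstackAction::ZeroFill;
}
```

Keeping the decision in one place means a caller can invoke prepareDeepstack unconditionally on every step, which is the stated purpose of the method.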