Embedding Preprocessor

class EmbeddingPreprocessor

Reusable preprocessor that wraps all embedding-lookup kernel calls.

Given token IDs (and optional multimodal embeddings), it writes dense vectors into PipelineIO::inputsEmbeds (and into the deepstack slots when applicable). The class is intentionally stateless apart from the references it holds to the embedding data and engine configuration, so a single instance can be shared across the prefill, decode, and system-prompt-cache paths.

Public Functions

EmbeddingPreprocessor(
    EmbeddingData const &embedding,
    LLMEngineConfig const &config
)

Construct with the embedding table data and engine configuration.

Both references must outlive the preprocessor.
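One way to satisfy the lifetime requirement is to let a single owner hold the embedding data, the configuration, and the preprocessor, declared in that order. The sketch below uses stand-in types (EmbeddingDataStub, EngineConfigStub, PreprocessorStub, RuntimeStub are all hypothetical and only mimic the constructor shape documented above); it is an illustration of the ownership pattern, not the library's own code.

```cpp
#include <cassert>

// Hypothetical stand-ins for EmbeddingData and LLMEngineConfig.
struct EmbeddingDataStub { int vocabSize; };
struct EngineConfigStub { int audioTokenId; };

// Hypothetical stand-in for the preprocessor: it stores references only,
// so it never copies the table or the configuration.
class PreprocessorStub {
public:
    PreprocessorStub(EmbeddingDataStub const& embedding, EngineConfigStub const& config)
        : mEmbedding(embedding), mConfig(config) {}

    int vocabSize() const { return mEmbedding.vocabSize; }
    int audioTokenId() const { return mConfig.audioTokenId; }

private:
    EmbeddingDataStub const& mEmbedding;
    EngineConfigStub const& mConfig;
};

// Members are destroyed in reverse declaration order, so the preprocessor
// (declared last) is destroyed before the objects it refers to.
struct RuntimeStub {
    RuntimeStub() = default;
    // Copying would leave the copy's preprocessor pointing at the source's
    // members, so copying is disabled in this sketch.
    RuntimeStub(RuntimeStub const&) = delete;
    RuntimeStub& operator=(RuntimeStub const&) = delete;

    EmbeddingDataStub embedding{32000};
    EngineConfigStub config{-1};
    PreprocessorStub preprocessor{embedding, config};
};
```

Because the stub reads through its references, a change to the owner's config is visible on the next call; the flip side is that destroying the config first would leave the preprocessor with a dangling reference.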

void embed(
    Tensor const &tokenIds,
    OptionalInputTensor visionEmbeds,
    OptionalInputTensor audioEmbeds,
    PipelineIO &io,
    cudaStream_t stream
)

Embed token IDs into dense vectors, optionally inserting multimodal embeddings.

Dispatches to one of three kernel paths depending on the inputs:

  1. Explicit-id multimodal path (kernel::embeddingLookupMultimodal), taken when audio embeddings are present, or when vision embeddings are present on an audio-capable model family. Nemotron-Omni and Qwen3-Omni keep <image> in-stream; mConfig.audioTokenId >= 0 identifies such families.

  2. Legacy vision path (kernel::embeddingLookupWithImageInsertion), taken for vision-only families that remap image tokens to vocabSize + k (Qwen2.5-VL, InternVL); these have audioTokenId == -1.

  3. Text-only path (kernel::embeddingLookup) for pure-text requests.
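The three-way dispatch described above can be sketched as a small pure function. The names EmbedPath, DispatchInputs, and selectEmbedPath are hypothetical and introduced only for illustration; the only facts taken from the documentation are the conditions themselves (audio present, vision present, and the sign of audioTokenId).

```cpp
#include <cassert>

// Hypothetical labels for the three kernel paths listed above.
enum class EmbedPath {
    kExplicitIdMultimodal, // kernel::embeddingLookupMultimodal
    kLegacyVision,         // kernel::embeddingLookupWithImageInsertion
    kTextOnly              // kernel::embeddingLookup
};

// Hypothetical summary of the inputs that drive the decision.
struct DispatchInputs {
    bool hasVision;   // visionEmbeds holds a tensor
    bool hasAudio;    // audioEmbeds holds a tensor
    int audioTokenId; // mirrors mConfig.audioTokenId; -1 on vision-only families
};

inline EmbedPath selectEmbedPath(DispatchInputs const& in) {
    bool const audioCapable = in.audioTokenId >= 0;
    // Path 1: any audio, or vision on an audio-capable family.
    if (in.hasAudio || (in.hasVision && audioCapable)) {
        return EmbedPath::kExplicitIdMultimodal;
    }
    // Path 2: vision on a vision-only family.
    if (in.hasVision) {
        return EmbedPath::kLegacyVision;
    }
    // Path 3: no multimodal inputs at all.
    return EmbedPath::kTextOnly;
}
```

Note that a text-only request on an audio-capable model still falls through to the text-only path in this sketch, since neither optional tensor is present.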

Parameters:
  • tokenIds – GPU tensor of token IDs [batchSize, seqLen].

  • visionEmbeds – Optional vision (image) embeddings.

  • audioEmbeds – Optional audio embeddings.

  • io – Pipeline I/O; io.inputsEmbeds is written.

  • stream – CUDA stream for execution.
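The vocabSize + k remapping used by the legacy vision path can be illustrated with a host-side reference. This is a sketch under stated assumptions, not the CUDA kernel: the function name embedWithImageInsertionRef, the flat row-major layouts, and the choice to leave out-of-range positions zeroed are all hypothetical. The only behaviour taken from the description above is that an ID of vocabSize + k selects row k of the image embeddings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference sketch of a lookup with image insertion.
// table:       [vocabSize, hidden] row-major text embedding table.
// imageEmbeds: [numImageRows, hidden] row-major image embeddings.
// Returns:     [tokenIds.size(), hidden] row-major output.
std::vector<float> embedWithImageInsertionRef(
    std::vector<std::int32_t> const& tokenIds,
    std::vector<float> const& table,
    std::vector<float> const& imageEmbeds,
    std::int32_t vocabSize,
    std::size_t hidden)
{
    assert(hidden > 0 && vocabSize >= 0);
    assert(table.size() == static_cast<std::size_t>(vocabSize) * hidden);
    std::size_t const numImageRows = imageEmbeds.size() / hidden;

    std::vector<float> out(tokenIds.size() * hidden, 0.0f);
    for (std::size_t t = 0; t < tokenIds.size(); ++t) {
        std::int32_t const id = tokenIds[t];
        float const* src = nullptr;
        if (id >= 0 && id < vocabSize) {
            // Ordinary text token: read row `id` of the table.
            src = table.data() + static_cast<std::size_t>(id) * hidden;
        } else if (id >= vocabSize) {
            // Remapped image token: id == vocabSize + k selects image row k.
            std::size_t const k = static_cast<std::size_t>(id - vocabSize);
            if (k < numImageRows) {
                src = imageEmbeds.data() + k * hidden;
            }
        }
        // Assumption of this sketch: invalid IDs leave the output row zeroed.
        if (src != nullptr) {
            std::copy(src, src + hidden,
                      out.begin() + static_cast<std::ptrdiff_t>(t * hidden));
        }
    }
    return out;
}
```

With vocabSize = 2 and hidden = 2, the token sequence {1, 2, 0} reads table row 1, image row 0, and table row 0, in that order.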

OptionalInputTensors assembleDeepstack(
    Tensor const &tokenIds,
    OptionalInputTensors const &features,
    PipelineIO &io,
    cudaStream_t stream
)

Assemble deepstack features at image placeholder positions.

For each feature in features, calls kernel::assembleDeepstackEmbedding and writes into the corresponding slot in io.deepstackEmbeds.

Parameters:
  • tokenIds – GPU tensor of token IDs [batchSize, seqLen].

  • features – Vector of deepstack feature tensors from the vision runner.

  • io – Pipeline I/O; io.deepstackEmbeds[i] is written.

  • stream – CUDA stream for execution.

Returns:

Vector of const references suitable for engine binding.

void prepareDeepstack(
    Tensor const &tokenIds,
    OptionalInputTensors const &features,
    PipelineIO &io,
    cudaStream_t stream
)

Prepare deepstack slots for the current step.

Encapsulates the policy for deepstack slots, covering both whether the engine is configured with deepstack and whether features are present for this request, so the runtime does not need to gate on numDeepstackFeatures:

  • no-op when mConfig.numDeepstackFeatures == 0 (non-VLM engine);

  • assembles real features via assembleDeepstack when features is non-empty;

  • zero-fills io.deepstackEmbeds[idx] otherwise (a text-only request on a VLM engine), so that the engine reads known-zero bytes.
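The three cases above can be sketched on the host as follows. Slot, Io, DeepstackAction, and prepareDeepstackRef are hypothetical stand-ins; in particular, the "assemble" branch simply copies feature i into slot i as a placeholder for assembleDeepstack, which in the real class scatters features at image placeholder positions on the GPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Slot = std::vector<float>; // stand-in for one deepstack tensor

// Hypothetical stand-in for the deepstack part of PipelineIO.
struct Io {
    std::vector<Slot> deepstackEmbeds;
};

enum class DeepstackAction { kNoOp, kAssembled, kZeroFilled };

DeepstackAction prepareDeepstackRef(
    std::size_t numDeepstackFeatures, // mirrors mConfig.numDeepstackFeatures
    std::vector<Slot> const& features,
    Io& io)
{
    // Case 1: non-VLM engine, nothing to prepare.
    if (numDeepstackFeatures == 0) {
        return DeepstackAction::kNoOp;
    }
    // Case 2: features are present, so fill the slots from them.
    if (!features.empty()) {
        std::size_t const n = std::min(features.size(), io.deepstackEmbeds.size());
        for (std::size_t i = 0; i < n; ++i) {
            io.deepstackEmbeds[i] = features[i]; // placeholder for assembleDeepstack
        }
        return DeepstackAction::kAssembled;
    }
    // Case 3: text-only request on a VLM engine, so zero every slot.
    for (Slot& slot : io.deepstackEmbeds) {
        std::fill(slot.begin(), slot.end(), 0.0f);
    }
    return DeepstackAction::kZeroFilled;
}
```

Keeping the policy in one function means callers can invoke it unconditionally on every step, which is the point of the design described above.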

Parameters:
  • tokenIds – GPU tensor of token IDs [batchSize, seqLen].

  • features – Vector of deepstack feature tensors from the vision runner (may be empty).

  • io – Pipeline I/O; io.deepstackEmbeds[idx] is written or zeroed.

  • stream – CUDA stream for execution.