Deepstack Binding

class DeepstackBinding

Encapsulates the two-mode binding for Qwen3-VL / Qwen3-Omni deepstack engine inputs.

The engine graph elementwise-adds deepstack_embeds_d to hidden_states inside every decoder-layer forward, so a tensor must be bound on every call, even when the request has no vision features. This class owns the swap between the two bindings:

useRealFeatures(map) binds io.deepstackEmbeds[i] (real per-request features, populated by the embedding preprocessor on prefill).
useZeroTarget(map) binds a shared zero buffer owned by SharedResources::zeroBuffer. The buffer is sized to the worst-case resolved shape ({maxBatch, maxDeepstackSeqLen, hiddenSize}, HALF), so TensorRT's read falls within the allocation regardless of the per-step batch / seqLen resolved from InferenceDims. Because the buffer holds only zeros, the engine's elementwise add of hidden_states and the deepstack input is a no-op.
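
The sizing rule for the zero buffer can be sketched as follows. This is an illustration only: the struct `DeepstackLimits` and the helpers `shapeBytes`, `zeroBufferBytes`, and `readFits` are hypothetical names, not part of the documented API, and the sketch assumes HALF means 2-byte FP16 elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical container for the worst-case limits; the real values would
// come from the engine configuration.
struct DeepstackLimits {
    int64_t maxBatch;
    int64_t maxDeepstackSeqLen;
    int64_t hiddenSize;
};

constexpr std::size_t kHalfBytes = 2;  // assumption: HALF is 2-byte FP16

// Bytes occupied by a {batch, seqLen, hiddenSize} HALF tensor.
inline std::size_t shapeBytes(int64_t batch, int64_t seqLen, int64_t hiddenSize) {
    assert(batch >= 0 && seqLen >= 0 && hiddenSize >= 0);
    return static_cast<std::size_t>(batch) * static_cast<std::size_t>(seqLen) *
           static_cast<std::size_t>(hiddenSize) * kHalfBytes;
}

// Allocation size for the worst-case shape
// {maxBatch, maxDeepstackSeqLen, hiddenSize}.
inline std::size_t zeroBufferBytes(const DeepstackLimits& lim) {
    return shapeBytes(lim.maxBatch, lim.maxDeepstackSeqLen, lim.hiddenSize);
}

// True when a per-step resolved shape stays within the limits, so the read
// cannot run past the end of the zero buffer.
inline bool readFits(const DeepstackLimits& lim, int64_t batch, int64_t seqLen) {
    return batch <= lim.maxBatch && seqLen <= lim.maxDeepstackSeqLen;
}
```

With hypothetical limits of maxBatch = 4, maxDeepstackSeqLen = 1024, and hiddenSize = 2048, `zeroBufferBytes` evaluates to 4 × 1024 × 2048 × 2 = 16,777,216 bytes, and any step with a smaller batch or sequence length reads a prefix of that allocation.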

The spec runtime speaks intent (verbs), never tensor names. Name templating (deepstack_embeds_0, _1, …) stays inside this class.
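
A minimal sketch of the two-mode swap, under stated assumptions: `Tensor` and `TensorMap` below are simplified stand-ins for the runtime's real types, the class is named `DeepstackBindingSketch` to make clear it is not the actual implementation, and the mode strings and the initial "unbound" mode are invented for illustration. Only the method names and the `deepstack_embeds_<i>` naming come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for the runtime's tensor and binding-map types.
struct Tensor {
    std::string label;
};
using TensorMap = std::map<std::string, const Tensor*>;

class DeepstackBindingSketch {
public:
    // Captures references only; both must outlive every use* call.
    DeepstackBindingSketch(std::vector<Tensor>& realBuffers, Tensor& zeroTarget)
        : mReal(realBuffers), mZero(zeroTarget) {}

    // Bind deepstack_embeds_<i> to the i-th real-feature buffer.
    void useRealFeatures(TensorMap& map) {
        for (std::size_t i = 0; i < mReal.size(); ++i) {
            map[nameFor(i)] = &mReal[i];
        }
        mMode = "real";  // assumed mode label
    }

    // Bind every deepstack_embeds_<i> to the single shared zero tensor.
    void useZeroTarget(TensorMap& map) {
        for (std::size_t i = 0; i < mReal.size(); ++i) {
            map[nameFor(i)] = &mZero;
        }
        mMode = "zero";  // assumed mode label
    }

    // All binding names this feature owns, in index order.
    std::vector<std::string> ownedNames() const {
        std::vector<std::string> names;
        for (std::size_t i = 0; i < mReal.size(); ++i) {
            names.push_back(nameFor(i));
        }
        return names;
    }

    std::string currentModeName() const { return mMode; }

    int32_t numFeatures() const noexcept {
        return static_cast<int32_t>(mReal.size());
    }

private:
    // Name templating stays private so callers never spell tensor names.
    static std::string nameFor(std::size_t i) {
        return "deepstack_embeds_" + std::to_string(i);
    }

    std::vector<Tensor>& mReal;
    Tensor& mZero;
    std::string mMode = "unbound";  // assumed initial state
};
```

A caller would invoke `useRealFeatures(map)` before base prefill and `useZeroTarget(map)` before each later execute, without ever constructing a `deepstack_embeds_*` string itself.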

Public Functions

DeepstackBinding(std::vector<Tensor> &realBuffers, Tensor &zeroTarget): Construct, capturing references to the per-request real-feature buffers and the shared zero target tensor. Both references must outlive every useRealFeatures / useZeroTarget call.

void useRealFeatures(TensorMap &map): Bind each deepstack_embeds_d entry to the corresponding real-feature buffer. Call before base prefill.

void useZeroTarget(TensorMap &map): Bind every deepstack_embeds_d entry to the shared zero target tensor. Call before every non-prefill engine execute on the base side (vanilla decode, tree verify, CUDA-graph capture).

std::vector<std::string> ownedNames() const: Enumerate every binding name this feature owns. Used by TensorMap validation to assert that every map entry is covered by the TensorRegistry, a MutableBinding, or LoRA.
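
The coverage check that consumes ownedNames() might look roughly like the sketch below. The helper `uncoveredNames` and its signature are hypothetical; the actual validation code is not shown in this reference. The sketch assumes each owner (the TensorRegistry, each MutableBinding, LoRA) can report its names as a list of strings.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns the map entry names that no owner claims.
// An empty result means every entry is covered.
inline std::vector<std::string> uncoveredNames(
    const std::vector<std::string>& mapEntries,
    const std::vector<std::vector<std::string>>& ownerNameLists) {
    // Union of all names claimed by any owner.
    std::set<std::string> covered;
    for (const auto& names : ownerNameLists) {
        covered.insert(names.begin(), names.end());
    }
    // Collect entries that fall outside that union, preserving input order.
    std::vector<std::string> missing;
    for (const auto& name : mapEntries) {
        if (covered.count(name) == 0) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

In this sketch, the list returned by ownedNames() would be passed as one of the owner lists, so a `deepstack_embeds_*` entry in the map is accepted only because this feature claims it.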

std::string currentModeName() const: Diagnostic: current mode as a human-readable string.

inline int32_t numFeatures() const noexcept: Number of deepstack features (== cfg.numDeepstackFeatures).