Deepstack Binding#
-
class DeepstackBinding#
Encapsulates the two-mode binding for Qwen3-VL / Qwen3-Omni deepstack engine inputs.
The engine graph elementwise-adds deepstack_embeds_d to hidden_states inside every decoder-layer forward, so something must be bound on each call even when the request has no vision features. This class owns that swap:
- useRealFeatures(map) binds io.deepstackEmbeds[i] (real per-request features, populated by the embedding preprocessor on prefill).
- useZeroTarget(map) binds a shared zero buffer owned by SharedResources::zeroBuffer. The buffer is sized to the worst-case resolved shape ({maxBatch, maxDeepstackSeqLen, hiddenSize} HALF) so TRT’s read falls within the allocation regardless of the per-step batch/seqLen resolved from InferenceDims. Zero contents make the engine’s hidden_states + deepstack elementwise add a no-op.
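The worst-case sizing rule can be illustrated with a little arithmetic. The following is a minimal sketch, not the library's code: the DeepstackShape struct and the helper names zeroBufferBytes, stepReadBytes, and readFits are hypothetical, and the only facts taken from the description are the shape {maxBatch, maxDeepstackSeqLen, hiddenSize} and the HALF element type (2 bytes per element).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical descriptor; field names mirror the worst-case shape in the text.
struct DeepstackShape {
    int64_t maxBatch;
    int64_t maxDeepstackSeqLen;
    int64_t hiddenSize;
};

// HALF (fp16) occupies 2 bytes per element.
constexpr std::size_t kHalfBytes = 2;

// Bytes allocated for the shared zero buffer at the worst-case shape.
constexpr std::size_t zeroBufferBytes(DeepstackShape const& s) {
    return static_cast<std::size_t>(s.maxBatch) *
           static_cast<std::size_t>(s.maxDeepstackSeqLen) *
           static_cast<std::size_t>(s.hiddenSize) * kHalfBytes;
}

// Bytes the engine would read for one step with a resolved batch/seqLen.
constexpr std::size_t stepReadBytes(DeepstackShape const& s, int64_t batch, int64_t seqLen) {
    return static_cast<std::size_t>(batch) *
           static_cast<std::size_t>(seqLen) *
           static_cast<std::size_t>(s.hiddenSize) * kHalfBytes;
}

// True when the per-step read stays inside the worst-case allocation.
constexpr bool readFits(DeepstackShape const& s, int64_t batch, int64_t seqLen) {
    return batch <= s.maxBatch && seqLen <= s.maxDeepstackSeqLen &&
           stepReadBytes(s, batch, seqLen) <= zeroBufferBytes(s);
}
```

With maxBatch = 4, maxDeepstackSeqLen = 1024, and hiddenSize = 2048, the buffer occupies 4 × 1024 × 2048 × 2 = 16,777,216 bytes, and any step whose batch and seqLen stay within those maxima reads a prefix of that allocation.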
The spec runtime speaks intent (verbs), never tensor names. Name templating (deepstack_embeds_0, _1, …) stays inside this class.
Public Functions
- DeepstackBinding( )#
Construct, capturing references to the per-request real-feature buffers and the shared zero target tensor. Both references must outlive every useRealFeatures / useZeroTarget call.
-
void useRealFeatures(TensorMap &map)#
Bind each deepstack_embeds_d entry to the corresponding real-feature buffer. Call before base prefill.
-
void useZeroTarget(TensorMap &map)#
Bind every deepstack_embeds_d entry to the shared zero target tensor. Call before every non-prefill engine execute on the base side (vanilla decode, tree verify, CUDA-graph capture).
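The two-mode swap performed by useRealFeatures and useZeroTarget can be sketched with stand-in types. This is an illustration under stated assumptions, not the actual implementation: MockTensorMap (a name-to-pointer map) and MockDeepstackBinding are hypothetical, the real TensorMap and tensor types are not shown in this reference, and the mode strings "real" and "zero" as well as the initial mode are invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the real TensorMap: binding name -> device pointer (assumption).
using MockTensorMap = std::unordered_map<std::string, void const*>;

class MockDeepstackBinding {
public:
    // Captures references only; both must outlive every use* call.
    MockDeepstackBinding(std::vector<void const*> const& realFeatures, void const* zeroTarget)
        : mReal(realFeatures), mZero(zeroTarget) {}

    // Prefill: bind each deepstack_embeds_<i> to its own per-request buffer.
    void useRealFeatures(MockTensorMap& map) {
        for (std::size_t i = 0; i < mReal.size(); ++i) {
            map[nameOf(i)] = mReal[i];
        }
        mUsingReal = true;
    }

    // Non-prefill steps: bind every deepstack_embeds_<i> to the shared zero buffer.
    void useZeroTarget(MockTensorMap& map) {
        for (std::size_t i = 0; i < mReal.size(); ++i) {
            map[nameOf(i)] = mZero;
        }
        mUsingReal = false;
    }

    // Mode strings are illustrative, not the library's actual values.
    std::string currentModeName() const { return mUsingReal ? "real" : "zero"; }

private:
    // Name templating stays private, so callers never spell tensor names.
    static std::string nameOf(std::size_t i) {
        return "deepstack_embeds_" + std::to_string(i);
    }

    std::vector<void const*> const& mReal;
    void const* mZero;
    bool mUsingReal{false};
};
```

A caller in this sketch would invoke useRealFeatures(map) once before base prefill and useZeroTarget(map) before each later execute. Because the same binding names are rewritten in place, the map never lacks an entry for a deepstack input.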
-
std::vector<std::string> ownedNames() const#
Enumerate every binding name this feature owns. Used by TensorMap validation to assert: every map entry is covered by either the TensorRegistry, a MutableBinding, or LoRA.
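A rough sketch of how owned names might feed such a coverage check follows. The functions deepstackNames and uncoveredEntries are hypothetical helpers written for illustration; only the deepstack_embeds_<i> naming pattern and the idea that every map entry must be covered by some owner come from this reference.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Generate deepstack_embeds_0 .. deepstack_embeds_<n-1> (pattern from the text).
std::vector<std::string> deepstackNames(int32_t numFeatures) {
    std::vector<std::string> names;
    names.reserve(static_cast<std::size_t>(numFeatures > 0 ? numFeatures : 0));
    for (int32_t i = 0; i < numFeatures; ++i) {
        names.push_back("deepstack_embeds_" + std::to_string(i));
    }
    return names;
}

// Return the map keys that no owner list claims; empty means full coverage.
// Each inner vector stands for one owner's list of binding names (assumption).
std::vector<std::string> uncoveredEntries(
    std::vector<std::string> const& mapKeys,
    std::vector<std::vector<std::string>> const& ownerLists) {
    std::unordered_set<std::string> owned;
    for (auto const& list : ownerLists) {
        owned.insert(list.begin(), list.end());
    }
    std::vector<std::string> missing;
    for (auto const& key : mapKeys) {
        if (owned.count(key) == 0) {
            missing.push_back(key);
        }
    }
    return missing;
}
```

In this sketch a validator would assert that uncoveredEntries returns an empty vector, which is one way to detect a map entry that no component claims.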
-
std::string currentModeName() const#
Diagnostic: current mode as a human-readable string.
-
inline int32_t numFeatures() const noexcept#
Number of deepstack features (== cfg.numDeepstackFeatures).