Pipeline IO
- struct PipelineIO
All tensors flowing through the inference pipeline. POINTER STABILITY INVARIANT: After buildTensorMap() is called, this struct must not be moved, and deepstackEmbeds must not be resized. TensorMap holds Tensor* pointers into these members — any reallocation invalidates them.
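The invariant can be illustrated with stand-in types. The sketch below is not the library's implementation: `Tensor`, `TensorMap` and `PipelineIO` are reduced to minimal hypothetical shapes, and `buildTensorMapSketch`, `pointersSurviveOwnerMove` and the binding names are invented for illustration. It shows one possible way to respect the invariant: fix the size of deepstackEmbeds first, build the map, and keep the struct on the heap so that ownership can change hands without the struct itself moving.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real types carry shapes, dtypes and buffers.
struct Tensor { int id = 0; };
using TensorMap = std::unordered_map<std::string, Tensor*>;

struct PipelineIO {
    Tensor hostSelectTokenIndices;
    std::vector<Tensor> deepstackEmbeds;
};

// Simplified stand-in for buildTensorMap(): stores addresses of io's members.
// The binding names used as keys are made up for this sketch.
void buildTensorMapSketch(TensorMap& map, PipelineIO& io) {
    map["host_select_token_indices"] = &io.hostSelectTokenIndices;
    for (std::size_t i = 0; i < io.deepstackEmbeds.size(); ++i) {
        map["deepstack_embeds_" + std::to_string(i)] = &io.deepstackEmbeds[i];
    }
}

// Returns true if the map still points at io's members after the owning
// handle has been moved. Moving a unique_ptr transfers ownership only; the
// PipelineIO object itself stays at the same heap address.
bool pointersSurviveOwnerMove() {
    auto io = std::make_unique<PipelineIO>();
    io->deepstackEmbeds.resize(3);  // final size is fixed BEFORE building the map

    TensorMap map;
    buildTensorMapSketch(map, *io);

    std::unique_ptr<PipelineIO> newOwner = std::move(io);
    assert(newOwner != nullptr);
    return map.at("host_select_token_indices") == &newOwner->hostSelectTokenIndices
        && map.at("deepstack_embeds_2") == &newOwner->deepstackEmbeds[2];
}
```

By contrast, moving the struct by value after the map is built would leave the map pointing at members of the moved-from object, and growing deepstackEmbeds beyond its capacity would reallocate its storage.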
Public Members
- Tensor hostSelectTokenIndices
CPU (pinned, [maxBatch, 1] INT64) — pairs with selectTokenIndices for H2D staging.
Public Static Functions
- static PipelineIO createForLLM(LLMEngineConfig const &cfg, cudaStream_t stream)
Build PipelineIO for the vanilla single-engine LLM runtime (basic I/O tensors, deepstack embeds, MRope cos/sin cache).
- static PipelineIO createForSpecDecode(DeploymentConfig const &bundle, int32_t maxRuntimeBatchSize, cudaStream_t stream)
Build PipelineIO for a two-engine speculative-decoding runtime (basic I/O, hidden states, deepstack embeds, MRope cos/sin cache).
- void trt_edgellm::rt::allocateBasicIO(PipelineIO &io, int32_t maxBatch, int32_t maxSeq, int32_t hiddenSize, int32_t vocabSize, nvinfer1::DataType dtype)
- void trt_edgellm::rt::allocateDeepstackEmbeds(PipelineIO &io, int32_t numFeatures, int32_t maxBatch, int32_t maxSeq, int32_t hiddenSize, nvinfer1::DataType dtype)
- void trt_edgellm::rt::allocateSpecDecodeHiddenStates(PipelineIO &io, int32_t maxBatch, int32_t maxSeq, int32_t baseHiddenDim, int32_t draftHiddenDim, nvinfer1::DataType dtype)
- void trt_edgellm::rt::allocateMRope(PipelineIO &io, int32_t maxBatch, int32_t maxKVCacheCapacity, int32_t rotaryDim)
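Reading the two factory descriptions together with the helper signatures above suggests that each factory allocates a fixed set of tensor groups. The sketch below only models which groups each factory populates, using one boolean flag per group in place of real tensors. Whether the real factories call the allocate* helpers directly is an assumption, and every name ending in `Sketch` is hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in: one flag per tensor group instead of real tensors.
struct PipelineIO {
    bool basicIO = false;
    bool deepstackEmbeds = false;
    bool specDecodeHiddenStates = false;
    bool mrope = false;
};

// Stand-ins for the allocate* helpers; the real ones also take sizes and dtype.
void allocateBasicIOSketch(PipelineIO& io) { io.basicIO = true; }
void allocateDeepstackEmbedsSketch(PipelineIO& io) { io.deepstackEmbeds = true; }
void allocateSpecDecodeHiddenStatesSketch(PipelineIO& io) { io.specDecodeHiddenStates = true; }
void allocateMRopeSketch(PipelineIO& io) { io.mrope = true; }

// Single-engine runtime: basic I/O, deepstack embeds, MRope cache.
PipelineIO createForLLMSketch() {
    PipelineIO io;
    allocateBasicIOSketch(io);
    allocateDeepstackEmbedsSketch(io);
    allocateMRopeSketch(io);
    return io;
}

// Two-engine speculative decoding: the same groups plus hidden states.
PipelineIO createForSpecDecodeSketch() {
    PipelineIO io;
    allocateBasicIOSketch(io);
    allocateSpecDecodeHiddenStatesSketch(io);
    allocateDeepstackEmbedsSketch(io);
    allocateMRopeSketch(io);
    return io;
}

// Returns true if hidden states are the only group that differs.
bool onlyHiddenStatesDiffer() {
    PipelineIO const llm = createForLLMSketch();
    PipelineIO const spec = createForSpecDecodeSketch();
    assert(llm.basicIO && spec.basicIO);
    return llm.deepstackEmbeds == spec.deepstackEmbeds
        && llm.mrope == spec.mrope
        && !llm.specDecodeHiddenStates
        && spec.specDecodeHiddenStates;
}
```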
- TensorMap &map,
- PipelineIO &io,
- SharedResources &res,
- LLMEngineConfig const &cfg,
- int32_t kvCacheIndex
Populate a TensorMap from PipelineIO + SharedResources for engine binding.
This is the critical glue function that wires all allocated tensors into the name-to-pointer map consumed by TensorRegistry::bindAll().
- Parameters:
map – Output map to populate.
io – Pipeline I/O tensors.
res – Shared resources (KV caches, RoPE pool, LoRA, zero buffer).
cfg – Engine configuration.
kvCacheIndex – Index into res.cacheManagers for the target engine.
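As a rough illustration of the glue role described above, the following sketch wires a few members into a name-to-pointer map and uses kvCacheIndex to pick one entry of res.cacheManagers. All types are hypothetical stand-ins, the `kvCache` member and the binding names are invented, and the real function also takes cfg and binds many more tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the real runtime types.
struct Tensor { int id = 0; };
using TensorMap = std::unordered_map<std::string, Tensor*>;

struct PipelineIO { Tensor hostSelectTokenIndices; };
struct CacheManager { Tensor kvCache; };        // member name is an assumption
struct SharedResources {
    std::vector<CacheManager> cacheManagers;    // indexed by kvCacheIndex
    Tensor zeroBuffer;
};

// Simplified stand-in for buildTensorMap(); binding names are made up.
void buildTensorMapSketch(TensorMap& map, PipelineIO& io, SharedResources& res,
                          std::int32_t kvCacheIndex) {
    if (kvCacheIndex < 0
        || static_cast<std::size_t>(kvCacheIndex) >= res.cacheManagers.size()) {
        throw std::out_of_range("kvCacheIndex does not name a cache manager");
    }
    map["host_select_token_indices"] = &io.hostSelectTokenIndices;
    map["kv_cache"] = &res.cacheManagers[static_cast<std::size_t>(kvCacheIndex)].kvCache;
    map["zero_buffer"] = &res.zeroBuffer;
}

// Usage: with two cache managers, index 1 selects the second one.
bool bindsRequestedCache() {
    PipelineIO io;
    SharedResources res;
    res.cacheManagers.resize(2);
    TensorMap map;
    buildTensorMapSketch(map, io, res, /*kvCacheIndex=*/1);
    assert(map.size() == 3);
    return map.at("kv_cache") == &res.cacheManagers[1].kvCache;
}
```

Because the map stores raw pointers into both io and res, both objects have to outlive the map and stay at the same address while it is in use.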
- TensorMap &map,
- PipelineIO &io,
- SharedResources &res,
- LLMEngineConfig const &cfg
Populate a TensorMap for a SpecDecode draft engine. Delegates to buildTensorMap with kvCacheIndex=1 for the common bindings, then patches in draft-engine-specific bindings (base/draft hidden states in+out, packed proposal attention mask, proposal position IDs).
Preconditions: io must have been constructed via PipelineIO::createForSpecDecode (baseHiddenStates / draftHiddenStatesIn/Out / packedAttentionMask / specDecodePositionIds populated).
- Parameters:
map – Output map for the draft engine’s bindings.
io – Pipeline I/O (must be the SpecDecode-flavoured one).
res – Shared resources.
cfg – Draft engine configuration.
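The delegate-then-patch structure described for the draft engine might look roughly like the sketch below. It is a simplified model under stated assumptions: the types are stand-ins, the `allocated` flag, the `kvCaches` vector and all binding names are hypothetical, and the cfg parameter is omitted. Only the five PipelineIO member names and the use of kvCacheIndex=1 come from the description above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

struct Tensor { bool allocated = false; };  // hypothetical "populated" flag
using TensorMap = std::unordered_map<std::string, Tensor*>;

struct PipelineIO {
    Tensor hostSelectTokenIndices;
    // Populated only by the SpecDecode factory, per the precondition.
    Tensor baseHiddenStates, draftHiddenStatesIn, draftHiddenStatesOut;
    Tensor packedAttentionMask, specDecodePositionIds;
};
struct SharedResources { std::vector<Tensor> kvCaches; };  // assumed: one per engine

std::array<Tensor*, 5> draftOnlyTensors(PipelineIO& io) {
    return {&io.baseHiddenStates, &io.draftHiddenStatesIn, &io.draftHiddenStatesOut,
            &io.packedAttentionMask, &io.specDecodePositionIds};
}

// Stand-in for the common path; at() throws if the index is out of range.
void buildTensorMapSketch(TensorMap& map, PipelineIO& io, SharedResources& res,
                          std::int32_t kvCacheIndex) {
    map["host_select_token_indices"] = &io.hostSelectTokenIndices;
    map["kv_cache"] = &res.kvCaches.at(static_cast<std::size_t>(kvCacheIndex));
}

void buildDraftTensorMapSketch(TensorMap& map, PipelineIO& io, SharedResources& res) {
    // Precondition: the SpecDecode-only tensors must already be populated.
    for (Tensor* t : draftOnlyTensors(io)) {
        if (!t->allocated) throw std::logic_error("io was not built for SpecDecode");
    }
    buildTensorMapSketch(map, io, res, /*kvCacheIndex=*/1);  // common bindings
    // Draft-specific bindings patched in on top.
    map["base_hidden_states"] = &io.baseHiddenStates;
    map["draft_hidden_states_in"] = &io.draftHiddenStatesIn;
    map["draft_hidden_states_out"] = &io.draftHiddenStatesOut;
    map["packed_attention_mask"] = &io.packedAttentionMask;
    map["spec_decode_position_ids"] = &io.specDecodePositionIds;
}

// Usage: the draft map binds the cache at index 1 plus the draft-only tensors.
bool draftMapUsesSecondCache() {
    PipelineIO io;
    for (Tensor* t : draftOnlyTensors(io)) t->allocated = true;  // factory's job
    SharedResources res;
    res.kvCaches.resize(2);
    TensorMap map;
    buildDraftTensorMapSketch(map, io, res);
    assert(map.size() == 7);  // 2 common + 5 draft-specific bindings
    return map.at("kv_cache") == &res.kvCaches[1]
        && map.at("base_hidden_states") == &io.baseHiddenStates;
}
```

Checking the precondition before delegating keeps the map untouched when io came from the single-engine factory.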