Pipeline IO#

struct PipelineIO#

All tensors flowing through the inference pipeline.

POINTER STABILITY INVARIANT: After buildTensorMap() is called, this struct must not be moved, and deepstackEmbeds must not be resized. TensorMap holds Tensor* pointers into these members, so any move or reallocation invalidates them.
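The invariant can be illustrated with a minimal sketch. It uses hypothetical stand-in types (MockTensor, MockTensorMap, MockPipelineIO) and made-up binding names, not the real Tensor, TensorMap, or PipelineIO definitions, and only shows why a map of raw member pointers stops matching the struct once the struct is moved or the vector is resized.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins; the real Tensor / TensorMap types are not shown in this page.
struct MockTensor
{
    int id = 0;
};
using MockTensorMap = std::unordered_map<std::string, MockTensor*>;

struct MockPipelineIO
{
    MockTensor inputsEmbeds;
    std::vector<MockTensor> deepstackEmbeds;
};

// Record raw pointers to the members, the way a name-to-pointer map would.
// The binding names used here are assumptions for illustration only.
MockTensorMap buildMockTensorMap(MockPipelineIO& io)
{
    MockTensorMap map;
    map["inputs_embeds"] = &io.inputsEmbeds;
    for (std::size_t i = 0; i < io.deepstackEmbeds.size(); ++i)
    {
        map["deepstack_embeds_" + std::to_string(i)] = &io.deepstackEmbeds[i];
    }
    return map;
}

// True when every recorded pointer still refers to the matching member of `io`.
// A size mismatch (the vector was resized) is reported before any pointer is compared.
bool mapStillPointsInto(MockTensorMap const& map, MockPipelineIO const& io)
{
    if (map.size() != 1 + io.deepstackEmbeds.size())
    {
        return false;
    }
    if (map.at("inputs_embeds") != &io.inputsEmbeds)
    {
        return false;
    }
    for (std::size_t i = 0; i < io.deepstackEmbeds.size(); ++i)
    {
        if (map.at("deepstack_embeds_" + std::to_string(i)) != &io.deepstackEmbeds[i])
        {
            return false;
        }
    }
    return true;
}
```

In this sketch, a map built from one MockPipelineIO no longer matches a second object the struct was moved into, because the direct members of the new object live at different addresses. One way to respect the invariant is to construct the struct in its final location and finish sizing deepstackEmbeds before building the map.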

Public Members

Tensor inputsEmbeds#

Tensor outputLogits#

Tensor selectTokenIndices#

Tensor contextLengths#: Resides in GPU (device) memory.

Tensor hostContextLengths#: CPU (pinned, [maxBatch] INT32)

Tensor hostSelectTokenIndices#: CPU (pinned, [maxBatch, 1] INT64). Pairs with selectTokenIndices for host-to-device (H2D) staging.

std::vector<Tensor> deepstackEmbeds#

Tensor mropeCosSin#

Tensor baseHiddenStates#

Tensor draftHiddenStatesIn#

Tensor draftHiddenStatesOut#

Tensor outputHiddenStates#

Tensor prefillEmbedsBackup#

Tensor packedAttentionMask#: Packed proposal attention mask, [batch, proposalSize, divUp(proposalSize, 32)] INT32. Written by proposal/verify input preparation kernels; consumed by the base and draft engines via the kAttentionMask binding.
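The last dimension of the packed mask, divUp(proposalSize, 32), is the number of 32-bit words needed to hold one mask row at one bit per column. The following host-side sketch shows that arithmetic. The bit order (bit j of word w covers column w * 32 + j) and the helper names packMaskRow and packedMaskWordCount are assumptions for illustration; the actual layout is defined by the input preparation kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Ceiling division for non-negative a and positive b.
constexpr int32_t divUp(int32_t a, int32_t b)
{
    return (a + b - 1) / b;
}

// Number of 32-bit words in a [batch, proposalSize, divUp(proposalSize, 32)] mask.
int64_t packedMaskWordCount(int32_t batch, int32_t proposalSize)
{
    return static_cast<int64_t>(batch) * proposalSize * divUp(proposalSize, 32);
}

// Pack one mask row (one bool per column) into 32-bit words.
// ASSUMPTION: least-significant bit first within each word.
// The words are kept unsigned here; an INT32 tensor would hold the same bit patterns.
std::vector<uint32_t> packMaskRow(std::vector<bool> const& row)
{
    int32_t const proposalSize = static_cast<int32_t>(row.size());
    std::vector<uint32_t> words(static_cast<std::size_t>(divUp(proposalSize, 32)), 0u);
    for (int32_t col = 0; col < proposalSize; ++col)
    {
        if (row[static_cast<std::size_t>(col)])
        {
            words[static_cast<std::size_t>(col / 32)] |= (1u << (col % 32));
        }
    }
    return words;
}
```

For example, a proposal size of 40 needs two words per row, so a batch of 2 occupies 2 * 40 * 2 = 160 words.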

Tensor specDecodePositionIds#: SpecDecode position IDs, [batch, proposalSize] INT32. Written by proposal/verify input preparation kernels; consumed by the base and draft engines via the kAttentionPosId binding.

Public Static Functions

static PipelineIO createForLLM( LLMEngineConfig const &cfg, cudaStream_t stream )#: Build PipelineIO for the vanilla single-engine LLM runtime (basic I/O tensors, deepstack embeds, MRope cos/sin cache).

static PipelineIO createForSpecDecode( DeploymentConfig const &bundle, int32_t maxRuntimeBatchSize, cudaStream_t stream )#: Build PipelineIO for a two-engine speculative-decoding runtime (basic I/O, hidden states, deepstack embeds, MRope cos/sin cache).

void trt_edgellm::rt::allocateBasicIO( PipelineIO &io, int32_t maxBatch, int32_t maxSeq, int32_t hiddenSize, int32_t vocabSize, nvinfer1::DataType dtype )#

void trt_edgellm::rt::allocateDeepstackEmbeds( PipelineIO &io, int32_t numFeatures, int32_t maxBatch, int32_t maxSeq, int32_t hiddenSize, nvinfer1::DataType dtype )#

void trt_edgellm::rt::allocateSpecDecodeHiddenStates( PipelineIO &io, int32_t maxBatch, int32_t maxSeq, int32_t baseHiddenDim, int32_t draftHiddenDim, nvinfer1::DataType dtype )#

void trt_edgellm::rt::allocateMRope( PipelineIO &io, int32_t maxBatch, int32_t maxKVCacheCapacity, int32_t rotaryDim )#

void trt_edgellm::rt::buildTensorMap( TensorMap &map, PipelineIO &io, SharedResources &res, LLMEngineConfig const &cfg, int32_t kvCacheIndex )#

Populate a TensorMap from PipelineIO + SharedResources for engine binding.

This is the critical glue function that wires all allocated tensors into the name-to-pointer map consumed by TensorRegistry::bindAll().

Parameters:

map – Output map to populate.
io – Pipeline I/O tensors.
res – Shared resources (KV caches, RoPE pool, LoRA, zero buffer).
cfg – Engine configuration.
kvCacheIndex – Index into res.cacheManagers for the target engine.
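The wiring described above can be sketched with mock types. Everything in this block (MockTensor, MockSharedResources, the binding name strings, the bounds check) is a hypothetical stand-in chosen for illustration and does not reproduce the library's actual types or binding names. It only shows the idea of registering member addresses under names and using kvCacheIndex to pick one entry of res.cacheManagers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the real runtime types.
struct MockTensor
{
    std::string label;
};
using MockTensorMap = std::unordered_map<std::string, MockTensor*>;

struct MockPipelineIO
{
    MockTensor inputsEmbeds{"inputsEmbeds"};
    MockTensor outputLogits{"outputLogits"};
};

struct MockCacheManager
{
    MockTensor kvCache;
};

struct MockSharedResources
{
    std::vector<MockCacheManager> cacheManagers;
};

// Populate `map` with name -> pointer entries for one engine.
// The binding names below are assumptions, not the library's constants.
void buildMockTensorMap(MockTensorMap& map, MockPipelineIO& io, MockSharedResources& res, int32_t kvCacheIndex)
{
    if (kvCacheIndex < 0 || static_cast<std::size_t>(kvCacheIndex) >= res.cacheManagers.size())
    {
        throw std::out_of_range("kvCacheIndex does not name a cache manager");
    }
    map["inputs_embeds"] = &io.inputsEmbeds;
    map["logits"] = &io.outputLogits;
    // The index selects which engine's KV cache is bound.
    map["kv_cache"] = &res.cacheManagers[static_cast<std::size_t>(kvCacheIndex)].kvCache;
}
```

Because the map stores addresses of members of io and res, both objects have to outlive the map and stay in place, which is the pointer stability invariant stated for PipelineIO.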

void trt_edgellm::rt::buildTensorMapForSpecDecodeDraft( TensorMap &map, PipelineIO &io, SharedResources &res, LLMEngineConfig const &cfg )#

Populate a TensorMap for a SpecDecode draft engine. Delegates to buildTensorMap with kvCacheIndex=1 for the common bindings, then patches in draft-engine-specific bindings (base/draft hidden states in+out, packed proposal attention mask, proposal position IDs).

Preconditions: io must have been constructed via PipelineIO::createForSpecDecode, so that baseHiddenStates, draftHiddenStatesIn, draftHiddenStatesOut, packedAttentionMask, and specDecodePositionIds are populated.

Parameters:

map – Output map for the draft engine’s bindings.
io – Pipeline I/O (must be the SpecDecode-flavoured one).
res – Shared resources.
cfg – Draft engine configuration.
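The delegate-then-patch structure described for the draft engine can be sketched as follows. All types, binding name strings, the `allocated` flag, and the helper makeMockSpecDecodeIO are hypothetical and exist only for this illustration. The only details taken from the description are the use of kvCacheIndex=1 and the set of draft-specific tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the real runtime types.
struct MockTensor
{
    bool allocated = false;
};
using MockTensorMap = std::unordered_map<std::string, MockTensor*>;

struct MockPipelineIO
{
    MockTensor inputsEmbeds;
    MockTensor baseHiddenStates;
    MockTensor draftHiddenStatesIn;
    MockTensor draftHiddenStatesOut;
    MockTensor packedAttentionMask;
    MockTensor specDecodePositionIds;
};

struct MockSharedResources
{
    std::vector<MockTensor> kvCaches;
};

constexpr int32_t kDraftKVCacheIndex = 1; // the description states kvCacheIndex=1 for the draft engine

// Stand-in for PipelineIO::createForSpecDecode: marks every tensor as allocated.
MockPipelineIO makeMockSpecDecodeIO()
{
    MockPipelineIO io;
    for (MockTensor* t : {&io.inputsEmbeds, &io.baseHiddenStates, &io.draftHiddenStatesIn,
             &io.draftHiddenStatesOut, &io.packedAttentionMask, &io.specDecodePositionIds})
    {
        t->allocated = true;
    }
    return io;
}

// Common bindings shared by base and draft engines (binding names are assumptions).
void buildCommonBindings(MockTensorMap& map, MockPipelineIO& io, MockSharedResources& res, int32_t kvCacheIndex)
{
    map["inputs_embeds"] = &io.inputsEmbeds;
    map["kv_cache"] = &res.kvCaches.at(static_cast<std::size_t>(kvCacheIndex));
}

// Check the precondition, delegate for the common bindings, then patch in the draft-only ones.
void buildDraftBindings(MockTensorMap& map, MockPipelineIO& io, MockSharedResources& res)
{
    for (MockTensor const* t : {&io.baseHiddenStates, &io.draftHiddenStatesIn, &io.draftHiddenStatesOut,
             &io.packedAttentionMask, &io.specDecodePositionIds})
    {
        if (!t->allocated)
        {
            throw std::logic_error("io was not built for speculative decoding");
        }
    }
    buildCommonBindings(map, io, res, kDraftKVCacheIndex);
    map["base_hidden_states"] = &io.baseHiddenStates;
    map["draft_hidden_states_in"] = &io.draftHiddenStatesIn;
    map["draft_hidden_states_out"] = &io.draftHiddenStatesOut;
    map["attention_mask"] = &io.packedAttentionMask;
    map["attention_pos_id"] = &io.specDecodePositionIds;
}
```

Checking the precondition before touching the map keeps a failed call from leaving a partially populated map behind, which is a design choice of this sketch rather than documented behavior.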