Pipeline IO

struct PipelineIO

Holds all tensors that flow through the inference pipeline.

Pointer stability invariant: after buildTensorMap() is called, this struct must not be moved and deepstackEmbeds must not be resized. TensorMap holds Tensor* pointers into these members, so moving the struct or reallocating the vector invalidates them.
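The invariant can be illustrated with a minimal sketch. `Tensor` and `MiniIO` below are hypothetical stand-ins, not the library's types; the sketch only shows the general C++ behaviour that makes the rule necessary: growing a `std::vector` past its capacity relocates its elements, and moving a struct gives its members new addresses.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in; the real Tensor owns device or pinned host memory.
struct Tensor { int id = 0; };

// Hypothetical, reduced version of PipelineIO with two members.
struct MiniIO {
    Tensor inputsEmbeds;
    std::vector<Tensor> deepstackEmbeds;
};

// Address of a tensor as an integer, so it stays safe to compare
// even after the storage it referred to has been released.
std::uintptr_t addressOf(Tensor const& t) {
    return reinterpret_cast<std::uintptr_t>(&t);
}

// Returns true if growing deepstackEmbeds past its capacity moved element 0.
bool growthRelocatesElements(MiniIO& io) {
    if (io.deepstackEmbeds.empty()) {
        io.deepstackEmbeds.resize(1);
    }
    std::uintptr_t const before = addressOf(io.deepstackEmbeds[0]);
    // Exceeding capacity() forces the vector to reallocate its buffer.
    io.deepstackEmbeds.resize(io.deepstackEmbeds.capacity() + 1);
    return addressOf(io.deepstackEmbeds[0]) != before;
}

// Returns true if moving the struct gave inputsEmbeds a new address.
bool moveRelocatesMembers(MiniIO& io) {
    std::uintptr_t const before = addressOf(io.inputsEmbeds);
    MiniIO moved = std::move(io);
    return addressOf(moved.inputsEmbeds) != before;
}
```

Any `Tensor*` recorded before either operation would no longer point at the live member afterwards, which is why the struct should be fully allocated and placed in its final location before the map is built.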

Public Members

Tensor inputsEmbeds
Tensor outputLogits
Tensor selectTokenIndices
Tensor contextLengths

GPU.

Tensor hostContextLengths

CPU (pinned, [maxBatch] INT32)

Tensor hostSelectTokenIndices

CPU (pinned, [maxBatch, 1] INT64). Pairs with selectTokenIndices for host-to-device (H2D) staging.

std::vector<Tensor> deepstackEmbeds
Tensor mropeCosSin
Tensor baseHiddenStates
Tensor draftHiddenStatesIn
Tensor draftHiddenStatesOut
Tensor outputHiddenStates
Tensor prefillEmbedsBackup
Tensor packedAttentionMask

Packed proposal attention mask, [batch, proposalSize, divUp(proposalSize, 32)] INT32. Written by proposal/verify input preparation kernels; consumed by the base and draft engines via the kAttentionMask binding.
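The last dimension, divUp(proposalSize, 32), follows from storing one mask bit per column in 32-bit words. The sketch below shows that arithmetic for a single batch entry. It assumes a square [proposalSize, proposalSize] boolean mask and an LSB-first bit order (column `c` goes to bit `c % 32` of word `c / 32`); neither is confirmed by the text above, and `packMask` is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Ceiling division for non-negative a and positive b.
constexpr std::int32_t divUp(std::int32_t a, std::int32_t b) {
    return (a + b - 1) / b;
}

// Packs an n x n boolean mask into n rows of divUp(n, 32) words each.
// Assumption: column c is stored in bit (c % 32) of word (c / 32).
// uint32_t is used for the bit operations; the stored bits are the same
// as they would be in an INT32 tensor.
std::vector<std::uint32_t> packMask(std::vector<std::vector<bool>> const& mask) {
    std::int32_t const n = static_cast<std::int32_t>(mask.size());
    std::int32_t const wordsPerRow = divUp(n, 32);
    std::vector<std::uint32_t> packed(static_cast<std::size_t>(n) * wordsPerRow, 0u);
    for (std::int32_t row = 0; row < n; ++row) {
        for (std::int32_t col = 0; col < n; ++col) {
            if (mask[row][col]) {
                packed[row * wordsPerRow + col / 32] |= (1u << (col % 32));
            }
        }
    }
    return packed;
}
```

With this layout a proposalSize of 32 or less needs one word per row, and 33 needs two.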

Tensor specDecodePositionIds

SpecDecode position IDs, [batch, proposalSize] INT32. Written by proposal/verify input preparation kernels; consumed by the base and draft engines via the kAttentionPosId binding.

Public Static Functions

static PipelineIO createForLLM(
LLMEngineConfig const &cfg,
cudaStream_t stream
)

Build PipelineIO for the vanilla single-engine LLM runtime (basic I/O tensors, deepstack embeds, MRope cos/sin cache).

static PipelineIO createForSpecDecode(
DeploymentConfig const &bundle,
int32_t maxRuntimeBatchSize,
cudaStream_t stream
)

Build PipelineIO for a two-engine speculative-decoding runtime (basic I/O, hidden states, deepstack embeds, MRope cos/sin cache).

void trt_edgellm::rt::allocateBasicIO(
PipelineIO &io,
int32_t maxBatch,
int32_t maxSeq,
int32_t hiddenSize,
int32_t vocabSize,
nvinfer1::DataType dtype
)

void trt_edgellm::rt::allocateDeepstackEmbeds(
PipelineIO &io,
int32_t numFeatures,
int32_t maxBatch,
int32_t maxSeq,
int32_t hiddenSize,
nvinfer1::DataType dtype
)

void trt_edgellm::rt::allocateSpecDecodeHiddenStates(
PipelineIO &io,
int32_t maxBatch,
int32_t maxSeq,
int32_t baseHiddenDim,
int32_t draftHiddenDim,
nvinfer1::DataType dtype
)

void trt_edgellm::rt::allocateMRope(
PipelineIO &io,
int32_t maxBatch,
int32_t maxKVCacheCapacity,
int32_t rotaryDim
)

void trt_edgellm::rt::buildTensorMap(
TensorMap &map,
PipelineIO &io,
SharedResources &res,
LLMEngineConfig const &cfg,
int32_t kvCacheIndex
)

Populate a TensorMap from PipelineIO + SharedResources for engine binding.

This is the critical glue function that wires all allocated tensors into the name-to-pointer map consumed by TensorRegistry::bindAll().

Parameters:
  • map – Output map to populate.

  • io – Pipeline I/O tensors.

  • res – Shared resources (KV caches, RoPE pool, LoRA, zero buffer).

  • cfg – Engine configuration.

  • kvCacheIndex – Index into res.cacheManagers for the target engine.
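The name-to-pointer wiring described above can be sketched roughly as follows. All types here (`Tensor`, `MiniIO`, `MiniResources`, `MiniTensorMap`) are hypothetical reductions of the real ones, the binding names are invented for illustration, and the `cfg` parameter is omitted; the sketch only shows how `kvCacheIndex` might select one per-engine resource while the I/O tensors are shared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the library types.
struct Tensor { std::string label; };
struct MiniIO { Tensor inputsEmbeds; Tensor outputLogits; };
struct MiniResources { std::vector<Tensor> kvCaches; };

// Binding name -> non-owning pointer to a tensor owned elsewhere.
using MiniTensorMap = std::unordered_map<std::string, Tensor*>;

// Records the address of each tensor under an (invented) binding name.
// kvCacheIndex picks which engine's KV cache is bound; at() throws
// std::out_of_range if the index is invalid.
void buildMiniTensorMap(MiniTensorMap& map, MiniIO& io, MiniResources& res,
                        std::int32_t kvCacheIndex) {
    map["inputs_embeds"] = &io.inputsEmbeds;
    map["logits"] = &io.outputLogits;
    map["kv_cache"] = &res.kvCaches.at(static_cast<std::size_t>(kvCacheIndex));
}
```

Because the map stores raw pointers, `io` and `res` have to outlive it and stay at the same addresses for as long as the bindings are in use.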

void trt_edgellm::rt::buildTensorMapForSpecDecodeDraft(
TensorMap &map,
PipelineIO &io,
SharedResources &res,
LLMEngineConfig const &cfg
)

Populate a TensorMap for a SpecDecode draft engine. Delegates to buildTensorMap with kvCacheIndex=1 for the common bindings, then patches in draft-engine-specific bindings (base hidden states, draft hidden states in and out, packed proposal attention mask, proposal position IDs).

Preconditions: io must have been constructed via PipelineIO::createForSpecDecode (baseHiddenStates / draftHiddenStatesIn/Out / packedAttentionMask / specDecodePositionIds populated).

Parameters:
  • map – Output map for the draft engine’s bindings.

  • io – Pipeline I/O (must be the SpecDecode-flavoured one).

  • res – Shared resources.

  • cfg – Draft engine configuration.
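The delegate-then-patch pattern and the precondition can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: the types, the `allocated` flag, the binding names, and the choice to throw on a violated precondition are all hypothetical. Only the draft KV cache index of 1 comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins; `allocated` models "populated by createForSpecDecode".
struct Tensor { bool allocated = false; };
struct MiniIO {
    Tensor inputsEmbeds;
    Tensor baseHiddenStates;
    Tensor packedAttentionMask;
};
struct MiniResources { std::vector<Tensor> kvCaches; };
using MiniTensorMap = std::unordered_map<std::string, Tensor*>;

// Common bindings shared by every engine (names are invented).
void buildCommonMap(MiniTensorMap& map, MiniIO& io, MiniResources& res,
                    std::size_t kvCacheIndex) {
    map["inputs_embeds"] = &io.inputsEmbeds;
    map["kv_cache"] = &res.kvCaches.at(kvCacheIndex);
}

// Draft-engine map: check the precondition, delegate, then patch.
void buildDraftMap(MiniTensorMap& map, MiniIO& io, MiniResources& res) {
    if (!io.baseHiddenStates.allocated || !io.packedAttentionMask.allocated) {
        throw std::invalid_argument("io was not set up for speculative decoding");
    }
    constexpr std::size_t kDraftKvCacheIndex = 1;  // per the description above
    buildCommonMap(map, io, res, kDraftKvCacheIndex);
    // Draft-specific bindings added on top of the common ones.
    map["base_hidden_states"] = &io.baseHiddenStates;
    map["attention_mask"] = &io.packedAttentionMask;
}
```

Checking the precondition before any binding is written keeps the map untouched when `io` was built for the single-engine runtime.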