Shared Resources

struct SharedResources

Process-lifetime resources shared across runners.

Public Members

std::vector<std::unique_ptr<HybridCacheManager>> cacheManagers

One HybridCacheManager per engine (index 0 = base, 1 = draft for SpecDecode). Each manager is held by unique_ptr because HybridCacheManager is move-only.
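The layout described above can be sketched as follows. This is a minimal illustration, not the library's code: `FakeCacheManager` and `makeManagers` are hypothetical stand-ins, since the real HybridCacheManager constructor is not shown in this section. A side effect of holding each manager through a unique_ptr is that the pointed-to object keeps its address when the vector grows.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for HybridCacheManager: movable but not copyable.
struct FakeCacheManager
{
    explicit FakeCacheManager(int idx) : engineIndex(idx) {}
    FakeCacheManager(FakeCacheManager const&) = delete;
    FakeCacheManager& operator=(FakeCacheManager const&) = delete;
    FakeCacheManager(FakeCacheManager&&) = default;
    FakeCacheManager& operator=(FakeCacheManager&&) = default;

    int engineIndex;
};

// Builds the per-engine list: index 0 = base engine, index 1 = draft engine
// (the draft entry is only added for speculative decoding).
std::vector<std::unique_ptr<FakeCacheManager>> makeManagers(bool withDraft)
{
    std::vector<std::unique_ptr<FakeCacheManager>> managers;
    managers.push_back(std::make_unique<FakeCacheManager>(0));
    if (withDraft)
    {
        managers.push_back(std::make_unique<FakeCacheManager>(1));
    }
    return managers;
}
```

Calling `makeManagers(false)` models the single-engine case and `makeManagers(true)` the base + draft case. A raw pointer taken from `managers[0].get()` stays valid if a second entry is pushed later, because only the unique_ptr handles move when the vector reallocates.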

RopeCache ropePool
std::unique_ptr<LoRAManager> loraManager
std::unique_ptr<ExternalWeightManager> externalWeightManager
Tensor zeroBuffer
std::vector<std::vector<Tensor>> kCacheViews

Split-K/V view cache, populated only when cfg.useTrtNativeOps == true. KVCacheManager::getSeparateKVCache returns views by value (built at call time from a raw pointer + offset), so the split-KV bindings need storage with stable addresses behind their TensorMap entries. Outer index = engine index (0 = base, 1 = draft); inner index = local attention-layer index.

Growth contract:

  • Outer vector is grown per engine the first time buildTensorMap runs for that engine (0 -> 1 on base, 1 -> 2 on draft). At most 2 outer entries total.

  • Inner vectors are cleared with clear() and then given capacity with reserve(numAttn) at the start of each buildTensorMap call. The subsequent push_back calls stay within the reserved capacity, so no reallocation occurs and the addresses stored in TensorMap remain stable.
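The growth contract can be illustrated with standard containers alone. The sketch below is an assumption-laden model, not the actual buildTensorMap: `FakeTensor` and `rebuildViews` are hypothetical names, and the returned pointer list merely stands in for the addresses a TensorMap would hold.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Tensor; the real type is not shown in this section.
struct FakeTensor
{
    int layer;
};

// Models one refresh of the view cache for a single engine and returns the
// element addresses that a binding table would store.
std::vector<FakeTensor const*> rebuildViews(
    std::vector<std::vector<FakeTensor>>& views, std::size_t engineIdx, std::size_t numAttn)
{
    // Outer vector grows once per engine (0 -> 1 on base, 1 -> 2 on draft).
    if (views.size() <= engineIdx)
    {
        views.resize(engineIdx + 1);
    }

    std::vector<FakeTensor>& inner = views[engineIdx];
    inner.clear();          // drop views from the previous call
    inner.reserve(numAttn); // guarantee capacity for every layer up front

    std::vector<FakeTensor const*> bound;
    bound.reserve(numAttn);
    for (std::size_t layer = 0; layer < numAttn; ++layer)
    {
        // Stays within reserved capacity, so earlier addresses are not invalidated.
        inner.push_back(FakeTensor{static_cast<int>(layer)});
        bound.push_back(&inner.back());
    }
    return bound;
}
```

Because `reserve` runs before the first `push_back`, every pointer collected in the loop still refers to a live element when the function returns. Skipping the `reserve` call would allow a reallocation partway through the loop, leaving the earlier pointers dangling.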

std::vector<std::vector<Tensor>> vCacheViews

Public Static Functions

static std::unique_ptr<SharedResources> createForLLM(
LLMEngineConfig const &cfg,
std::unordered_map<std::string, std::string> const &loraWeightsMap,
cudaStream_t stream
)

Build SharedResources for the vanilla single-engine LLM runtime (KV cache, RoPE pool, LoRA manager, external weight manager, zero buffer).

Recurrent and conv state dtypes for hybrid models are read from cfg.recurrentStateDtype and cfg.convStateDtype, which parseEngineConfig parses strictly from config.json.

The returned externalWeightManager is constructed by this factory; the runtime is responsible for loading files, validating against the base engine, and publishing it to a TensorMap. This keeps SharedResources decoupled from EngineExecutor (no engine I/O or validation happens inside this factory).

static std::unique_ptr<SharedResources> createForSpecDecode(
DeploymentConfig const &bundle,
int32_t maxRuntimeBatchSize,
std::unordered_map<std::string, std::string> const &loraWeightsMap,
cudaStream_t stream
)

Build SharedResources for a two-engine speculative-decoding runtime (base + draft KV caches, shared RoPE pool, LoRA manager, external weight manager, zero buffer).

As with createForLLM, the returned externalWeightManager is constructed by this factory, and the runtime must load files, validate against the base engine, and publish it to a TensorMap. External weights currently apply to the base engine only.

void trt_edgellm::rt::allocateZeroBuffer(
SharedResources &res,
int64_t bytes
)