Shared Resources
-
struct SharedResources
Process-lifetime resources shared across runners.
Public Members
-
std::vector<std::unique_ptr<HybridCacheManager>> cacheManagers
One HybridCacheManager per engine (index 0 = base, 1 = draft for SpecDecode). Held by unique_ptr because HybridCacheManager is move-only.
-
std::unique_ptr<LoRAManager> loraManager
-
std::unique_ptr<ExternalWeightManager> externalWeightManager
-
std::vector<std::vector<Tensor>> kCacheViews
Split-K/V view cache. Only populated when cfg.useTrtNativeOps == true. KVCacheManager::getSeparateKVCache returns views by value (built at call time from a raw pointer + offset), so the split-KV bindings still need stable addresses behind TensorMap entries. Outer index = engine index (0 = base, 1 = draft); inner index = local attention-layer index.
Growth contract:
- The outer vector is grown per engine the first time buildTensorMap runs for that engine (0 -> 1 on base, 1 -> 2 on draft). At most 2 outer entries total.
- Inner vectors are clear()ed and then reserve(numAttn)'d at the start of each buildTensorMap call. The subsequent push_backs stay within the reserved capacity, so no reallocation occurs and the addresses stored in TensorMap remain stable.
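The growth contract above can be illustrated with plain std::vector. This is a minimal sketch, not the runtime's code: TensorView, rebuildViews, and addressesStable are hypothetical names introduced here, and the real Tensor type and buildTensorMap logic are not shown in this reference. It only demonstrates why clear() followed by reserve(numAttn) keeps element addresses fixed while exactly numAttn elements are pushed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the runtime's Tensor view type.
struct TensorView
{
    void* data{nullptr};
    std::size_t bytes{0};
};

// Rebuilds the per-layer views for one engine and returns the element
// addresses that a binding table would store (hypothetical helper).
std::vector<TensorView const*> rebuildViews(
    std::vector<std::vector<TensorView>>& views, std::size_t engineIdx, std::size_t numAttn)
{
    // Grow the outer vector only the first time this engine index is seen.
    if (views.size() <= engineIdx)
    {
        views.resize(engineIdx + 1);
    }

    auto& layerViews = views[engineIdx];
    layerViews.clear();
    layerViews.reserve(numAttn); // capacity is now at least numAttn

    std::vector<TensorView const*> bound;
    bound.reserve(numAttn);
    for (std::size_t i = 0; i < numAttn; ++i)
    {
        // size() never exceeds the reserved capacity, so push_back cannot
        // reallocate and earlier addresses stay valid.
        layerViews.push_back(TensorView{nullptr, i});
        bound.push_back(&layerViews.back());
    }
    return bound;
}

// Returns true if every recorded address still points at its element.
bool addressesStable(std::vector<TensorView> const& layerViews, std::vector<TensorView const*> const& bound)
{
    if (layerViews.size() != bound.size())
    {
        return false;
    }
    for (std::size_t i = 0; i < bound.size(); ++i)
    {
        if (&layerViews[i] != bound[i])
        {
            return false;
        }
    }
    return true;
}
```

Pushing more than numAttn elements after the reserve would break the guarantee, which is why the contract ties the reserve size to the number of attention layers.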
Public Static Functions
- static std::unique_ptr<SharedResources> createForLLM(LLMEngineConfig const &cfg, std::unordered_map<std::string, std::string> const &loraWeightsMap, cudaStream_t stream)
Build SharedResources for the vanilla single-engine LLM runtime (KV cache, RoPE pool, LoRA manager, external weight manager, zero buffer).
Recurrent / conv state dtypes for hybrid models are read from cfg.recurrentStateDtype / cfg.convStateDtype; they are parsed strictly from config.json by parseEngineConfig.
The returned externalWeightManager is constructed by this factory; the runtime is responsible for loading files, validating against the base engine, and publishing it to a TensorMap. This keeps SharedResources decoupled from EngineExecutor (no engine I/O or validation happens inside this factory).
- static std::unique_ptr<SharedResources> createForSpecDecode(DeploymentConfig const &bundle, int32_t maxRuntimeBatchSize, std::unordered_map<std::string, std::string> const &loraWeightsMap, cudaStream_t stream)
Build SharedResources for a two-engine speculative-decoding runtime (base + draft KV caches, shared RoPE pool, LoRA manager, external weight manager, zero buffer).
As with createForLLM, the returned externalWeightManager is constructed by this factory, and the runtime must load files, validate against the base engine, and publish it to a TensorMap. External weights currently apply to the base engine only.
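The split of responsibilities described for both factories (the factory constructs, the runtime loads, validates, and publishes) might be sketched as follows. Everything in this block is a hypothetical stand-in: CacheManagerStub, ExternalWeightsStub, ResourcesStub, createStub, runtimeAttachWeights, and the kBaseEngine / kDraftEngine constants are illustrative names, not part of the documented API, and the real manager interfaces are not shown in this reference.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for HybridCacheManager.
struct CacheManagerStub
{
    std::string role;
};

// Hypothetical stand-in for ExternalWeightManager.
struct ExternalWeightsStub
{
    std::vector<std::string> files; // filled in by the runtime, not the factory
    bool validated{false};
    bool published{false};
};

// Hypothetical stand-in for SharedResources.
struct ResourcesStub
{
    std::vector<std::unique_ptr<CacheManagerStub>> cacheManagers;
    std::unique_ptr<ExternalWeightsStub> externalWeightManager;
};

// Index convention from the member docs: 0 = base, 1 = draft.
constexpr std::size_t kBaseEngine = 0;
constexpr std::size_t kDraftEngine = 1;

// Factory step: construct objects only. No file I/O and no validation.
std::unique_ptr<ResourcesStub> createStub(bool withDraft)
{
    auto res = std::make_unique<ResourcesStub>();
    res->cacheManagers.push_back(std::make_unique<CacheManagerStub>(CacheManagerStub{"base"}));
    if (withDraft)
    {
        res->cacheManagers.push_back(std::make_unique<CacheManagerStub>(CacheManagerStub{"draft"}));
    }
    res->externalWeightManager = std::make_unique<ExternalWeightsStub>();
    return res;
}

// Runtime step: record the files, validate against the base engine, then
// publish. Returns false and publishes nothing if validation fails.
bool runtimeAttachWeights(ResourcesStub& res, std::vector<std::string> files, bool baseEngineAccepts)
{
    auto& ewm = *res.externalWeightManager;
    ewm.files = std::move(files);
    ewm.validated = baseEngineAccepts;
    ewm.published = ewm.validated;
    return ewm.published;
}
```

Keeping the factory free of engine I/O means a resource bundle can be constructed and inspected before any engine is loaded; the runtime decides later whether the weights are attached at all.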
- void trt_edgellm::rt::allocateZeroBuffer(SharedResources &res, int64_t bytes)