Engine Executor
-
class EngineExecutor
Thin TensorRT (TRT) wrapper with a prepare/execute split.
EngineExecutor owns a TRT runtime, engine, and execution context. It replaces both LLMEngineRunner and EagleDraftEngineRunner with a single model-agnostic wrapper (~300 LOC).
prepare() sets the optimization profile and delegates binding to TensorRegistry::bindAll(). execute() replays a cached CUDA graph when the binding state matches, otherwise falls back to enqueueV3. captureGraph() captures a new CUDA graph for the current bindings.
EngineExecutor knows nothing about models, phases, or features.
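The prepare/execute/capture split described above implies a per-step call order. The following is a minimal sketch of that order, not the library's own driver code. `runStep` and `FakeExecutor` are hypothetical names introduced for illustration; the helper is a template so it compiles without TensorRT or CUDA headers, and treating a failed capture as non-fatal is an assumption based on the documented enqueueV3 fallback.

```cpp
#include <cstdint>

// Hypothetical helper: drives one inference step through any executor type
// that exposes the documented prepare()/captureGraph()/execute() methods.
template <typename Executor, typename Dims, typename Map, typename Stream>
bool runStep(Executor& exec, int32_t profileIndex, Dims const& dims,
             Map const& map, Stream stream, bool captureFirst)
{
    // 1. Select the optimization profile and bind tensors for this step.
    if (!exec.prepare(profileIndex, dims, map, stream))
    {
        return false;
    }
    // 2. Optionally capture a graph for the bindings prepare() just set.
    //    Assumption: a failed capture is non-fatal, because execute() is
    //    documented to fall back to enqueueV3 when no graph matches.
    if (captureFirst)
    {
        (void) exec.captureGraph(stream);
    }
    // 3. Replay a matching graph if one exists, otherwise enqueue directly.
    return exec.execute(stream);
}

// Stand-in used only to exercise runStep without a GPU; not the real class.
struct FakeExecutor
{
    int prepared{0};
    int captured{0};
    int executed{0};

    bool prepare(int32_t, int const&, int const&, int) { ++prepared; return true; }
    bool captureGraph(int) { ++captured; return true; }
    bool execute(int) { ++executed; return prepared > 0; }
};
```

With the real class, `Dims`, `Map`, and `Stream` would presumably be `InferenceDims`, `TensorMap`, and `cudaStream_t`, and capture would typically happen once per binding state rather than on every step.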
Public Functions
-
~EngineExecutor() noexcept
Destructor. Destroys all captured CUDA graphs.
-
EngineExecutor(EngineExecutor const&) = delete
-
EngineExecutor &operator=(EngineExecutor const&) = delete
-
bool prepare(int32_t profileIndex, InferenceDims const &dims, TensorMap const &map, cudaStream_t stream)
Switch optimization profile, resolve shapes, bind all tensors.
- Parameters:
profileIndex – TRT optimization profile index
dims – Symbolic dimension values for this step
map – Name-to-tensor mapping
stream – CUDA stream for the async profile switch
- Returns:
True on success
-
bool execute(cudaStream_t stream)
Execute inference.
Replays a cached CUDA graph if one matches the current bindings, otherwise falls back to enqueueV3.
- Parameters:
stream – CUDA stream
- Returns:
True on success
-
bool captureGraph(cudaStream_t stream)
Capture a CUDA graph for the current binding state (after prepare()).
Performs a warmup enqueue, then captures via cudaStreamBeginCapture. The captured graph is keyed by a binding hash with full snapshot verification.
- Parameters:
stream – CUDA stream (must not be the default stream)
- Returns:
True if capture succeeded
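A cache "keyed by a binding hash with full snapshot verification" can be sketched as follows. This is an illustration of the general technique under stated assumptions, not the actual implementation: `Dims` is a stand-in for `nvinfer1::Dims`, and `Snapshot`, `hashSnapshot`, and `GraphCache` are hypothetical names. The hash only narrows the lookup; the stored snapshot is then compared in full, so a hash collision cannot cause a graph recorded for different addresses or shapes to be replayed.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for nvinfer1::Dims so the sketch compiles without TensorRT.
struct Dims
{
    int32_t nbDims{0};
    int64_t d[8]{};
};

// Compare only the first nbDims extents; trailing slots are unused.
inline bool sameDims(Dims const& a, Dims const& b) noexcept
{
    if (a.nbDims != b.nbDims) return false;
    for (int32_t i = 0; i < a.nbDims; ++i)
    {
        if (a.d[i] != b.d[i]) return false;
    }
    return true;
}

// Hypothetical analogue of BindingSnapshot: one (address, shape) per binding.
struct Snapshot
{
    std::vector<std::pair<uintptr_t, Dims>> bindings;

    bool operator==(Snapshot const& rhs) const noexcept
    {
        if (bindings.size() != rhs.bindings.size()) return false;
        for (std::size_t i = 0; i < bindings.size(); ++i)
        {
            if (bindings[i].first != rhs.bindings[i].first) return false;
            if (!sameDims(bindings[i].second, rhs.bindings[i].second)) return false;
        }
        return true;
    }
};

// Fold every address and extent into one hash value (boost-style combine).
inline std::size_t hashSnapshot(Snapshot const& s) noexcept
{
    std::size_t h = 0;
    auto mix = [&h](std::size_t v) {
        h ^= v + static_cast<std::size_t>(0x9e3779b97f4a7c15ULL) + (h << 6) + (h >> 2);
    };
    for (auto const& b : s.bindings)
    {
        mix(std::hash<uintptr_t>{}(b.first));
        mix(std::hash<int32_t>{}(b.second.nbDims));
        for (int32_t i = 0; i < b.second.nbDims; ++i)
        {
            mix(std::hash<int64_t>{}(b.second.d[i]));
        }
    }
    return h;
}

// Graph would be an executable-graph handle in practice (an assumption).
template <typename Graph>
class GraphCache
{
public:
    void insert(Snapshot snap, Graph graph)
    {
        std::size_t const key = hashSnapshot(snap);
        mEntries.erase(key); // a newer capture replaces an older one
        mEntries.emplace(key, Entry{std::move(snap), std::move(graph)});
    }

    // Returns the graph only if the stored snapshot matches exactly.
    Graph const* find(Snapshot const& snap) const
    {
        auto it = mEntries.find(hashSnapshot(snap));
        if (it == mEntries.end() || !(it->second.snapshot == snap)) return nullptr;
        return &it->second.graph;
    }

private:
    struct Entry
    {
        Snapshot snapshot;
        Graph graph;
    };
    std::unordered_map<std::size_t, Entry> mEntries;
};
```

A miss from `find` corresponds to the documented fallback path in execute(), where the engine is enqueued directly instead of replaying a graph.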
-
int64_t getRequiredContextMemorySize() const
Query required device memory for the execution context.
- Returns:
Required memory size in bytes
-
bool setContextMemory(Tensor &sharedMem)
Provide shared device memory for the execution context.
- Parameters:
sharedMem – Tensor whose memory will back the TRT context
- Returns:
True on success
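One plausible use of the getRequiredContextMemorySize()/setContextMemory() pair is to back several executors with a single buffer. The sketch below computes the size such a buffer would need, assuming the executors never run concurrently (for example, a base and a draft engine that alternate). `sharedContextBytes` and `FakeSizedExecutor` are hypothetical names; the source does not state how the buffer is sized or allocated.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: size of one buffer large enough for any executor in
// the list. Assumes the executors are never executing at the same time.
template <typename Executor>
int64_t sharedContextBytes(std::vector<Executor const*> const& executors)
{
    int64_t bytes = 0;
    for (Executor const* e : executors)
    {
        bytes = std::max(bytes, e->getRequiredContextMemorySize());
    }
    return bytes;
}

// Stand-in exposing only the documented size query; not the real class.
struct FakeSizedExecutor
{
    int64_t required;
    int64_t getRequiredContextMemorySize() const { return required; }
};
```

After allocating a Tensor of at least that many bytes, each executor would be handed the same Tensor through setContextMemory(), checking the boolean result each time.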
-
int32_t getNumIOTensors() const
Return the number of I/O tensors in the engine.
-
char const *getIOTensorName(int32_t index) const
Return the name of the i-th I/O tensor.
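The two accessors above are enough to enumerate every I/O tensor by name. A minimal sketch, written as a template so it works with any type that exposes the documented signatures: `listIOTensorNames` and `FakeIOExecutor` are hypothetical, and the null check is a defensive assumption rather than documented behavior.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: copy all I/O tensor names out of an executor.
template <typename Executor>
std::vector<std::string> listIOTensorNames(Executor const& exec)
{
    std::vector<std::string> names;
    int32_t const n = exec.getNumIOTensors();
    names.reserve(static_cast<std::size_t>(n > 0 ? n : 0));
    for (int32_t i = 0; i < n; ++i)
    {
        char const* name = exec.getIOTensorName(i);
        if (name != nullptr) // defensive: skip indices without a name
        {
            names.emplace_back(name);
        }
    }
    return names;
}

// Stand-in with the same two accessors; not the real class.
struct FakeIOExecutor
{
    std::vector<std::string> names;

    int32_t getNumIOTensors() const { return static_cast<int32_t>(names.size()); }
    char const* getIOTensorName(int32_t i) const
    {
        return names[static_cast<std::size_t>(i)].c_str();
    }
};
```

Each returned name can then be passed to getBindingDataType() or getProfileShape() to inspect that binding.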
-
nvinfer1::DataType getBindingDataType(char const *name) const
Return the data type of a named binding.
-
nvinfer1::Dims getProfileShape(char const *name, int32_t profileIndex, nvinfer1::OptProfileSelector selector)
Return a profile shape (min/opt/max) for a named binding.
-
nvinfer1::ICudaEngine const &getEngine() const noexcept
Access the underlying TRT engine for generic introspection.
Public Static Functions
-
static std::unique_ptr<EngineExecutor> createForLLM(std::filesystem::path const &enginePath, LLMEngineConfig const &cfg, std::optional<int32_t> specDecodeBaseOutputHiddenDim = std::nullopt)
Build an EngineExecutor for a vanilla single-engine LLM or a SpecDecode base engine. The factory builds the TensorRegistry internally via buildRegistryForLLM(cfg).
-
static std::unique_ptr<EngineExecutor> createForSpecDecodeDraft(std::filesystem::path const &enginePath, DeploymentConfig const &bundle)
Build an EngineExecutor for the SpecDecode draft engine. The factory builds the TensorRegistry internally via buildRegistryForSpecDecodeDraft(bundle).
-
struct BindingSnapshot
Snapshot of all binding addresses and shapes, used for graph-cache verification.
Public Functions
-
bool operator==(BindingSnapshot const &rhs) const noexcept
Public Members
-
std::vector<std::pair<uintptr_t, nvinfer1::Dims>> bindings