Engine Executor

class EngineExecutor

Thin TensorRT (TRT) wrapper with a prepare/execute split.

EngineExecutor owns a TRT runtime, engine, and execution context. It replaces both LLMEngineRunner and EagleDraftEngineRunner with a single model-agnostic wrapper (~300 LOC).

prepare() sets the optimization profile and delegates binding to TensorRegistry::bindAll().
execute() replays a cached CUDA graph when the binding state matches, otherwise falls back to enqueueV3.
captureGraph() captures a new CUDA graph for the current bindings.

EngineExecutor knows nothing about models, phases, or features.

Public Functions

~EngineExecutor() noexcept

Destructor. Destroys all captured CUDA graphs.

EngineExecutor(EngineExecutor const&) = delete

EngineExecutor &operator=(EngineExecutor const&) = delete

bool prepare(int32_t profileIndex, InferenceDims const &dims, TensorMap const &map, cudaStream_t stream)

Switches the optimization profile, resolves shapes, and binds all tensors.

Parameters:

profileIndex – TRT optimization profile index
dims – Symbolic dimension values for this step
map – Name-to-tensor mapping
stream – CUDA stream for the async profile switch

Returns:

True on success
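The "resolve shapes" step can be pictured as substituting per-step values for symbolic axis names. The sketch below is only an illustration under stated assumptions: `Axis`, `resolveShape`, and the use of a plain `std::map` for the dimension values are hypothetical, and the actual layout of InferenceDims and the logic inside TensorRegistry::bindAll() are not described in this reference.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical shape spec: each axis is either a fixed extent or a symbol name.
using Axis = std::variant<std::int64_t, std::string>;

// Resolve a symbolic shape against per-step dimension values.
// Returns std::nullopt when a symbol has no value for this step.
inline std::optional<std::vector<std::int64_t>> resolveShape(
    std::vector<Axis> const& spec, std::map<std::string, std::int64_t> const& dims)
{
    std::vector<std::int64_t> resolved;
    resolved.reserve(spec.size());
    for (Axis const& axis : spec)
    {
        if (auto const* fixed = std::get_if<std::int64_t>(&axis))
        {
            resolved.push_back(*fixed); // fixed extent, copied as-is
            continue;
        }
        auto const it = dims.find(std::get<std::string>(axis));
        if (it == dims.end())
        {
            return std::nullopt; // unknown symbol: report failure to the caller
        }
        resolved.push_back(it->second);
    }
    return resolved;
}
```

In this toy model, a failed lookup would correspond to prepare() returning false before any tensor is bound.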

bool execute(cudaStream_t stream)

Execute inference.

Replays a cached CUDA graph if one matches the current bindings, otherwise falls back to enqueueV3.

Parameters: stream – CUDA stream
Returns: True on success

bool captureGraph(cudaStream_t stream)

Capture a CUDA graph for the current binding state (after prepare()).

Performs a warmup enqueue, then captures via cudaStreamBeginCapture. The captured graph is keyed by a binding hash with full snapshot verification.

Parameters: stream – CUDA stream (must not be the default stream)
Returns: True if capture succeeded
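The replay-or-fallback decision described for execute() and captureGraph() can be sketched as a hash-keyed cache whose hits are confirmed by a full snapshot comparison. This is a minimal model, not the actual implementation: `ToySnapshot`, `hashSnapshot`, `ToyGraphCache`, and `ExecPath` are hypothetical stand-ins, the hash function is an assumption (the real one is not documented here), and no CUDA or TensorRT calls are made.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in binding state: one (device address, shape) pair per I/O tensor.
using ToyShape = std::vector<std::int64_t>;
using ToySnapshot = std::vector<std::pair<std::uintptr_t, ToyShape>>;

// Assumed hash: FNV-1a-style mixing over addresses, ranks, and extents.
inline std::uint64_t hashSnapshot(ToySnapshot const& snap)
{
    std::uint64_t h = 1469598103934665603ULL;
    auto mix = [&h](std::uint64_t v) { h ^= v; h *= 1099511628211ULL; };
    for (auto const& [addr, shape] : snap)
    {
        mix(static_cast<std::uint64_t>(addr));
        mix(static_cast<std::uint64_t>(shape.size()));
        for (std::int64_t extent : shape)
        {
            mix(static_cast<std::uint64_t>(extent));
        }
    }
    return h;
}

enum class ExecPath { kGraphReplay, kEnqueue };

class ToyGraphCache
{
public:
    // Models captureGraph(): remember the binding state a graph was captured for.
    // In this toy, a hash collision simply overwrites the older entry.
    void capture(ToySnapshot const& snap) { mGraphs[hashSnapshot(snap)] = snap; }

    // Models execute(): replay only when the hash hits AND the stored snapshot
    // is identical; otherwise take the regular enqueue path.
    ExecPath execute(ToySnapshot const& current) const
    {
        auto const it = mGraphs.find(hashSnapshot(current));
        if (it != mGraphs.end() && it->second == current)
        {
            return ExecPath::kGraphReplay;
        }
        return ExecPath::kEnqueue;
    }

private:
    std::unordered_map<std::uint64_t, ToySnapshot> mGraphs;
};
```

The second check matters because a captured graph bakes in device addresses and shapes; replaying it after a hash collision would read or write the wrong buffers, so a mismatch has to fall back to the enqueue path.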

int64_t getRequiredContextMemorySize() const

Query required device memory for the execution context.

Returns: Required memory size in bytes

bool setContextMemory(Tensor &sharedMem)

Provide shared device memory for the execution context.

Parameters: sharedMem – Tensor whose memory will back the TRT context
Returns: True on success
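One plausible way to use getRequiredContextMemorySize() together with setContextMemory() is to size a single buffer for several executors, for example a base and a draft engine. The helper below is a sketch under an explicit assumption: the executors never run concurrently, so one buffer as large as the biggest requirement can back each context in turn. `sharedContextBytes` is a hypothetical name, and the sizes are passed in as plain integers rather than queried from real executors.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Size of one scratch buffer able to back any of the contexts, assuming
// (hypothetically) that the contexts never execute at the same time.
inline std::int64_t sharedContextBytes(std::vector<std::int64_t> const& requiredSizes)
{
    std::int64_t maxBytes = 0;
    for (std::int64_t bytes : requiredSizes)
    {
        maxBytes = std::max(maxBytes, bytes);
    }
    return maxBytes;
}
```

If the executors could overlap in time, the sizes would need to be summed instead, each context getting its own region.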

int32_t getNumIOTensors() const

Return the number of I/O tensors in the engine.

char const *getIOTensorName(int32_t index) const

Return the name of the I/O tensor at the given index.

nvinfer1::DataType getBindingDataType(char const *name) const

Return the data type of a named binding.

nvinfer1::Dims getProfileShape(char const *name, int32_t profileIndex, nvinfer1::OptProfileSelector selector) const

Return a profile shape (min, opt, or max) for a named binding.

nvinfer1::ICudaEngine const &getEngine() const noexcept

Access the underlying TRT engine for generic introspection.
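A common use of the min and max shapes returned by getProfileShape() is to check, before calling prepare(), that a requested shape falls inside the profile. The sketch below assumes a stand-in `ProfileDims` struct (a rank plus a fixed-size extent array, loosely modeled on nvinfer1::Dims) so that it compiles without TensorRT headers; `fitsProfile` is a hypothetical helper, not part of EngineExecutor.

```cpp
#include <cstdint>

// Stand-in for nvinfer1::Dims so the sketch builds without TensorRT.
struct ProfileDims
{
    std::int32_t nbDims;
    std::int64_t d[8];
};

// True when `shape` has the same rank as both bounds and every extent lies
// within [minShape.d[i], maxShape.d[i]].
inline bool fitsProfile(ProfileDims const& shape, ProfileDims const& minShape, ProfileDims const& maxShape)
{
    if (shape.nbDims != minShape.nbDims || shape.nbDims != maxShape.nbDims)
    {
        return false;
    }
    for (std::int32_t i = 0; i < shape.nbDims; ++i)
    {
        if (shape.d[i] < minShape.d[i] || shape.d[i] > maxShape.d[i])
        {
            return false;
        }
    }
    return true;
}
```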

Public Static Functions

static std::unique_ptr<EngineExecutor> createForLLM(std::filesystem::path const &enginePath, LLMEngineConfig const &cfg, std::optional<int32_t> specDecodeBaseOutputHiddenDim = std::nullopt)

Build an EngineExecutor for a vanilla single-engine LLM or a SpecDecode base engine. The factory builds the TensorRegistry internally via buildRegistryForLLM(cfg).

static std::unique_ptr<EngineExecutor> createForSpecDecodeDraft(std::filesystem::path const &enginePath, DeploymentConfig const &bundle)

Build an EngineExecutor for the SpecDecode draft engine. The factory builds the TensorRegistry internally via buildRegistryForSpecDecodeDraft(bundle).

struct BindingSnapshot

Snapshot of all binding addresses and shapes, used for graph-cache verification.

Public Functions

bool operator==(BindingSnapshot const &rhs) const noexcept

Public Members

std::vector<std::pair<uintptr_t, nvinfer1::Dims>> bindings
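An equality operator over such a snapshot could compare bindings pairwise, matching addresses exactly and shapes only up to their rank. The sketch below is an assumption about how this might look, not the documented behavior: `ToyDims` and `ToyBindingSnapshot` are stand-ins, and ignoring extents beyond nbDims is a design choice of this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for nvinfer1::Dims: a rank plus a fixed-size extent array.
struct ToyDims
{
    std::int32_t nbDims;
    std::int64_t d[8];
};

// Compare only the first nbDims extents; trailing slots are treated as padding.
inline bool sameDims(ToyDims const& a, ToyDims const& b) noexcept
{
    if (a.nbDims != b.nbDims)
    {
        return false;
    }
    for (std::int32_t i = 0; i < a.nbDims; ++i)
    {
        if (a.d[i] != b.d[i])
        {
            return false;
        }
    }
    return true;
}

struct ToyBindingSnapshot
{
    std::vector<std::pair<std::uintptr_t, ToyDims>> bindings;

    // Equal when both snapshots hold the same number of bindings and each
    // position has the same address and the same shape.
    bool operator==(ToyBindingSnapshot const& rhs) const noexcept
    {
        if (bindings.size() != rhs.bindings.size())
        {
            return false;
        }
        for (std::size_t i = 0; i < bindings.size(); ++i)
        {
            if (bindings[i].first != rhs.bindings[i].first
                || !sameDims(bindings[i].second, rhs.bindings[i].second))
            {
                return false;
            }
        }
        return true;
    }
};
```

Comparing addresses as well as shapes reflects the fact that a captured CUDA graph is tied to specific device pointers, so a moved buffer invalidates the cached graph even when shapes are unchanged.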
