Engine Executor

class EngineExecutor

Thin TensorRT (TRT) wrapper that splits inference into separate prepare and execute steps.

EngineExecutor owns a TRT runtime, engine, and execution context. It replaces both LLMEngineRunner and EagleDraftEngineRunner with a single model-agnostic wrapper (~300 LOC).

EngineExecutor knows nothing about models, phases, or features.

Public Functions

~EngineExecutor() noexcept

Destructor — destroys all captured CUDA graphs.

EngineExecutor(EngineExecutor const&) = delete
EngineExecutor &operator=(EngineExecutor const&) = delete
bool prepare(int32_t profileIndex, InferenceDims const &dims, TensorMap const &map, cudaStream_t stream)

Switch optimization profile, resolve shapes, bind all tensors.

Parameters:
  • profileIndex – TRT optimization profile index
  • dims – Symbolic dimension values for this step
  • map – Name-to-tensor mapping
  • stream – CUDA stream for the async profile switch

Returns:

True on success

bool execute(cudaStream_t stream)

Execute inference.

Replays a cached CUDA graph if one matches the current bindings, otherwise falls back to enqueueV3.

Parameters:

stream – CUDA stream

Returns:

True on success
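The prepare/execute split described above can be sketched as a small per-step helper. This is an illustration only: `runStep` and `FakeExecutor` are hypothetical names that are not part of this API, and the template parameters `Dims`, `Map`, and `Stream` stand in for `InferenceDims`, `TensorMap`, and `cudaStream_t` so the sketch compiles without TensorRT or CUDA headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: runs one inference step through any type that exposes
// the documented prepare()/execute() pair.
template <class Executor, class Dims, class Map, class Stream>
bool runStep(Executor& exec, int32_t profileIndex, Dims const& dims,
             Map const& map, Stream stream)
{
    assert(profileIndex >= 0 && "optimization profile indices are non-negative");

    // Step 1: switch profile, resolve shapes, bind tensors.
    if (!exec.prepare(profileIndex, dims, map, stream))
    {
        return false; // binding failed, so nothing is launched
    }
    // Step 2: launch with whatever bindings prepare() left in place.
    return exec.execute(stream);
}

// Hypothetical stand-in used only to exercise runStep without TensorRT.
struct FakeExecutor
{
    bool prepareOk{true};
    int executeCalls{0};

    bool prepare(int32_t, int const&, int const&, void*) { return prepareOk; }
    bool execute(void*)
    {
        ++executeCalls;
        return true;
    }
};
```

Both calls return `bool`, so the helper stops after a failed `prepare()` rather than executing with stale bindings.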

bool captureGraph(cudaStream_t stream)

Capture a CUDA graph for the current binding state (after prepare()).

Performs a warmup enqueue, then captures via cudaStreamBeginCapture. The captured graph is keyed by a binding hash with full snapshot verification.

Parameters:

stream – CUDA stream (must not be the default stream)

Returns:

True if capture succeeded
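One way to use `captureGraph()` is a one-time setup step per binding configuration: call `prepare()`, then attempt a capture, and rely on the documented `enqueueV3` fallback in `execute()` if capture does not succeed. The sketch below assumes that a value-initialized stream handle denotes the default stream. `prepareAndCapture`, `CaptureResult`, and `FakeCaptureExecutor` are hypothetical names, and the template parameters stand in for the real argument types.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for one capture attempt.
enum class CaptureResult { kPrepareFailed, kEagerOnly, kGraphCaptured };

// Hypothetical helper: binds tensors, then tries to capture a graph for them.
template <class Executor, class Dims, class Map, class Stream>
CaptureResult prepareAndCapture(Executor& exec, int32_t profileIndex,
                                Dims const& dims, Map const& map, Stream stream)
{
    assert(profileIndex >= 0 && "optimization profile indices are non-negative");

    // captureGraph() records the binding state left by the preceding prepare().
    if (!exec.prepare(profileIndex, dims, map, stream))
    {
        return CaptureResult::kPrepareFailed;
    }
    // Assumption: Stream{} is the default stream, which must not be captured.
    if (stream == Stream{})
    {
        return CaptureResult::kEagerOnly;
    }
    // A failed capture is not fatal: execute() can still enqueue directly.
    return exec.captureGraph(stream) ? CaptureResult::kGraphCaptured
                                     : CaptureResult::kEagerOnly;
}

// Hypothetical stand-in used only to exercise the helper without CUDA.
struct FakeCaptureExecutor
{
    bool prepareOk{true};
    bool captureOk{true};
    int captureCalls{0};

    bool prepare(int32_t, int const&, int const&, void*) { return prepareOk; }
    bool captureGraph(void*)
    {
        ++captureCalls;
        return captureOk;
    }
};
```

Treating a failed capture as `kEagerOnly` keeps inference working on the slower path instead of turning a missing graph into an error.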

int64_t getRequiredContextMemorySize() const

Query required device memory for the execution context.

Returns:

Required memory size in bytes

bool setContextMemory(Tensor &sharedMem)

Provide shared device memory for the execution context.

Parameters:

sharedMem – Tensor whose memory will back the TRT context

Returns:

True on success
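A possible use of the two context-memory calls above is to size one buffer for the largest requirement and hand it to several executors. The sketch assumes that the executors sharing the buffer never run concurrently, which this page does not state. `sharedContextBytes`, `attachSharedContextMemory`, and `FakeMemExecutor` are hypothetical names, and a `std::vector<char>` stands in for `Tensor`, whose allocation API is not covered here.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: the largest context-memory requirement in the group.
template <class Executor>
int64_t sharedContextBytes(std::vector<Executor*> const& executors)
{
    int64_t bytes = 0;
    for (Executor const* exec : executors)
    {
        bytes = std::max(bytes, exec->getRequiredContextMemorySize());
    }
    return bytes;
}

// Hypothetical helper: backs every executor with the same buffer.
// Stops at the first failure and reports it to the caller.
template <class Executor, class Tensor>
bool attachSharedContextMemory(std::vector<Executor*> const& executors,
                               Tensor& sharedMem)
{
    for (Executor* exec : executors)
    {
        if (!exec->setContextMemory(sharedMem))
        {
            return false;
        }
    }
    return true;
}

// Hypothetical stand-in used only to exercise the helpers without TensorRT.
struct FakeMemExecutor
{
    int64_t required{0};
    void const* backing{nullptr};

    int64_t getRequiredContextMemorySize() const { return required; }
    bool setContextMemory(std::vector<char>& mem)
    {
        if (static_cast<int64_t>(mem.size()) < required)
        {
            return false; // buffer too small for this context
        }
        backing = mem.data();
        return true;
    }
};
```

Sizing by the maximum rather than the sum is what makes the buffer shared; it is only safe under the no-concurrency assumption stated above.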

int32_t getNumIOTensors() const

Return the number of I/O tensors in the engine.

char const *getIOTensorName(int32_t index) const

Return the name of the i-th I/O tensor.
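The two introspection calls above combine into a simple enumeration loop. The sketch below copies each name into a `std::string` so the result does not depend on the lifetime of the returned pointer. `ioTensorNames` and `FakeIOExecutor` are hypothetical names that are not part of this API.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: collects the names of all I/O tensors, in index order.
template <class Executor>
std::vector<std::string> ioTensorNames(Executor const& exec)
{
    std::vector<std::string> names;
    int32_t const count = exec.getNumIOTensors();
    if (count > 0)
    {
        names.reserve(static_cast<std::size_t>(count));
    }
    for (int32_t i = 0; i < count; ++i)
    {
        char const* name = exec.getIOTensorName(i);
        if (name != nullptr) // defensive: skip entries without a name
        {
            names.emplace_back(name);
        }
    }
    return names;
}

// Hypothetical stand-in used only to exercise the helper without TensorRT.
struct FakeIOExecutor
{
    std::vector<char const*> names;

    int32_t getNumIOTensors() const { return static_cast<int32_t>(names.size()); }
    char const* getIOTensorName(int32_t index) const
    {
        return names[static_cast<std::size_t>(index)];
    }
};
```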

nvinfer1::DataType getBindingDataType(char const *name) const

Return the data type of a named binding.

nvinfer1::Dims getProfileShape(char const *name, int32_t profileIndex, nvinfer1::OptProfileSelector selector) const

Return a profile shape (min/opt/max) for a named binding.
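A typical reason to query profile shapes is to check that a requested shape lies inside a profile's min/max range before binding it. The sketch below shows that check for any type with `nbDims` and `d` members, as `nvinfer1::Dims` has. `withinProfile` and `FakeDims` are hypothetical names introduced for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: true if every extent of `shape` lies in
// [minShape, maxShape] and all three shapes have the same rank.
template <class Dims>
bool withinProfile(Dims const& shape, Dims const& minShape, Dims const& maxShape)
{
    if (shape.nbDims != minShape.nbDims || shape.nbDims != maxShape.nbDims)
    {
        return false; // rank mismatch can never fit the profile
    }
    for (int32_t i = 0; i < shape.nbDims; ++i)
    {
        if (shape.d[i] < minShape.d[i] || shape.d[i] > maxShape.d[i])
        {
            return false;
        }
    }
    return true;
}

// Hypothetical stand-in that mirrors the nbDims/d members of nvinfer1::Dims.
struct FakeDims
{
    int32_t nbDims;
    int64_t d[8];
};
```

With the real API, `minShape` and `maxShape` would come from `getProfileShape()` called with `nvinfer1::OptProfileSelector::kMIN` and `nvinfer1::OptProfileSelector::kMAX` for the same binding name and profile index.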

nvinfer1::ICudaEngine const &getEngine() const noexcept

Access the underlying TRT engine for generic introspection.

Public Static Functions

static std::unique_ptr<EngineExecutor> createForLLM(std::filesystem::path const &enginePath, LLMEngineConfig const &cfg, std::optional<int32_t> specDecodeBaseOutputHiddenDim = std::nullopt)

Build an EngineExecutor for a vanilla single-engine LLM or a SpecDecode base engine. The factory builds the TensorRegistry internally via buildRegistryForLLM(cfg).

static std::unique_ptr<EngineExecutor> createForSpecDecodeDraft(std::filesystem::path const &enginePath, DeploymentConfig const &bundle)

Build an EngineExecutor for the SpecDecode draft engine. The factory builds the TensorRegistry internally via buildRegistryForSpecDecodeDraft(bundle).

struct BindingSnapshot

Snapshot of all binding addresses and shapes — used for graph-cache verification.

Public Functions

bool operator==(BindingSnapshot const &rhs) const noexcept

Public Members

std::vector<std::pair<uintptr_t, nvinfer1::Dims>> bindings
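The graph-cache verification this snapshot supports can be sketched as a hash-keyed map whose lookup also compares the full snapshot, so a hash collision cannot replay a graph captured for different bindings. This is a minimal illustration, not the library's implementation: `SketchDims`, `SketchSnapshot`, `hashSnapshot`, and `GraphCache` are hypothetical names, the hash function is an arbitrary choice, and an `int` id stands in for a captured graph handle.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in that mirrors the nbDims/d members of nvinfer1::Dims.
struct SketchDims
{
    int32_t nbDims;
    int64_t d[8];
};

// Same layout idea as BindingSnapshot: one (address, shape) pair per binding.
struct SketchSnapshot
{
    std::vector<std::pair<uintptr_t, SketchDims>> bindings;

    bool operator==(SketchSnapshot const& rhs) const noexcept
    {
        if (bindings.size() != rhs.bindings.size()) return false;
        for (std::size_t i = 0; i < bindings.size(); ++i)
        {
            auto const& a = bindings[i];
            auto const& b = rhs.bindings[i];
            if (a.first != b.first || a.second.nbDims != b.second.nbDims) return false;
            // Only the first nbDims extents are meaningful.
            for (int32_t k = 0; k < a.second.nbDims; ++k)
            {
                if (a.second.d[k] != b.second.d[k]) return false;
            }
        }
        return true;
    }
};

// Combines every address and extent into one key (hash choice is arbitrary).
inline std::size_t hashSnapshot(SketchSnapshot const& snap) noexcept
{
    std::size_t h = 0;
    auto mix = [&h](std::size_t v) { h ^= v + 0x9e3779b9u + (h << 6) + (h >> 2); };
    for (auto const& binding : snap.bindings)
    {
        mix(std::hash<uintptr_t>{}(binding.first));
        for (int32_t k = 0; k < binding.second.nbDims; ++k)
        {
            mix(std::hash<int64_t>{}(binding.second.d[k]));
        }
    }
    return h;
}

class GraphCache
{
public:
    void insert(std::size_t key, SketchSnapshot snap, int graphId)
    {
        entries_[key] = Entry{std::move(snap), graphId};
    }

    // Returns the cached graph id, or -1 if the key is absent or the stored
    // snapshot differs from the current bindings (a hash collision).
    int find(std::size_t key, SketchSnapshot const& current) const
    {
        auto it = entries_.find(key);
        if (it == entries_.end() || !(it->second.snapshot == current)) return -1;
        return it->second.graphId;
    }

private:
    struct Entry
    {
        SketchSnapshot snapshot;
        int graphId;
    };
    std::unordered_map<std::size_t, Entry> entries_;
};
```

The key is passed explicitly so that a collision can be simulated: looking up one snapshot under another snapshot's key returns -1 instead of the wrong graph.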