LLM Builder

class LLMBuilder

Builder class for Large Language Model (LLM) TensorRT engines. It handles the complete process of building a TensorRT engine from an ONNX model for several kinds of LLMs: standard models, speculative-decoding models, and vision-language models (VLMs).

Public Functions

LLMBuilder(
    std::filesystem::path const &onnxDir,
    std::filesystem::path const &engineDir,
    LLMBuilderConfig const &config
)

Constructor for LLMBuilder.

Parameters:
  • onnxDir – Directory containing the ONNX model and configuration files

  • engineDir – Directory where the built engine and related files will be saved

  • config – Configuration object specifying build parameters
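A caller may want to fail fast on bad paths before constructing the builder. The following is a minimal sketch under stated assumptions: `checkBuildDirs` is a hypothetical helper, not part of LLMBuilder, and this page does not say whether LLMBuilder creates `engineDir` itself, so the sketch simply creates it when it is missing.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical pre-flight check, not part of LLMBuilder. Returns an empty
// string if onnxDir is an existing directory and engineDir exists or could be
// created; otherwise returns a short description of the problem.
std::string checkBuildDirs(fs::path const& onnxDir, fs::path const& engineDir) {
    std::error_code statusErr;
    if (!fs::is_directory(onnxDir, statusErr)) {
        return "onnxDir is not an existing directory: " + onnxDir.string();
    }
    std::error_code createErr;
    // Reports no error if engineDir already exists as a directory.
    fs::create_directories(engineDir, createErr);
    if (createErr) {
        return "cannot create engineDir: " + createErr.message();
    }
    return {};
}
```

Using the non-throwing `std::error_code` overloads keeps this check separate from build(), which is documented to throw on filesystem failures.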

~LLMBuilder() noexcept = default

Destructor.

bool build()

Builds the TensorRT engine from the ONNX model. This method performs the complete build process, including:

  • Loading and parsing the ONNX model

  • Setting up optimization profiles

  • Building the TensorRT engine

  • Copying necessary files to the engine directory

Throws:

std::filesystem::filesystem_error – If filesystem operations fail

Returns:

true if the build succeeded, false otherwise.
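Because build() reports failure in two ways, a false return value and a thrown std::filesystem::filesystem_error, calling code usually needs to handle both. A minimal sketch, assuming the caller wants a process exit code; `runBuildStep` and the exit-code values are hypothetical:

```cpp
#include <filesystem>
#include <functional>
#include <iostream>

// Hypothetical wrapper for any step that follows the build() contract:
// returns true on success, false on failure, and may throw
// std::filesystem::filesystem_error. Maps the outcome to an exit code.
int runBuildStep(std::function<bool()> const& step) {
    try {
        if (!step()) {
            std::cerr << "engine build failed\n";
            return 1;
        }
        return 0;
    } catch (std::filesystem::filesystem_error const& e) {
        std::cerr << "filesystem error during build: " << e.what() << '\n';
        return 2;
    }
}
```

With the real class, the call would look like `runBuildStep([&] { return builder.build(); })`, where `builder` is an LLMBuilder constructed with an ONNX directory, an engine directory, and an LLMBuilderConfig.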

struct LLMBuilderConfig

Configuration structure for LLM model building. Contains all parameters needed to configure the TensorRT engine building process for Large Language Models, including standard and speculative-decoding engines.

Public Functions

inline Json toJson() const

Convert configuration to JSON format for serialization.

Returns:

JSON object containing all configuration parameters

inline std::string toString() const

Convert configuration to human-readable string format.

Returns:

String representation of the configuration, intended for debugging and logging.

Public Members

int64_t maxInputLen = {1024}

Maximum input sequence length for the model.

bool specDraft = {false}

Whether this is a speculative-decoding draft model.

bool specBase = {false}

Whether this is a speculative-decoding base model.

int64_t maxBatchSize = {4}

Maximum batch size for inference.

int64_t maxLoraRank = {0}

Maximum LoRA rank; 0 disables LoRA support.

int64_t maxKVCacheCapacity = {4096}

Maximum KV-cache capacity, expressed as a sequence length.

int64_t maxVerifyTreeSize = {60}

Maximum length of the input_ids passed to the speculative-decoding base model for verification.

int64_t maxDraftTreeSize = {60}

Maximum length of the input_ids passed to the speculative-decoding draft model for draft generation.

bool useTrtNativeOps = {false}

Whether to use TensorRT native operations instead of custom plugins.

bool profilingDetailed = {false}

Whether to enable detailed profiling verbosity so that per-layer information can be extracted.
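The members above can be summarized in a small stand-in struct. This sketch is hypothetical: `LLMBuilderConfigSketch`, `checkConfig`, and `makeSpecConfig` are illustrative names, not part of the library, and the consistency rules in `checkConfig` (for example, that `specDraft` and `specBase` are not both set, or that `maxInputLen` does not exceed `maxKVCacheCapacity`) are assumptions rather than documented constraints. It mirrors the documented default values and shows how the configs for a speculative-decoding base/draft pair differ.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in mirroring the documented LLMBuilderConfig members
// and default values; it is NOT the real struct.
struct LLMBuilderConfigSketch {
    std::int64_t maxInputLen = 1024;
    bool specDraft = false;
    bool specBase = false;
    std::int64_t maxBatchSize = 4;
    std::int64_t maxLoraRank = 0;          // 0 disables LoRA
    std::int64_t maxKVCacheCapacity = 4096;
    std::int64_t maxVerifyTreeSize = 60;   // spec base model (verification)
    std::int64_t maxDraftTreeSize = 60;    // spec draft model (drafting)
    bool useTrtNativeOps = false;
    bool profilingDetailed = false;
};

// Hypothetical sanity checks; these rules are assumptions for illustration.
std::vector<std::string> checkConfig(LLMBuilderConfigSketch const& c) {
    std::vector<std::string> problems;
    if (c.specDraft && c.specBase) {
        problems.push_back("specDraft and specBase are both set");
    }
    if (c.maxInputLen <= 0 || c.maxBatchSize <= 0) {
        problems.push_back("maxInputLen and maxBatchSize must be positive");
    }
    if (c.maxInputLen > c.maxKVCacheCapacity) {
        problems.push_back("maxInputLen exceeds maxKVCacheCapacity");
    }
    if (c.maxLoraRank < 0) {
        problems.push_back("maxLoraRank must be >= 0");
    }
    return problems;
}

// Returns the config for one half of a speculative-decoding pair:
// the draft model when draft is true, otherwise the base (verifier) model.
LLMBuilderConfigSketch makeSpecConfig(bool draft) {
    LLMBuilderConfigSketch c;
    c.specDraft = draft;
    c.specBase = !draft;
    return c;
}
```

A speculative-decoding setup would then build two engines, one from `makeSpecConfig(false)` and one from `makeSpecConfig(true)`, each with its own ONNX and engine directories.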

Public Static Functions

static inline LLMBuilderConfig fromJson(Json const &json)

Create configuration from JSON format.

Parameters:

json – JSON object containing configuration parameters

Throws:

nlohmann::json::type_error – If JSON value types don’t match expected types

Returns:

LLMBuilderConfig object with parsed parameters