
usage: trtllm-build [-h] [--checkpoint_dir CHECKPOINT_DIR] [--model_config MODEL_CONFIG]
                    [--build_config BUILD_CONFIG] [--model_cls_file MODEL_CLS_FILE]
                    [--model_cls_name MODEL_CLS_NAME] [--output_dir OUTPUT_DIR] [--max_batch_size MAX_BATCH_SIZE]
                    [--max_input_len MAX_INPUT_LEN] [--max_seq_len MAX_SEQ_LEN] [--max_beam_width MAX_BEAM_WIDTH]
                    [--max_num_tokens MAX_NUM_TOKENS] [--opt_num_tokens OPT_NUM_TOKENS]
                    [--max_encoder_input_len MAX_ENCODER_INPUT_LEN]
                    [--max_prompt_embedding_table_size MAX_PROMPT_EMBEDDING_TABLE_SIZE]
                    [--kv_cache_type KV_CACHE_TYPE] [--paged_kv_cache PAGED_KV_CACHE]
                    [--input_timing_cache INPUT_TIMING_CACHE] [--output_timing_cache OUTPUT_TIMING_CACHE]
                    [--builder_opt {0,1,2,3,4,5}] [--profiling_verbosity {layer_names_only,detailed,none}]
                    [--builder_force_num_profiles BUILDER_FORCE_NUM_PROFILES] [--strip_plan] [--weight_sparsity]
                    [--weight_streaming] [--fast_build] [--workers WORKERS]
                    [--log_level {internal_error,error,warning,info,verbose,debug}] [--enable_debug_output]
                    [--visualize_network] [--dry_run] [--logits_dtype {float16,float32}] [--gather_context_logits]
                    [--gather_generation_logits] [--gather_all_token_logits] [--lora_dir LORA_DIR [LORA_DIR ...]]
                    [--lora_ckpt_source {hf,nemo}]
                    [--lora_target_modules {attn_qkv,attn_q,attn_k,attn_v,attn_dense,mlp_h_to_4h,mlp_4h_to_h,mlp_gate,cross_attn_qkv,cross_attn_q,cross_attn_k,cross_attn_v,cross_attn_dense,moe_h_to_4h,moe_4h_to_h,moe_gate,moe_router,mlp_router} [{attn_qkv,attn_q,attn_k,attn_v,attn_dense,mlp_h_to_4h,mlp_4h_to_h,mlp_gate,cross_attn_qkv,cross_attn_q,cross_attn_k,cross_attn_v,cross_attn_dense,moe_h_to_4h,moe_4h_to_h,moe_gate,moe_router,mlp_router} ...]]
                    [--max_lora_rank MAX_LORA_RANK]
                    [--speculative_decoding_mode {draft_tokens_external,lookahead_decoding,medusa,explicit_draft_tokens}]
                    [--max_draft_len MAX_DRAFT_LEN] [--auto_parallel AUTO_PARALLEL] [--gpus_per_node GPUS_PER_NODE]
                    [--cluster_key {A100-SXM-80GB,A100-SXM-40GB,A100-PCIe-80GB,A100-PCIe-40GB,H100-SXM,H100-PCIe,H20,V100-PCIe-16GB,V100-PCIe-32GB,V100-SXM-16GB,V100-SXM-32GB,V100S-PCIe,A40,A30,A10,A10G,L40S,L40,L20,L4,L2}]
                    [--bert_attention_plugin {auto,float16,float32,bfloat16,int32,disable}]
                    [--gpt_attention_plugin {auto,float16,float32,bfloat16,int32,disable}]
                    [--gemm_plugin {auto,float16,float32,bfloat16,int32,fp8,disable}]
                    [--gemm_swiglu_plugin {fp8,disable}]
                    [--fp8_rowwise_gemm_plugin {auto,float16,float32,bfloat16,int32,disable}]
                    [--nccl_plugin {auto,float16,float32,bfloat16,int32,disable}]
                    [--lookup_plugin {auto,float16,float32,bfloat16,int32,disable}]
                    [--lora_plugin {auto,float16,float32,bfloat16,int32,disable}]
                    [--moe_plugin {auto,float16,float32,bfloat16,int32,disable}]
                    [--mamba_conv1d_plugin {auto,float16,float32,bfloat16,int32,disable}]
                    [--low_latency_gemm_plugin {fp8,disable}] [--context_fmha {enable,disable}]
                    [--bert_context_fmha_fp32_acc {enable,disable}] [--remove_input_padding {enable,disable}]
                    [--reduce_fusion {enable,disable}] [--enable_xqa {enable,disable}]
                    [--tokens_per_block TOKENS_PER_BLOCK] [--use_paged_context_fmha {enable,disable}]
                    [--use_fp8_context_fmha {enable,disable}] [--multiple_profiles {enable,disable}]
                    [--paged_state {enable,disable}] [--streamingllm {enable,disable}]
                    [--use_fused_mlp {enable,disable}]

Named Arguments


--checkpoint_dir

The directory path that contains the TensorRT-LLM checkpoint.


--model_config

The path to the file that stores the TensorRT-LLM checkpoint config.


--build_config

The path to the file that stores the TensorRT-LLM build config.


--model_cls_file

The path to the file that defines the customized TensorRT-LLM model.


--model_cls_name

The class name of the customized TensorRT-LLM model.


--output_dir

The directory path in which to save the serialized engine files and the engine config file.

Default: 'engine_outputs'


--max_batch_size

Maximum number of requests that the engine can schedule.

Default: 2048


--max_input_len

Maximum input length of one request.

Default: 1024

--max_seq_len, --max_decoder_seq_len

Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.


--max_beam_width

Maximum number of beams for beam search decoding.

Default: 1


--max_num_tokens

Maximum number of batched input tokens after padding is removed in each batch. Input padding is currently removed by default; you can explicitly disable this by specifying --remove_input_padding disable.

Default: 8192
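
The sizing limits above are usually passed together. As a minimal sketch, the following Python helper assembles a trtllm-build argument list from keyword limits. The helper name `build_command` and the paths `./ckpt` and `./engine` are assumptions made for illustration; only the flag names come from this reference.

```python
import shlex


def build_command(checkpoint_dir, output_dir, **limits):
    """Assemble a trtllm-build argument list (hypothetical helper)."""
    cmd = [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", output_dir,
    ]
    # Each keyword becomes "--<name> <value>", in the order given.
    for name, value in limits.items():
        cmd += [f"--{name}", str(value)]
    return cmd


# Illustrative values only; tune them to your own workload.
cmd = build_command(
    "./ckpt",
    "./engine",
    max_batch_size=8,
    max_input_len=1024,
    max_seq_len=2048,
    max_num_tokens=8192,
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where trtllm-build is installed.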


--opt_num_tokens

Optimal number of batched input tokens after padding is removed in each batch. By default it equals max_batch_size * max_beam_width. Set this value as close as possible to the actual number of tokens in your workload. Note that this argument might be removed in the future.
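
The default described above is simple arithmetic. The sketch below reproduces it, using the documented defaults for max_batch_size (2048) and max_beam_width (1); the function name is hypothetical and is not part of TensorRT-LLM.

```python
def default_opt_num_tokens(max_batch_size=2048, max_beam_width=1):
    """Documented default: max_batch_size * max_beam_width."""
    return max_batch_size * max_beam_width


print(default_opt_num_tokens())       # 2048
print(default_opt_num_tokens(64, 4))  # 256
```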


--max_encoder_input_len

Maximum encoder input length for encoder-decoder models. Set max_input_len to 1 to start generation from decoder_start_token_id of length 1.

Default: 1024

--max_prompt_embedding_table_size, --max_multimodal_len

Maximum prompt embedding table size for prompt tuning, or maximum multimodal input size for multimodal models. Setting a value > 0 enables prompt tuning or multimodal input.

Default: 0


--kv_cache_type

Set the KV cache type (continuous, paged, or disabled). When set to disabled, the KV cache is turned off and only the context phase is allowed.


--paged_kv_cache

Deprecated. Enabling this option is equivalent to --kv_cache_type paged for transformer-based models.
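
When migrating an older build script, the deprecated option can be mapped to its documented replacement. This is a minimal sketch; `kv_cache_args` and its boolean parameter `legacy_paged_kv_cache` are hypothetical names standing in for however your script records the old setting.

```python
VALID_KV_CACHE_TYPES = ("continuous", "paged", "disabled")


def kv_cache_args(kv_cache_type=None, legacy_paged_kv_cache=False):
    """Return the --kv_cache_type arguments for a build command."""
    if kv_cache_type is None and legacy_paged_kv_cache:
        # Documented equivalence, stated for transformer-based models only.
        kv_cache_type = "paged"
    if kv_cache_type is None:
        return []  # pass nothing and let trtllm-build decide
    if kv_cache_type not in VALID_KV_CACHE_TYPES:
        raise ValueError(f"unknown KV cache type: {kv_cache_type!r}")
    return ["--kv_cache_type", kv_cache_type]


print(kv_cache_args(legacy_paged_kv_cache=True))
```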


--input_timing_cache

The file path to read the timing cache. This option is ignored if the file does not exist.


--output_timing_cache

The file path to write the timing cache.

Default: 'model.cache'


--builder_opt

Possible choices: 0, 1, 2, 3, 4, 5

TensorRT builder optimization level.


--profiling_verbosity

Possible choices: layer_names_only, detailed, none

The profiling verbosity for the generated TensorRT engine. Setting to detailed allows inspecting tactic choices and kernel parameters.

Default: 'layer_names_only'


--builder_force_num_profiles

If specified, forces the builder to use this number of profiles.


--strip_plan

Enable stripping weights from the final TensorRT engine under the assumption that the refit weights are identical to those provided at build time.

Default: False


--weight_sparsity

Enable weight sparsity.

Default: False


--weight_streaming

Enable offloading weights to CPU and streaming them in at runtime.

Default: False


--fast_build

Enable features for faster engine building. This may cause some performance degradation and is currently incompatible with int8/int4 quantization.

Default: False


--workers

The number of workers for building in parallel.

Default: 1


--log_level

Possible choices: internal_error, error, warning, info, verbose, debug

The logging level.

Default: 'info'


--enable_debug_output

Enable debug output.

Default: False


--visualize_network

Export TensorRT networks to ONNX prior to the engine build, for debugging.

Default: False


--dry_run

Run through the build process, except for the actual engine build, for debugging.

Default: False

Logits arguments


--logits_dtype

Possible choices: float16, float32

The data type of logits.


--gather_context_logits

Enable gathering context logits.

Default: False


--gather_generation_logits

Enable gathering generation logits.

Default: False


--gather_all_token_logits

Enable both gather_context_logits and gather_generation_logits.

Default: False
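
The relationship between the three logits flags can be expressed in a few lines. This is only an illustration of the documented behavior; `resolve_logit_flags` is a hypothetical helper, not a TensorRT-LLM function.

```python
def resolve_logit_flags(gather_context_logits=False,
                        gather_generation_logits=False,
                        gather_all_token_logits=False):
    """Return the effective (context, generation) gathering settings."""
    if gather_all_token_logits:
        # Documented: this flag enables both of the other two.
        return True, True
    return gather_context_logits, gather_generation_logits


print(resolve_logit_flags(gather_all_token_logits=True))  # (True, True)
print(resolve_logit_flags(gather_context_logits=True))    # (True, False)
```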

LoRA arguments


--lora_dir

The directory of LoRA weights. If multiple directories are provided, the first one is used for configuration.


--lora_ckpt_source

Possible choices: hf, nemo

The source type of LoRA checkpoint.

Default: 'hf'


--lora_target_modules

Possible choices: attn_qkv, attn_q, attn_k, attn_v, attn_dense, mlp_h_to_4h, mlp_4h_to_h, mlp_gate, cross_attn_qkv, cross_attn_q, cross_attn_k, cross_attn_v, cross_attn_dense, moe_h_to_4h, moe_4h_to_h, moe_gate, moe_router, mlp_router

The names of the target modules to which LoRA is applied. Only effective when lora_plugin is enabled.


--max_lora_rank

Maximum LoRA rank for different LoRA modules. It is used to compute the workspace size of the LoRA plugin.

Default: 64
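
Because the target modules only take effect when lora_plugin is enabled, and lora_plugin defaults to 'disable', it helps to set the LoRA flags as a group. The sketch below builds such a group and rejects the ineffective combination; `lora_args` and the directory `./lora_a` are assumptions for illustration.

```python
def lora_args(lora_dirs, target_modules, lora_plugin="auto",
              max_lora_rank=64, ckpt_source="hf"):
    """Build the LoRA-related part of a trtllm-build command line."""
    if lora_plugin == "disable" and target_modules:
        raise ValueError(
            "--lora_target_modules only takes effect when lora_plugin is enabled"
        )
    args = [
        "--lora_plugin", lora_plugin,
        "--lora_dir", *lora_dirs,  # the first directory is used for configuration
        "--lora_ckpt_source", ckpt_source,
        "--max_lora_rank", str(max_lora_rank),
    ]
    if target_modules:
        args += ["--lora_target_modules", *target_modules]
    return args


print(" ".join(lora_args(["./lora_a"], ["attn_q", "attn_v"])))
```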

Speculative decoding arguments


--speculative_decoding_mode

Possible choices: draft_tokens_external, lookahead_decoding, medusa, explicit_draft_tokens

Mode of speculative decoding.


--max_draft_len

Maximum length of draft tokens for the speculative decoding target model.

Default: 0

Auto parallel arguments


--auto_parallel

MPI world size for auto parallel.

Default: 1


--gpus_per_node

Number of GPUs each node has in a multi-node setup. This is a cluster spec and can be greater or smaller than the world size. This option is only used for auto parallel specified with --auto_parallel.

Default: 8


--cluster_key

Possible choices: A100-SXM-80GB, A100-SXM-40GB, A100-PCIe-80GB, A100-PCIe-40GB, H100-SXM, H100-PCIe, H20, V100-PCIe-16GB, V100-PCIe-32GB, V100-SXM-16GB, V100-SXM-32GB, V100S-PCIe, A40, A30, A10, A10G, L40S, L40, L20, L4, L2

Unique name for target GPU type. Inferred from current GPU type if not specified. This option is only used for auto parallel specified with --auto_parallel.

Plugin config arguments


--bert_attention_plugin

Possible choices: auto, float16, float32, bfloat16, int32, disable

Whether to enable/disable bert_attention_plugin and the dtype.

Default: 'auto'


--gpt_attention_plugin

Possible choices: auto, float16, float32, bfloat16, int32, disable

Whether to enable/disable gpt_attention_plugin and the dtype.

Default: 'auto'


--gemm_plugin

Possible choices: auto, float16, float32, bfloat16, int32, fp8, disable

Whether to enable/disable gemm_plugin and the dtype.

Default: 'disable'


--gemm_swiglu_plugin

Possible choices: fp8, disable

Whether to enable/disable gemm_swiglu_plugin and the dtype.

Default: 'disable'


--fp8_rowwise_gemm_plugin

Possible choices: auto, float16, float32, bfloat16, int32, disable

Whether to enable/disable fp8_rowwise_gemm_plugin and the dtype.

Default: 'disable'


--nccl_plugin

Possible choices: auto, float16, float32, bfloat16, int32, disable

Whether to enable/disable nccl_plugin and the dtype.

Default: 'auto'


--lookup_plugin

Possible choices: auto, float16, float32, bfloat16, int32, disable

Whether to enable/disable lookup_plugin and the dtype.

Default: 'disable'


--lora_plugin

Possible choices: auto, float16, float32, bfloat16, int32, disable

Whether to enable/disable lora_plugin and the dtype.

Default: 'disable'


--moe_plugin

Possible choices: auto, float16, float32, bfloat16, int32, disable

Whether to enable/disable moe_plugin and the dtype.

Default: 'auto'


--mamba_conv1d_plugin

Possible choices: auto, float16, float32, bfloat16, int32, disable

Whether to enable/disable mamba_conv1d_plugin and the dtype.

Default: 'auto'


--low_latency_gemm_plugin

Possible choices: fp8, disable

Whether to enable/disable low_latency_gemm_plugin and the dtype.

Default: 'disable'


--context_fmha

Possible choices: enable, disable

Whether to enable/disable context_fmha.

Default: 'enable'


--bert_context_fmha_fp32_acc

Possible choices: enable, disable

Whether to enable/disable bert_context_fmha_fp32_acc.

Default: 'disable'


--remove_input_padding

Possible choices: enable, disable

Whether to enable/disable remove_input_padding.

Default: 'enable'


--reduce_fusion

Possible choices: enable, disable

Whether to enable/disable reduce_fusion.

Default: 'disable'


--enable_xqa

Possible choices: enable, disable

Whether to enable/disable enable_xqa.

Default: 'enable'



--tokens_per_block

Number of tokens in each paged KV cache block.

Default: 64


--use_paged_context_fmha

Possible choices: enable, disable

Whether to enable/disable use_paged_context_fmha.

Default: 'disable'


--use_fp8_context_fmha

Possible choices: enable, disable

Whether to enable/disable use_fp8_context_fmha.

Default: 'disable'


--multiple_profiles

Possible choices: enable, disable

Whether to enable/disable multiple_profiles.

Default: 'disable'


--paged_state

Possible choices: enable, disable

Whether to enable/disable paged_state.

Default: 'enable'


--streamingllm

Possible choices: enable, disable

Whether to enable/disable streamingllm.

Default: 'disable'


--use_fused_mlp

Possible choices: enable, disable

Whether to enable/disable use_fused_mlp.

Default: 'enable'
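
The plugin config options share two shapes: a dtype-or-disable choice and an enable/disable toggle. The sketch below models a handful of them with Python's argparse to show how the documented choices and defaults behave. It is an illustrative stand-in with a hypothetical program name, not the actual trtllm-build parser.

```python
import argparse

DTYPE_CHOICES = ["auto", "float16", "float32", "bfloat16", "int32", "disable"]
TOGGLE_CHOICES = ["enable", "disable"]

# "plugin-demo" is a made-up program name for this illustration.
parser = argparse.ArgumentParser(prog="plugin-demo")
parser.add_argument("--gpt_attention_plugin", choices=DTYPE_CHOICES, default="auto")
parser.add_argument(
    "--gemm_plugin",
    choices=DTYPE_CHOICES[:-1] + ["fp8", "disable"],  # gemm_plugin also accepts fp8
    default="disable",
)
parser.add_argument("--context_fmha", choices=TOGGLE_CHOICES, default="enable")
parser.add_argument("--use_paged_context_fmha", choices=TOGGLE_CHOICES, default="disable")
parser.add_argument("--tokens_per_block", type=int, default=64)

# Override one option; the others fall back to their documented defaults.
args = parser.parse_args(["--gemm_plugin", "float16"])
print(args.gpt_attention_plugin, args.gemm_plugin,
      args.context_fmha, args.use_paged_context_fmha, args.tokens_per_block)
```

A value outside the listed choices (for example `--context_fmha on`) makes argparse exit with a usage error, which mirrors how a choices-restricted command line rejects unsupported values.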