trtllm-build
usage: trtllm-build [-h] [--checkpoint_dir CHECKPOINT_DIR] [--model_config MODEL_CONFIG] [--build_config BUILD_CONFIG]
[--model_cls_file MODEL_CLS_FILE] [--model_cls_name MODEL_CLS_NAME] [--output_dir OUTPUT_DIR]
[--max_batch_size MAX_BATCH_SIZE] [--max_input_len MAX_INPUT_LEN] [--max_seq_len MAX_SEQ_LEN]
[--max_beam_width MAX_BEAM_WIDTH] [--max_num_tokens MAX_NUM_TOKENS] [--opt_num_tokens OPT_NUM_TOKENS]
[--max_encoder_input_len MAX_ENCODER_INPUT_LEN] [--max_prompt_embedding_table_size MAX_PROMPT_EMBEDDING_TABLE_SIZE]
[--kv_cache_type KV_CACHE_TYPE] [--paged_kv_cache PAGED_KV_CACHE] [--input_timing_cache INPUT_TIMING_CACHE]
[--output_timing_cache OUTPUT_TIMING_CACHE] [--profiling_verbosity {layer_names_only,detailed,none}] [--strip_plan]
[--weight_sparsity] [--weight_streaming] [--fast_build] [--workers WORKERS]
[--log_level {internal_error,error,warning,info,verbose,debug}] [--enable_debug_output] [--visualize_network] [--dry_run]
[--monitor_memory] [--logits_dtype {float16,float32}] [--gather_context_logits] [--gather_generation_logits]
[--gather_all_token_logits] [--lora_dir LORA_DIR [LORA_DIR ...]] [--lora_ckpt_source {hf,nemo}]
[--lora_target_modules {attn_qkv,attn_q,attn_k,attn_v,attn_dense,mlp_h_to_4h,mlp_4h_to_h,mlp_gate,cross_attn_qkv,cross_attn_q,cross_attn_k,cross_attn_v,cross_attn_dense,moe_h_to_4h,moe_4h_to_h,moe_gate,moe_router,mlp_router} [{attn_qkv,attn_q,attn_k,attn_v,attn_dense,mlp_h_to_4h,mlp_4h_to_h,mlp_gate,cross_attn_qkv,cross_attn_q,cross_attn_k,cross_attn_v,cross_attn_dense,moe_h_to_4h,moe_4h_to_h,moe_gate,moe_router,mlp_router} ...]]
[--max_lora_rank MAX_LORA_RANK]
[--speculative_decoding_mode {draft_tokens_external,lookahead_decoding,medusa,explicit_draft_tokens,eagle}]
[--max_draft_len MAX_DRAFT_LEN] [--auto_parallel AUTO_PARALLEL] [--gpus_per_node GPUS_PER_NODE]
[--cluster_key {A100-SXM-80GB,A100-SXM-40GB,A100-PCIe-80GB,A100-PCIe-40GB,H100-SXM,H100-PCIe,H20,V100-PCIe-16GB,V100-PCIe-32GB,V100-SXM-16GB,V100-SXM-32GB,V100S-PCIe,A40,A30,A10,A10G,L40S,L40,L20,L4,L2}]
[--bert_attention_plugin {auto,float16,float32,bfloat16,int32,disable}]
[--gpt_attention_plugin {auto,float16,float32,bfloat16,int32,disable}]
[--gemm_plugin {auto,float16,float32,bfloat16,int32,fp8,disable}] [--gemm_swiglu_plugin {fp8,disable}]
[--fp8_rowwise_gemm_plugin {auto,float16,float32,bfloat16,int32,disable}]
[--nccl_plugin {auto,float16,float32,bfloat16,int32,disable}] [--lora_plugin {auto,float16,float32,bfloat16,int32,disable}]
[--moe_plugin {auto,float16,float32,bfloat16,int32,disable}]
[--mamba_conv1d_plugin {auto,float16,float32,bfloat16,int32,disable}] [--low_latency_gemm_plugin {fp8,disable}]
[--low_latency_gemm_swiglu_plugin {fp8,disable}] [--context_fmha {enable,disable}]
[--bert_context_fmha_fp32_acc {enable,disable}] [--remove_input_padding {enable,disable}] [--reduce_fusion {enable,disable}]
[--user_buffer {enable,disable}] [--tokens_per_block TOKENS_PER_BLOCK] [--use_paged_context_fmha {enable,disable}]
[--use_fp8_context_fmha {enable,disable}] [--multiple_profiles {enable,disable}] [--paged_state {enable,disable}]
[--streamingllm {enable,disable}] [--use_fused_mlp {enable,disable}] [--pp_reduce_scatter {enable,disable}]
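For orientation, a minimal invocation might look like the following sketch; the checkpoint and output paths are placeholders rather than paths shipped with the tool:

trtllm-build --checkpoint_dir ./tllm_checkpoint \
             --output_dir ./engine_output \
             --gemm_plugin auto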
Named Arguments
- --checkpoint_dir
The directory path that contains the TensorRT-LLM checkpoint.
- --model_config
The file path that saves the TensorRT-LLM checkpoint config.
- --build_config
The file path that saves the TensorRT-LLM build config.
- --model_cls_file
The file path that defines a customized TensorRT-LLM model.
- --model_cls_name
The customized TensorRT-LLM model class name.
- --output_dir
The directory path to save the serialized engine files and engine config file.
Default:
'engine_outputs'
- --max_batch_size
Maximum number of requests that the engine can schedule.
Default:
2048
- --max_input_len
Maximum input length of one request.
Default:
1024
- --max_seq_len, --max_decoder_seq_len
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.
- --max_beam_width
Maximum number of beams for beam search decoding.
Default:
1
- --max_num_tokens
Maximum number of batched input tokens after padding is removed in each batch. Currently, input padding is removed by default; you may explicitly disable it by specifying --remove_input_padding disable. See the example invocation at the end of this section.
Default:
8192
- --opt_num_tokens
Optimal number of batched input tokens after padding is removed in each batch. It equals max_batch_size * max_beam_width by default; set this value as close as possible to the actual number of tokens in your workload. Note that this argument might be removed in the future.
- --max_encoder_input_len
Maximum encoder input length for enc-dec models. Set max_input_len to 1 to start generation from a decoder_start_token_id of length 1.
Default:
1024
- --max_prompt_embedding_table_size, --max_multimodal_len
Maximum prompt embedding table size for prompt tuning, or maximum multimodal input size for multimodal models. Setting a value > 0 enables prompt tuning or multimodal input.
Default:
0
- --kv_cache_type
Set the KV cache type (continuous, paged, or disabled). If disabled, the KV cache is turned off and only the context phase is allowed.
- --paged_kv_cache
Deprecated. Enabling this option is equivalent to --kv_cache_type paged for transformer-based models.
- --input_timing_cache
The file path to read the timing cache. This option is ignored if the file does not exist.
- --output_timing_cache
The file path to write the timing cache.
Default:
'model.cache'
- --profiling_verbosity
Possible choices: layer_names_only, detailed, none
The profiling verbosity for the generated TensorRT engine. Setting it to detailed allows inspecting tactic choices and kernel parameters.
Default:
'layer_names_only'
- --strip_plan
Enable stripping weights from the final TensorRT engine under the assumption that the refit weights are identical to those provided at build time.
Default:
False
- --weight_sparsity
Enable weight sparsity.
Default:
False
- --weight_streaming
Enable offloading weights to CPU and streaming them in at runtime.
Default:
False
- --fast_build
Enable features for faster engine building. This may cause some performance degradation and is currently incompatible with int8/int4 quantization without plugin.
Default:
False
- --workers
The number of workers for building in parallel.
Default:
1
- --log_level
Possible choices: internal_error, error, warning, info, verbose, debug
The logging level.
Default:
'info'
- --enable_debug_output
Enable debug output.
Default:
False
- --visualize_network
Export the TensorRT network to ONNX prior to the engine build for debugging.
Default:
False
- --dry_run
Run through the build process, skipping the actual engine build, for debugging.
Default:
False
- --monitor_memory
Enable the memory monitor during the engine build.
Default:
False
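As referenced in the entry for --max_num_tokens, here is a sketch of a build that sets the scheduling and sequence-length limits explicitly; the paths and the limit values are illustrative only:

trtllm-build --checkpoint_dir ./tllm_checkpoint \
             --output_dir ./engine_output \
             --max_batch_size 64 \
             --max_input_len 2048 \
             --max_seq_len 4096 \
             --max_num_tokens 8192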
Logits arguments
- --logits_dtype
Possible choices: float16, float32
The data type of logits.
- --gather_context_logits
Enable gathering context logits.
Default:
False
- --gather_generation_logits
Enable gathering generation logits.
Default:
False
- --gather_all_token_logits
Enable both gather_context_logits and gather_generation_logits. See the example invocation at the end of this section.
Default:
False
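As referenced in the entry for --gather_all_token_logits, a sketch of a build that gathers both context and generation logits; the paths are placeholders:

trtllm-build --checkpoint_dir ./tllm_checkpoint \
             --output_dir ./engine_output \
             --gather_all_token_logits \
             --logits_dtype float32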
LoRA arguments
- --lora_dir
The directory of LoRA weights. If multiple directories are provided, the first one is used for configuration.
- --lora_ckpt_source
Possible choices: hf, nemo
The source type of LoRA checkpoint.
Default:
'hf'
- --lora_target_modules
Possible choices: attn_qkv, attn_q, attn_k, attn_v, attn_dense, mlp_h_to_4h, mlp_4h_to_h, mlp_gate, cross_attn_qkv, cross_attn_q, cross_attn_k, cross_attn_v, cross_attn_dense, moe_h_to_4h, moe_4h_to_h, moe_gate, moe_router, mlp_router
The target module names to which LoRA is applied. Only effective when lora_plugin is enabled.
- --max_lora_rank
Maximum LoRA rank for different LoRA modules. It is used to compute the workspace size of the LoRA plugin. See the example invocation at the end of this section.
Default:
64
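As referenced in the entry for --max_lora_rank, a sketch of a LoRA-enabled build; the checkpoint and LoRA directories are placeholders and the target modules are shown only as an illustration:

trtllm-build --checkpoint_dir ./tllm_checkpoint \
             --output_dir ./engine_output \
             --lora_plugin auto \
             --lora_dir ./lora_weights \
             --lora_ckpt_source hf \
             --lora_target_modules attn_q attn_k attn_v \
             --max_lora_rank 64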
Speculative decoding arguments
- --speculative_decoding_mode
Possible choices: draft_tokens_external, lookahead_decoding, medusa, explicit_draft_tokens, eagle
Mode of speculative decoding.
- --max_draft_len
Maximum length of draft tokens for the speculative decoding target model. See the example invocation at the end of this section.
Default:
0
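As referenced in the entry for --max_draft_len, a sketch of a build for Medusa-style speculative decoding; the checkpoint path is a placeholder, the draft length is illustrative, and the checkpoint is assumed to already contain the Medusa heads:

trtllm-build --checkpoint_dir ./medusa_checkpoint \
             --output_dir ./engine_output \
             --speculative_decoding_mode medusa \
             --max_draft_len 63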
Auto parallel arguments
- --auto_parallel
MPI world size for auto parallel.
Default:
1
- --gpus_per_node
Number of GPUs per node in a multi-node setup. This is a cluster spec and can be greater or smaller than the world size. This option is only used for auto parallel, specified with --auto_parallel.
Default:
8
- --cluster_key
Possible choices: A100-SXM-80GB, A100-SXM-40GB, A100-PCIe-80GB, A100-PCIe-40GB, H100-SXM, H100-PCIe, H20, V100-PCIe-16GB, V100-PCIe-32GB, V100-SXM-16GB, V100-SXM-32GB, V100S-PCIe, A40, A30, A10, A10G, L40S, L40, L20, L4, L2
Unique name for the target GPU type. Inferred from the current GPU type if not specified. This option is only used for auto parallel, specified with --auto_parallel. See the example invocation at the end of this section.
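As referenced in the entry for --cluster_key, a sketch of a build that lets auto parallel shard the model across two GPUs; the paths are placeholders and the cluster key is illustrative:

trtllm-build --checkpoint_dir ./tllm_checkpoint \
             --output_dir ./engine_output \
             --auto_parallel 2 \
             --gpus_per_node 8 \
             --cluster_key H100-SXM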
Plugin config arguments
- --bert_attention_plugin
Possible choices: auto, float16, float32, bfloat16, int32, disable
The plugin that uses efficient kernels and enables an in-place update of the KV cache for the attention layer of BERT-like encoder models.
Default:
'auto'
- --gpt_attention_plugin
Possible choices: auto, float16, float32, bfloat16, int32, disable
The plugin that uses efficient kernels and enables an in-place update of the KV cache for the attention layer of GPT-like decoder models.
Default:
'auto'
- --gemm_plugin
Possible choices: auto, float16, float32, bfloat16, int32, fp8, disable
The GEMM plugin that utilizes NVIDIA cuBLASLt to perform GEMM operations. Note: it is only effective for non-quantized GEMM operations (except FP8). For FP8, it also requires the same calibration in the checkpoint.
Default:
'disable'
- --gemm_swiglu_plugin
Possible choices: fp8, disable
The GEMM + SwiGLU fusion in Gated-MLP combines two Matmul operations and one SwiGLU operation into a single kernel. Currently this is only supported for FP8 precision on Hopper.
Default:
'disable'
- --fp8_rowwise_gemm_plugin
Possible choices: auto, float16, float32, bfloat16, int32, disable
The quantized GEMM for FP8, which uses per-token dynamic scales for activations and per-channel static scales for weights. Note: it also requires the same calibration in the checkpoint.
Default:
'disable'
- --nccl_plugin
Possible choices: auto, float16, float32, bfloat16, int32, disable
The NCCL plugin wraps NCCL operators to support multi-GPU and multi-node execution.
Default:
'auto'
- --lora_plugin
Possible choices: auto, float16, float32, bfloat16, int32, disable
Enable LoRA.
Default:
'disable'
- --moe_plugin
Possible choices: auto, float16, float32, bfloat16, int32, disable
Enable some customized kernels to speed up the MoE layer of MoE models.
Default:
'auto'
- --mamba_conv1d_plugin
Possible choices: auto, float16, float32, bfloat16, int32, disable
Enable customized kernels to speed up conv1d operator for Mamba.
Default:
'auto'
- --low_latency_gemm_plugin
Possible choices: fp8, disable
The GEMM plugin that is specially optimized for low-latency scenarios.
Default:
'disable'
- --low_latency_gemm_swiglu_plugin
Possible choices: fp8, disable
The GEMM + SwiGLU fusion plugin that is specially optimized for low-latency scenarios.
Default:
'disable'
- --context_fmha
Possible choices: enable, disable
Enable fused multi-head attention during the context phase, which triggers a kernel that performs the MHA/MQA/GQA block as a single kernel.
Default:
'enable'
- --bert_context_fmha_fp32_acc
Possible choices: enable, disable
Enable the FP32 accumulator for context FMHA in the bert_attention_plugin. If disabled, an FP16 accumulator is used, which gives better performance but potentially worse accuracy.
Default:
'disable'
- --remove_input_padding
Possible choices: enable, disable
Pack tokens from different requests together without padding, which reduces both computation and memory consumption.
Default:
'enable'
- --reduce_fusion
Possible choices: enable, disable
Fuse the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel, resulting in improved end-to-end performance.
Default:
'disable'
- --user_buffer
Possible choices: enable, disable
Eliminate extra copies from the local buffer to the shared buffer in the communication kernel, leading to improved end-to-end performance. This feature must be enabled with --reduce_fusion enable and is currently only supported for the FP8 LLAMA model.
Default:
'disable'
- --tokens_per_block
Define how many tokens are contained in each paged KV cache block.
Default:
64
- --use_paged_context_fmha
Possible choices: enable, disable
Allow advanced features like KV cache reuse and chunked context.
Default:
'disable'
- --use_fp8_context_fmha
Possible choices: enable, disable
When FP8 quantization is activated, the attention can be further accelerated by enabling FP8 context FMHA.
Default:
'disable'
- --multiple_profiles
Possible choices: enable, disable
Enable multiple TensorRT optimization profiles in the built engines. This benefits performance, especially when the GEMM plugin is disabled, because more optimization profiles give TensorRT more chances to select better kernels. Note: this feature increases engine build time, but no other adverse effects are expected. See the example invocation at the end of this section.
Default:
'disable'
- --paged_state
Possible choices: enable, disable
Enable paged state, which helps manage memory for the RNN state more efficiently.
Default:
'enable'
- --streamingllm
Possible choices: enable, disable
Enable StreamingLLM (https://arxiv.org/abs/2309.17453), which uses window attention for efficient and stable LLM inference on long texts.
Default:
'disable'
- --use_fused_mlp
Possible choices: enable, disable
Enable horizontal fusion in Gated-MLP that combines two Matmul operations into a single one followed by a separate SwiGLU kernel.
Default:
'enable'
- --pp_reduce_scatter
Possible choices: enable, disable
Enable a pipeline parallelism optimization with ReduceScatter + AllGather targeting large MoE models.
Default:
'disable'
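As referenced in the entry for --multiple_profiles, a sketch of a build that tunes a few plugin options for KV cache reuse and broader kernel selection; the paths are placeholders and the chosen options are illustrative rather than a recommended configuration:

trtllm-build --checkpoint_dir ./tllm_checkpoint \
             --output_dir ./engine_output \
             --kv_cache_type paged \
             --use_paged_context_fmha enable \
             --multiple_profiles enable \
             --tokens_per_block 64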