Plugin#
- pydantic model tensorrt_llm.plugin.PluginConfig[source]#
Bases:
BaseModel
The config that manages plugin-related options.
There are two option categories:
- Plugin options (typically with xxx_plugin naming). These options can be assigned with:
"float16"/"bfloat16"/"float32"/"int32", which means the plugin is enabled with the specified precision (note that some plugins only support a limited set of dtypes, e.g., gemm_swiglu_plugin and low_latency_gemm_swiglu_plugin currently support only fp8);
"auto", which means the plugin is enabled with the precision of the dtype field (the dtype field must match the model dtype, i.e., the one in PretrainedConfig);
None, which means the plugin is disabled.
- Other features. These options can be assigned a boolean:
True, which means the feature is enabled;
False, which means the feature is disabled.
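For illustration, here is a minimal sketch of both option categories, assuming PluginConfig is importable from tensorrt_llm.plugin as shown in the class path above; the field names and values are taken from the Fields section below.

```python
# A minimal sketch of both option categories; field names as documented on this page.
from tensorrt_llm.plugin import PluginConfig

config = PluginConfig(dtype="float16")

# Plugin options accept a precision string, "auto", or None.
config.gemm_plugin = "float16"        # enable with an explicit precision
config.gpt_attention_plugin = "auto"  # follow the `dtype` field above
config.lora_plugin = None             # disable this plugin

# Other features accept booleans.
config.context_fmha = True
config.remove_input_padding = True
```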
- Config:
validate_assignment: bool = True
extra: str = ignore
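Because validate_assignment is set, assignments to an existing instance are validated as well. The sketch below assumes pydantic raises a ValidationError for values outside a field's allowed literals (per the Fields section below, gemm_swiglu_plugin only accepts "fp8" or None).

```python
from pydantic import ValidationError

from tensorrt_llm.plugin import PluginConfig

config = PluginConfig()
try:
    # Only "fp8" or None are allowed for this field, so this assignment
    # is expected to fail validation.
    config.gemm_swiglu_plugin = "float16"
except ValidationError as err:
    print(err)
```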
- Fields:
bert_attention_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
bert_context_fmha_fp32_acc (bool)
context_fmha (bool)
dora_plugin (bool)
dtype (str)
fp8_rowwise_gemm_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
fuse_fp4_quant (bool)
gemm_allreduce_plugin (Literal['float16', 'bfloat16', None] | None)
gemm_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', 'fp8', 'nvfp4', None] | None)
gemm_swiglu_plugin (Literal['fp8', None] | None)
gpt_attention_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
identity_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
layernorm_quantization_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
lora_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
low_latency_gemm_plugin (Literal['fp8', None] | None)
low_latency_gemm_swiglu_plugin (Literal['fp8', None] | None)
mamba_conv1d_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
manage_weights (bool)
moe_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
multiple_profiles (bool)
nccl_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
norm_quant_fusion (bool)
paged_kv_cache (bool | None)
paged_state (bool)
pp_reduce_scatter (bool)
qserve_gemm_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
quantize_per_token_plugin (bool)
quantize_tensor_plugin (bool)
reduce_fusion (bool)
remove_input_padding (bool)
rmsnorm_quantization_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
smooth_quant_gemm_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
smooth_quant_plugins (bool)
streamingllm (bool)
tokens_per_block (int)
use_fp8_context_fmha (bool)
use_fused_mlp (bool)
use_paged_context_fmha (bool)
user_buffer (bool)
weight_only_groupwise_quant_matmul_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
weight_only_quant_matmul_plugin (Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None)
- Validators:
convert_enable_disable
» all fields
log_field_changes
» all fields
validate_dtype_not_auto
» dtype
- field bert_attention_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = 'auto'#
The plugin that uses efficient kernels and enables an in-place update of the KV cache for attention layer of BERT-like encoder models.
- field bert_context_fmha_fp32_acc: bool = False#
Enable the FP32 accumulator for context FMHA in the bert_attention_plugin. If disabled, FP16 is used, which gives better performance but potentially lower accuracy.
- field context_fmha: bool = True#
Enable fused multi-head attention during the context phase, which performs the MHA/MQA/GQA block in a single kernel.
- field dora_plugin: bool = False#
Enable DoRA.
- field dtype: str = 'float16'#
Base dtype for the model and plugins.
- field fp8_rowwise_gemm_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = None#
The quantized GEMM for FP8, which uses per-token dynamic scales for activations and per-channel static scales for weights. Note: it also requires the matching calibration in the checkpoint.
- field fuse_fp4_quant: bool = False#
Whether to fuse FP4 quantization into the attention kernel.
- field gemm_allreduce_plugin: Literal['float16', 'bfloat16', None] | None = None#
The GEMM + AllReduce kernel fusion plugin.
- field gemm_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', 'fp8', 'nvfp4', None] | None = None#
The GEMM plugin that utilizes NVIDIA cuBLASLt to perform GEMM operations. Note: it is only effective for non-quantized GEMM operations (except FP8); for FP8, it also requires the matching calibration in the checkpoint.
- field gemm_swiglu_plugin: Literal['fp8', None] | None = None#
The GEMM + SwiGLU fusion in Gated-MLP combines two Matmul operations and one SwiGLU operation into a single kernel. Currently this is only supported for FP8 precision on Hopper.
- field gpt_attention_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = 'auto'#
The plugin that uses efficient kernels and enables an in-place update of the KV cache for attention layer of GPT-like decoder models.
- field identity_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = None#
The identity plugin simply copies inputs to outputs; it is mostly used for debugging purposes.
- field layernorm_quantization_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = None#
Enable plugin that supports layernorm quantization kernels.
- field lora_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = None#
Enable LoRA.
- field low_latency_gemm_plugin: Literal['fp8', None] | None = None#
The GEMM plugin that is specially optimized for low-latency scenarios.
- field low_latency_gemm_swiglu_plugin: Literal['fp8', None] | None = None#
The GEMM + SwiGLU fusion plugin that is specially optimized for low-latency scenarios.
- field mamba_conv1d_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = 'auto'#
Enable customized kernels to speed up the conv1d operator for Mamba.
- field manage_weights: bool = False#
Enable TensorRT LLM managed weights to speed up the engine building process.
- field moe_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = 'auto'#
Enable some customized kernels to speed up the MoE layer of MoE models.
- field multiple_profiles: bool = False#
Enable multiple TensorRT optimization profiles in the built engines. This benefits performance, especially when the GEMM plugin is disabled, because more optimization profiles give TensorRT more chances to select better kernels. Note: this feature increases engine build time, but no other adverse effects are expected.
- field nccl_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = 'auto'#
The NCCL plugin wraps NCCL operators to support multi-GPU and even multi-node execution.
- field norm_quant_fusion: bool = False#
Fuse the LayerNorm and quantization kernels into a single kernel, resulting in improved end-to-end performance.
- field paged_kv_cache: bool | None = None#
Enable paged KV cache, which helps manage memory for the KV cache more efficiently and usually allows a larger batch size and improved efficiency.
- field paged_state: bool = True#
Enable paged state, which helps manage memory for the RNN state more efficiently.
- field pp_reduce_scatter: bool = False#
Enable a pipeline parallelism optimization with ReduceScatter + AllGather targeting large MoE models.
- field qserve_gemm_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = None#
The quantized GEMM from [QServe](https://arxiv.org/abs/2405.04532), which employs 4-bit quantization for weights and 8-bit quantization for activations.
- field quantize_per_token_plugin: bool = False#
Enable plugin that supports per-token quantization.
- field quantize_tensor_plugin: bool = False#
Enable plugin that supports per-tensor quantization.
- field reduce_fusion: bool = False#
Fuse the ResidualAdd and LayerNorm kernels after AllReduce into a single kernel, resulting in improved end-to-end performance.
- field remove_input_padding: bool = True#
Pack tokens from different sequences together without padding, which reduces both computation and memory consumption.
- field rmsnorm_quantization_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = None#
Enable plugin that supports rmsnorm quantization kernels.
- field smooth_quant_gemm_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = None#
Enable plugin that supports smooth quantization gemm kernels.
- field smooth_quant_plugins: bool = True#
Enable a group of plugins to support smooth quantization.
- field streamingllm: bool = False#
Enable [StreamingLLM](https://arxiv.org/abs/2309.17453), which uses window attention to perform efficient and stable LLM inference on long texts.
- field tokens_per_block: int = 32#
Define how many tokens are contained in each paged KV cache block.
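As an illustrative sketch (not a tuning recommendation), the paged_kv_cache option documented above and this block size are typically configured together; the block size of 64 used below is an assumption and must be one the runtime supports.

```python
from tensorrt_llm.plugin import PluginConfig

# Illustrative values only: enable paged KV cache with a non-default block size.
config = PluginConfig(paged_kv_cache=True, tokens_per_block=64)
```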
- field use_fp8_context_fmha: bool = True#
When FP8 quantization is activated, attention can be further accelerated by enabling FP8 context FMHA.
- field use_fused_mlp: bool = True#
Enable horizontal fusion in Gated-MLP that combines two Matmul operations into a single one followed by a separate SwiGLU kernel.
- field use_paged_context_fmha: bool = True#
Allow advanced features like KV cache reuse and chunked context.
- field user_buffer: bool = False#
Eliminate extra copies from the local buffer to the shared buffer in the communication kernel, leading to improved end-to-end performance. This feature must be enabled together with `--reduce_fusion enable` and is currently only supported for the FP8 LLAMA model.
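Since user_buffer requires reduce_fusion, here is a sketch of enabling both (still subject to the FP8 LLAMA restriction noted above).

```python
from tensorrt_llm.plugin import PluginConfig

# user_buffer is documented to require reduce_fusion, so enable both together.
config = PluginConfig(reduce_fusion=True, user_buffer=True)
```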
- field weight_only_groupwise_quant_matmul_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = None#
Enable weight-only groupwise quantization matmul operators.
- field weight_only_quant_matmul_plugin: Literal['auto', 'float16', 'float32', 'bfloat16', 'int32', None] | None = None#
Enable weight-only quantization matmul operators.
- validator convert_enable_disable » all fields[source]#
Allow passing enable/disable strings which map to boolean/None values.
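A hedged sketch of the enable/disable string aliases handled by this validator; the exact mapping is an assumption based on the description above (for boolean features, "enable" is expected to become True and "disable" False).

```python
from tensorrt_llm.plugin import PluginConfig

# Assumed mapping for boolean features: "enable" -> True, "disable" -> False.
config = PluginConfig(
    context_fmha="enable",        # expected to become True
    multiple_profiles="disable",  # expected to become False
)
```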
- model_post_init(context: Any, /) → None#
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- to_legacy_setting()[source]#
Legacy setting means that all of the plugins and features are disabled; this is needed for the legacy build.py script, which will be migrated to the centralized build script tensorrt_llm/commands/build.py.
After the migration is done, this function may or may not be deleted.
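A brief usage sketch; it is assumed here that the method updates the config in place rather than returning a new object.

```python
from tensorrt_llm.plugin import PluginConfig

config = PluginConfig()
# Switch to the legacy setting: all plugins and features disabled (assumed in-place update).
config.to_legacy_setting()
```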
- property context_fmha_type#