Pattern Matching Stage
Pattern matching canonicalizes model-specific PyTorch graphs into AutoDeploy’s standard graph representation. These transforms identify attention, MoE, normalization, quantization, activation, and layout patterns before sharding and post-load fusion run.
Match MoE Pattern
Transform key: match_moe_pattern
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.MatchSimpleMoePattern(config: TransformConfig)
Bases:
MatchMoePattern
Match and fuse simple (unquantized) MoE subgraph.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Match Dense MoE Pattern
Transform key: match_dense_moe_pattern
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4.MatchMXFP4MoePattern(config: TransformConfig)
Bases:
BaseTransform
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Match Bmm MoE Pattern
Transform key: match_bmm_moe_pattern
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.MatchBmmMoePattern(config: TransformConfig)
Bases:
BaseTransform
Match and fuse Llama4 MoE pattern with pre-stacked weight tensors.
This pattern uses batch matrix multiply (BMM) operations for parallel expert computation with weights already stacked across the expert dimension.
Only matches patterns where topk uses k=1 (single expert per token).
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.MatchBmmMoePatternConfig
Bases:
TransformConfig
Configuration for MatchBmmMoePattern transform.
JSON schema:
{ "title": "MatchBmmMoePatternConfig", "description": "Configuration for MatchBmmMoePattern transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
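The Stages enum in the JSON schema is described as ordered and used to pre-order transforms. A minimal sketch of what such pre-ordering could look like is shown here; the function `order_by_stage` and the two `hypothetical_*` transform keys are illustrative assumptions, not part of AutoDeploy, and the real scheduling logic may differ.

```python
# Stage names copied from the "Stages" enum in the JSON schema, in listed order.
STAGES = [
    "factory", "export", "post_export", "pattern_matcher", "sharding",
    "weight_load", "post_load_fusion", "cache_init", "visualize", "compile",
]


def order_by_stage(transforms):
    """Sort (transform_key, stage) pairs by stage position.

    sorted() is stable, so transforms in the same stage keep their input order.
    """
    return sorted(transforms, key=lambda item: STAGES.index(item[1]))


# The "hypothetical_*" keys are placeholders invented for this illustration.
configured = [
    ("hypothetical_fusion", "post_load_fusion"),
    ("match_bmm_moe_pattern", "pattern_matcher"),
    ("hypothetical_sharding", "sharding"),
]
ordered_keys = [key for key, _ in order_by_stage(configured)]
print(ordered_keys)
```

Because the sort key is the index in the enum, a pattern-matcher transform always lands before sharding and post-load fusion regardless of where it appears in the config.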
Split MoE Fused For Sharding
Transform key: split_moe_fused_for_sharding
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.SplitMoeFusedForSharding(config: TransformConfig)
Bases:
BaseTransform
Convert torch_moe_fused nodes to list-based torch_moe nodes before sharding.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.SplitMoeFusedForShardingConfig
Bases:
TransformConfig
Configuration for converting torch_moe_fused to torch_moe pre-sharding.
JSON schema:
{ "title": "SplitMoeFusedForShardingConfig", "description": "Configuration for converting torch_moe_fused to torch_moe pre-sharding.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
Match Repeat KV
Transform key: match_repeat_kv
Source module: tensorrt_llm._torch.auto_deploy.transform.library.attention
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchRepeatKV(config: TransformConfig)
Bases:
BaseTransform
Match and replace the repeat_kv pattern with torch.ops.auto_deploy.torch_attention_repeat_kv.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
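For readers unfamiliar with repeat_kv, the sketch below shows what the operation computes in grouped-query attention, using plain Python lists in place of tensors. It follows the behavior of the common Hugging Face `repeat_kv` helper (each KV head is repeated `n_rep` times with copies kept adjacent); the exact graph pattern this transform matches is not shown on this page, so treat this as an illustration only.

```python
def repeat_kv(kv_heads, n_rep):
    """Repeat each KV head n_rep times, keeping the copies adjacent.

    kv_heads stands in for the head dimension of a K or V tensor.
    """
    return [head for head in kv_heads for _ in range(n_rep)]


def kv_head_for_query(query_head, n_rep):
    """Index of the KV head that a given query head attends with."""
    return query_head // n_rep


kv = ["kv0", "kv1"]          # 2 KV heads
expanded = repeat_kv(kv, 3)  # expanded to serve 6 query heads
print(expanded)
```

After expansion, query head `h` reads position `h` of the expanded list, which is the same KV head as `kv[h // n_rep]`. That is why an attention op that understands grouped heads can skip the materialized repeat.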
Match Eager Attention
Transform key: match_eager_attention
Source module: tensorrt_llm._torch.auto_deploy.transform.library.attention
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchEagerAttention(config: TransformConfig)
Bases:
BaseTransform
Match and replace the eager attention pattern with torch.ops.auto_deploy.torch_attention_sdpa.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Match Sdpa To Torch Attention
Transform key: match_sdpa_to_torch_attention
Source module: tensorrt_llm._torch.auto_deploy.transform.library.attention
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchSDPAToTorchAttention(config: TransformConfig)
Bases:
BaseTransform
Match and replace SDPA patterns with torch.ops.auto_deploy.torch_attention.
This handles:
- sdpa -> torch_attention
- repeat_kv + sdpa -> torch_attention
This transform should run BEFORE match_repeat_kv_with_torch_attention to ensure SDPA calls are converted first.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Match Grouped Attention
Transform key: match_grouped_attention
Source module: tensorrt_llm._torch.auto_deploy.transform.library.attention
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchRepeatKVWithTorchAttention(config: TransformConfig)
Bases:
BaseTransform
Match and replace repeat_kv + torch_attention patterns with torch_attention.
This handles:
- repeat_kv + torch_attention -> torch_attention (removes redundant repeat_kv)
- torch_attention -> torch_attention (identity, catches any remaining patterns)
This transform should run AFTER match_sdpa_to_torch_attention to ensure we match the repeat_kv + torch_attention pattern correctly.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Match Attention Layout
Transform key: match_attention_layout
Source module: tensorrt_llm._torch.auto_deploy.transform.library.attention
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchAttentionLayout(config: TransformConfig)
Bases:
BaseTransform
Convert unified torch_attention calls from layout='bnsd' (explicit, positional, or default) into layout='bsnd' with the matching Q/K/V transposes, then transpose the output back to bnsd.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchAttentionLayoutConfig
Bases:
TransformConfig
Configuration for the match attention layout transform.
JSON schema:
{ "title": "MatchAttentionLayoutConfig", "description": "Configuration for the match attention layout transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "attn_layout": { "description": "Layout expected by the attention backend.", "enum": [ "bsnd", "bnsd" ], "title": "Attn Layout", "type": "string" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage", "attn_layout" ] }
- Config:
extra: str = allow
- Fields:
attn_layout (Literal['bsnd', 'bnsd'])
- field attn_layout: Literal['bsnd', 'bnsd'] [Required]
Layout expected by the attention backend.
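The two layouts differ only in the order of the head (n) and sequence (s) dimensions, so converting between them is a swap of dimensions 1 and 2. The sketch below illustrates that swap on nested Python lists; `swap_heads_and_seq` is a hypothetical helper for illustration, and on real tensors the equivalent operation would be a transpose of dims 1 and 2.

```python
def swap_heads_and_seq(x):
    """Swap dims 1 and 2 of a 4-D nested list.

    Maps [b][n][s][d] to [b][s][n][d] and, applied again, back to [b][n][s][d].
    """
    return [[list(per_position) for per_position in zip(*batch)] for batch in x]


# b=1, n=2 heads, s=3 positions, d=1; each element is labeled by head and position.
bnsd = [[[[f"h{h}t{t}"] for t in range(3)] for h in range(2)]]
bsnd = swap_heads_and_seq(bnsd)

print(bnsd[0][1][2])  # head 1, position 2 in bnsd
print(bsnd[0][2][1])  # the same element, addressed as position 2, head 1 in bsnd
```

Applying the swap twice returns the original data, which is why a layout-converting pass can transpose the inputs, run the op in 'bsnd', and transpose the output back without changing results.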
Match RoPE Pattern
Transform key: match_rope_pattern
Source module: tensorrt_llm._torch.auto_deploy.transform.library.rope
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.rope.MatchRopePattern(config: TransformConfig)
Bases:
BaseTransform
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Match RoPE Layout
Transform key: match_rope_layout
Source module: tensorrt_llm._torch.auto_deploy.transform.library.rope
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.rope.MatchRopeLayout(config: TransformConfig)
Bases:
BaseTransform
Match the inputs and outputs of rope ops and transform them to the specified layout, as required by optimized ops. The supported layout is 'bsnd' (batch, seq, head, dim).
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.rope.MatchRopeLayoutConfig
Bases:
TransformConfig
Configuration for the match rope layout transform.
JSON schema:
{ "title": "MatchRopeLayoutConfig", "description": "Configuration for the match rope layout transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "expected_layout": { "default": "bsnd", "description": "The expected layout of the rope operation. Must be one of 'bsnd' or 'bnsd'.", "title": "Expected Layout", "type": "string" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
expected_layout (str)
- field expected_layout: str = 'bsnd'
The expected layout of the rope operation. Must be one of ‘bsnd’ or ‘bnsd’.
Match RMSNorm Pattern
Transform key: match_rmsnorm_pattern
Source module: tensorrt_llm._torch.auto_deploy.transform.library.rms_norm
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.rms_norm.MatchRMSNormPattern(config: TransformConfig)
Bases:
BaseTransform
Matches RMSNorm patterns in the graph and replaces them with torch_rmsnorm op.
This transform runs in the pattern_matcher stage and standardizes RMSNorm patterns to use torch_rmsnorm op, which can later be fused to a specific backend in the post_load_fusion stage.
- Parameters:
gm – Input graph module to transform.
- Returns:
Transformed graph module with standardized torch_rmsnorm operations.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
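As a reminder of the computation that an RMSNorm pattern represents, here is a minimal pure-Python sketch for a single vector. The function name, the default `eps`, and the placement of `eps` inside the square root follow the usual RMSNorm definition and are assumptions here; the exact variants that the matcher recognizes are not listed on this page.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


out = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
# With unit weights and eps=0, the output has a mean square of 1 (up to rounding).
print(sum(v * v for v in out) / len(out))
```

Model code typically spells this out as several elementwise ops (square, mean, add eps, rsqrt, multiply), which is the kind of multi-node subgraph a pattern matcher can collapse into one standardized op.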
Match L2Norm Pattern
Transform key: match_l2norm_pattern
Source module: tensorrt_llm._torch.auto_deploy.transform.library.l2_norm
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.l2_norm.MatchL2NormPattern(config: TransformConfig)
Bases:
BaseTransform
Matches L2Norm patterns in the graph and replaces them with torch_l2norm op.
This transform runs in the pattern_matcher stage and standardizes L2Norm patterns to use torch_l2norm op, which can later be fused to a specific backend in the post_load_fusion stage.
- Parameters:
gm – Input graph module to transform.
- Returns:
Transformed graph module with standardized torch_l2norm operations.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
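For comparison with RMSNorm, an L2 normalization divides by the Euclidean norm rather than the root-mean-square and has no learned weight. The sketch below is a generic definition for one vector; the function name and the way `eps` is added are assumptions for illustration and may not match the exact form the transform detects.

```python
import math


def l2_norm(x, eps=1e-6):
    """Divide x by its Euclidean norm (eps guards against division by zero)."""
    inv_norm = 1.0 / math.sqrt(sum(v * v for v in x) + eps)
    return [v * inv_norm for v in x]


unit = l2_norm([3.0, 4.0], eps=0.0)
print(unit)  # approximately [0.6, 0.8]
```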
Match MoE Routing Pattern
Transform key: match_moe_routing_pattern
Source module: tensorrt_llm._torch.auto_deploy.transform.library.moe_routing
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.moe_routing.MatchMoeRoutingPattern(config: TransformConfig)
Bases:
BaseTransform
Match softmax → topk → renormalize and replace with a fused Triton op.
This transform detects the 3-op MoE routing pattern:
routing_weights = softmax(logits, dtype=float32)
routing_weights, indices = topk(routing_weights, k)
routing_weights /= routing_weights.sum(keepdim=True)
and replaces it with:
routing_weights, indices = triton_fused_topk_softmax(logits, k)
The fused kernel exploits the equivalence
topk(softmax(x)) / Σ ≡ softmax(topk(x))
and avoids computing softmax over all experts.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
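The equivalence behind the fused routing op can be checked numerically. The sketch below implements both sides in plain Python for one token's logits; `route_reference` and `route_fused` are illustrative names, not AutoDeploy functions, and the Triton kernel itself is not reproduced here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk(xs, k):
    """Return the k largest values and their indices, largest first."""
    idx = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)[:k]
    return [xs[i] for i in idx], idx


def route_reference(logits, k):
    """softmax over all experts -> topk -> renormalize."""
    weights, idx = topk(softmax(logits), k)
    total = sum(weights)
    return [w / total for w in weights], idx


def route_fused(logits, k):
    """topk on raw logits -> softmax over only the k selected values."""
    values, idx = topk(logits, k)
    return softmax(values), idx


logits = [1.0, 3.0, 0.5, 2.0]
ref_w, ref_i = route_reference(logits, 2)
fused_w, fused_i = route_fused(logits, 2)
print(ref_i, fused_i)
print(ref_w, fused_w)
```

The two agree because softmax is monotonic (so top-k picks the same experts) and the full-softmax denominator cancels during renormalization, leaving exp(x_i) divided by the sum over the selected experts only.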
Eliminate Redundant Transposes
Transform key: eliminate_redundant_transposes
Source module: tensorrt_llm._torch.auto_deploy.transform.library.eliminate_redundant_transposes
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.eliminate_redundant_transposes.EliminateRedundantTransposes(config: TransformConfig)
Bases:
BaseTransform
Eliminate redundant transpose operations in the graph.
This transformation identifies pairs of consecutive transpose operations with the same dimension arguments and removes both operations, as they cancel out.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
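The cancellation rule can be illustrated on a toy linear chain of ops. This is a simplified sketch, not the transform's implementation: it models ops as tuples in a list, whereas a real graph pass would also have to confirm that the first transpose has no other consumers before removing it. The op names other than "transpose" are placeholders.

```python
def eliminate_redundant_transposes(ops):
    """Remove adjacent transpose pairs that use the same dimension arguments.

    ops is a list of tuples such as ("transpose", 1, 2) or ("linear",).
    A stack is used so that pairs exposed by an earlier removal also cancel.
    """
    kept = []
    for op in ops:
        if kept and op[0] == "transpose" and kept[-1] == op:
            kept.pop()  # transpose(a, b) followed by transpose(a, b) is the identity
        else:
            kept.append(op)
    return kept


chain = [("linear",), ("transpose", 1, 2), ("transpose", 1, 2), ("softmax",)]
cleaned = eliminate_redundant_transposes(chain)
print(cleaned)
```

Transposes with different dimension arguments are left alone, since they do not cancel.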
Quantize Int4 Linear From Config
Transform key: quantize_int4_linear_from_config
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantization.INT4LinearQuantizationFromConfig(config: TransformConfig)
Bases:
Quantization
Config-based INT4 (AWQ) for the unified ModelOpt checkpoints.
- static quantize_weight(original_weight: Tensor)
Returns the quantized weight from the original unquantized weight.
- static scale_names() → List[str]
Returns the list of names of the scales for this quantization.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Quantize Int4 Gptq Linear From Config
Transform key: quantize_int4_gptq_linear_from_config
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantization.INT4GPTQLinearQuantizationFromConfig(config: TransformConfig)
Bases:
Quantization
Config-based INT4 GPTQ quantization for GPTQ-quantized checkpoints.
- GPTQ uses:
qweight: [K/8, N] int32 (8 packed int4 values per int32)
qzeros: [G, N/8] int32 (packed zero points)
scales: [G, N] float (per-group scales)
- static quantize_weight(original_weight: Tensor)
Returns placeholder qweight tensor [K/8, N] int32.
- static scale_names() → List[str]
Returns the list of names of the scales for this quantization.
- static default_scales(original_weight_shape: Tuple)
Returns placeholder tensors for GPTQ scales and qzeros.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
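The GPTQ tensor shapes listed above follow from packing eight 4-bit values into each 32-bit word. The sketch below computes those shapes and shows one possible packing. Two points are assumptions made for illustration: that G equals K divided by a `group_size` parameter, and that nibbles are packed lowest-first. Neither is stated on this page, and real checkpoints may order nibbles differently.

```python
def gptq_shapes(k, n, group_size):
    """Shapes of the GPTQ tensors for a [K, N] weight (assumes G = K // group_size)."""
    g = k // group_size
    return {"qweight": (k // 8, n), "qzeros": (g, n // 8), "scales": (g, n)}


def pack_int4(values):
    """Pack eight unsigned 4-bit values into one 32-bit word, lowest nibble first."""
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word  # unsigned bit pattern; an int32 tensor would hold the same bits


def unpack_int4(word):
    """Inverse of pack_int4: recover the eight 4-bit values."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


print(gptq_shapes(4096, 4096, 128))
print(hex(pack_int4([1, 2, 3, 4, 5, 6, 7, 8])))
```

The factor of 8 in `K/8` and `N/8` is simply 32 bits divided by 4 bits per value.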
Quantize FP8 Linear From Config
Transform key: quantize_fp8_linear_from_config
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantization.FP8LinearQuantizationFromConfig(config: TransformConfig)
Bases:
Quantization
- quantize_weight(w: Tensor)
Returns the quantized weight from the original unquantized weight.
- default_scales(_shape: Tuple)
Returns a dict of the default scale values for this quantization.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Quantize NVFP4 Linear From Config
Transform key: quantize_nvfp4_linear_from_config
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantization.NVFP4LinearQuantizationFromConfig(config: TransformConfig)
Bases:
Quantization
- quantize_weight(w: Tensor)
Returns the quantized weight from the original unquantized weight.
- default_scales(original_weight_shape: Tuple)
Returns a dict of the default scale values for this quantization.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Quantize Finegrained FP8 Linear From Config
Transform key: quantize_finegrained_fp8_linear_from_config
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantization.FineGrainedFP8LinearQuantization(config: TransformConfig)
Bases:
Quantization
Quantization transform for FineGrainedFP8 (block-wise FP8) models.
This transform replaces linear ops with the FineGrainedFP8 quantized op. The FineGrained FP8 format uses per-block weight scales (weight_scale_inv) and dynamic input quantization.
- Config format (from HF config.json):
"quantization_config": { "quant_method": "fp8", "weight_block_size": [128, 128], "modules_to_not_convert": ["lm_head"] }
- quantize_weight(w: Tensor)
Returns the quantized weight from the original unquantized weight.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
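To make "per-block weight scales" concrete, the sketch below computes how many scale entries a weight needs for a given weight_block_size and applies one scale per block. It assumes that dequantization multiplies each quantized value by its block's weight_scale_inv entry, which is how this checkpoint format is commonly described but is not confirmed on this page. Both function names are hypothetical.

```python
def block_scale_shape(weight_shape, block_size=(128, 128)):
    """Number of scale entries per dimension, rounding partial blocks up."""
    rows, cols = weight_shape
    block_rows, block_cols = block_size
    return (-(-rows // block_rows), -(-cols // block_cols))  # ceiling division


def dequantize_blockwise(q, scale_inv, block_size):
    """Multiply each element by the scale of the block it falls in (assumed convention)."""
    block_rows, block_cols = block_size
    return [
        [q[i][j] * scale_inv[i // block_rows][j // block_cols] for j in range(len(q[0]))]
        for i in range(len(q))
    ]


print(block_scale_shape((512, 300)))  # 4 row blocks, 3 column blocks (last one partial)

q = [[1, 2, 3, 4], [5, 6, 7, 8]]      # toy "quantized" values
scale_inv = [[0.5, 2.0]]              # one scale per 2x2 block
print(dequantize_blockwise(q, scale_inv, (2, 2)))
```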
Quantize Finegrained FP8 MoE
Transform key: quantize_finegrained_fp8_moe
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe.QuantizeFineGrainedFP8MOE(config: TransformConfig)
Bases:
Quantization
Traverse gm, find every torch.ops.auto_deploy.torch_moe, and replace it with the FineGrainedFP8 quantized version.
- This transform handles FineGrained FP8 quantization config format:
"quantization_config": { "quant_method": "fp8", "weight_block_size": [128, 128], "modules_to_not_convert": ["gate", "lm_head"] }
- quantize_weight(w: Tensor)
Returns the quantized weight from the original unquantized weight.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Quantize FP8 Bmm From Config
Transform key: quantize_fp8_bmm_from_config
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantization.FP8BMMQuantizationFromConfig(config: TransformConfig)
Bases:
Quantization
- quantize_weight(w: Tensor)
Returns the quantized weight from the original unquantized weight.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Quantize FP8 From Graph
Transform key: quantize_fp8_from_graph
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantization.FP8QuantizationFromGraph(config: TransformConfig)
Bases:
FP8LinearQuantizationFromConfig
Fuse ModelOpt-quantized FP8 linears into fused ops.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Quantize NVFP4 From Graph
Transform key: quantize_nvfp4_from_graph
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantization.NVFP4QuantizationFromGraph(config: TransformConfig)
Bases:
NVFP4LinearQuantizationFromConfig
Fuse ModelOpt-quantized NVFP4 linears into fused ops.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Match SwiGLU Pattern
Transform key: match_swiglu_pattern
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.MatchSwiGLUPattern(config: TransformConfig)
Bases:
BaseTransform
Matches SwiGLU MLP patterns and replaces them with the torch_swiglu_mlp op.
- This transform runs in the pattern_matcher stage and detects the following pattern:
silu(x @ gate.T) * (x @ up.T) @ down.T
And replaces it with a single torch_swiglu_mlp op that can be fused later.
Uses ADPatternMatcherPass for declarative pattern matching.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
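The SwiGLU expression that this transform detects can be written out for a single input vector in plain Python. This sketch only illustrates the arithmetic: the helper names are hypothetical, weights are stored as [out_features, in_features] rows so that `matvec(w, x)` plays the role of `x @ w.T`, and the actual torch_swiglu_mlp op is not reproduced.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(w, x):
    """Compute x @ w.T for a single vector x and a row-major weight w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_mlp(x, gate, up, down):
    """(silu(x @ gate.T) * (x @ up.T)) @ down.T for one input vector."""
    gated = matvec(gate, x)
    lifted = matvec(up, x)
    hidden = [silu(g) * u for g, u in zip(gated, lifted)]
    return matvec(down, hidden)


x = [1.0, 2.0]
gate = [[1.0, 0.0], [0.0, 1.0]]
up = [[1.0, 1.0], [1.0, -1.0]]
down = [[1.0, 1.0]]
y = swiglu_mlp(x, gate, up, down)
print(y)
```

Since `*` and `@` share precedence and associate left to right in Python, the pattern as written multiplies the gated and up projections elementwise first and only then applies the down projection, which is what the sketch does.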
Match NVFP4 SwiGLU Pattern
Transform key: match_nvfp4_swiglu_pattern
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.MatchNVFP4SwiGLUPattern(config: TransformConfig)
Bases:
BaseTransform
Matches NVFP4 quantized SwiGLU MLP patterns and replaces them with torch_nvfp4_swiglu_mlp.
This transform runs in the pattern_matcher stage AFTER quantize_nvfp4_linear_from_config has converted torch_linear_simple ops to torch_fake_quant_nvfp4_linear ops.
- It detects the following NVFP4 pattern:
silu(nvfp4_linear(x, gate)) * nvfp4_linear(x, up) -> nvfp4_linear(down)
And replaces it with a single torch_nvfp4_swiglu_mlp op that can be fused later.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Match Finegrained FP8 SwiGLU Pattern
Transform key: match_finegrained_fp8_swiglu_pattern
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.MatchFineGrainedFP8SwiGLUPattern(config: TransformConfig)
Bases:
BaseTransform
Matches FineGrained FP8 quantized SwiGLU MLP patterns.
This transform runs in the pattern_matcher stage AFTER quantize_finegrained_fp8_linear_from_config has converted torch_linear_simple ops to torch_fake_quant_finegrained_fp8_linear ops.
- It detects the following FineGrained FP8 pattern:
silu(fp8_linear(x, gate)) * fp8_linear(x, up) -> fp8_linear(down)
And replaces it with a single torch_finegrained_fp8_swiglu_mlp op that can be fused later.
Note: This transform runs before sharding. The composite SwiGLU op will NOT be sharded by the sharding transform. Enable only when sharding is not needed or when sharding-aware handling is added separately.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Quantize FP8 MoE
Transform key: quantize_fp8_moe
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe.QuantizeFP8MOE(config: TransformConfig)
Bases:
FP8LinearQuantizationFromConfig
Traverse gm, find every torch.ops.auto_deploy.torch_moe, and replace it with the quantized version using the quant_algo from quant_config.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Quantize NVFP4 MoE
Transform key: quantize_nvfp4_moe
Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe.QuantizeNVFP4MOE(config: TransformConfig)
Bases:
NVFP4LinearQuantizationFromConfig
Traverse gm, find every torch.ops.auto_deploy.torch_moe, and replace it with the quantized version using the quant_algo from quant_config.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Quantize MXFP4 MoE#
Transform key: quantize_mxfp4_moe
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4.QuantizeMXFP4MOE(
- config: TransformConfig,
Bases:
BaseTransformQuantize MXFP4 MoE: dispatch to triton or trtllm-gen backend.
Replaces
(torch_moe_router -> torch_moe_dense_mlp)with a single fused MoE op. The chosen backend determines the destination op and the parameter layout registered on the experts module:backend="triton"→auto_deploy::triton_mxfp4_moewith raw HF MXFP4 layout (_blocks/_scales/_bias). Lazy weight swizzling happens inside the Triton kernel on first forward.backend="trtllm"→auto_deploy::trtllm_quant_mxfp4_*_moe_fusedwith trtllm-gen prepared layout (fc1_w_trtllm/fc1_w_scale_trtllm/ …). Weight preparation (shuffle + interleave) is done on CPU inside a state-dict pre-hook registered by this transform, so the raw HF tensors are converted before being moved to GPU.
- classmethod get_config_class() -> Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4.QuantizeMXFP4MOEConfig
Bases:
TransformConfig
Configuration for quantize_mxfp4_moe.
Show JSON schema
{ "title": "QuantizeMXFP4MOEConfig", "description": "Configuration for ``quantize_mxfp4_moe``.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "backend": { "anyOf": [ { "enum": [ "triton", "trtllm" ], "type": "string" }, { "type": "null" } ], "default": null, "description": "MXFP4 MoE kernel backend selection. When unset (``None``), the default is SM-based: ``trtllm`` on SM>=100 (Blackwell), ``triton`` otherwise. Explicit ``triton`` or ``trtllm`` overrides the default. ``trtllm`` on SM<100 silently falls back to ``triton`` with a warning.", "title": "Backend" }, "trtllm_quant_act": { "default": "mxfp8", "description": "Only used when ``backend='trtllm'``. Activation precision for the trtllm-gen MoE GEMM, passed as ``act_dtype`` to ``trtllm_quant_mxfp4_trtllm_gen_moe_fused``: ``bf16`` dispatches to the bf16 MoE runner (W4A16), ``mxfp8`` pre-quantizes the activation to MXFP8 and dispatches to the MXFP8 MoE runner (W4A8, faster cubin family). Default ``mxfp8`` matches the modeling-side default.", "enum": [ "bf16", "mxfp8" ], "title": "Trtllm Quant Act", "type": "string" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
backend (Literal['triton', 'trtllm'] | None)
trtllm_quant_act (Literal['bf16', 'mxfp8'])
- field backend: Literal['triton', 'trtllm'] | None = None
MXFP4 MoE kernel backend selection. When unset (None), the default is SM-based: trtllm on SM>=100 (Blackwell), triton otherwise. An explicit triton or trtllm overrides the default. Requesting trtllm on SM<100 falls back to triton with a warning.
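The selection rule described for backend can be sketched in plain Python. This is only an illustration of the documented behavior, not the transform's actual code: the helper name resolve_mxfp4_backend and the integer sm_version argument (for example 90 or 100) are hypothetical, and how the real transform queries the device is not shown here.

```python
import warnings
from typing import Optional


def resolve_mxfp4_backend(requested: Optional[str], sm_version: int) -> str:
    """Hypothetical helper mirroring the documented backend selection rule.

    requested:  the `backend` config value (None, "triton" or "trtllm").
    sm_version: assumed integer SM version of the target GPU, e.g. 90 or 100.
    """
    if requested not in (None, "triton", "trtllm"):
        raise ValueError(f"unsupported backend: {requested!r}")

    # Unset: SM-based default (trtllm on SM>=100, triton otherwise).
    if requested is None:
        return "trtllm" if sm_version >= 100 else "triton"

    # Explicit trtllm on an older GPU: fall back to triton with a warning.
    if requested == "trtllm" and sm_version < 100:
        warnings.warn("trtllm backend requested on SM<100; falling back to triton")
        return "triton"

    # Any other explicit choice overrides the default.
    return requested
```

Under these assumptions, resolve_mxfp4_backend(None, 100) picks trtllm, while resolve_mxfp4_backend("trtllm", 90) warns and returns triton.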
- field trtllm_quant_act: Literal['bf16', 'mxfp8'] = 'mxfp8'
Only used when backend='trtllm'. Activation precision for the trtllm-gen MoE GEMM, passed as act_dtype to trtllm_quant_mxfp4_trtllm_gen_moe_fused: bf16 dispatches to the bf16 MoE runner (W4A16); mxfp8 pre-quantizes the activation to MXFP8 and dispatches to the MXFP8 MoE runner (W4A8, faster cubin family). The default, mxfp8, matches the modeling-side default.
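To make the field constraints concrete, the sketch below checks a parsed config entry for quantize_mxfp4_moe against the enum values listed in the JSON schema and reports which weight/activation mode the trtllm-gen path would use. It is a stand-alone illustration, not the library's pydantic validation: the function name check_quantize_mxfp4_moe_entry is hypothetical, the stage value in the usage note is only a placeholder drawn from the Stages enum, and the helper looks only at an explicit backend value (the SM-based default is not modelled).

```python
from typing import Any, Dict, Optional

# Allowed values copied from the JSON schema above.
STAGES = (
    "factory", "export", "post_export", "pattern_matcher", "sharding",
    "weight_load", "post_load_fusion", "cache_init", "visualize", "compile",
)
BACKENDS = (None, "triton", "trtllm")
# Activation precision -> weight/activation mode named in the field description.
ACT_MODES = {"bf16": "W4A16", "mxfp8": "W4A8"}


def check_quantize_mxfp4_moe_entry(entry: Dict[str, Any]) -> Optional[str]:
    """Hypothetical check of a parsed `quantize_mxfp4_moe` config entry.

    Returns "W4A16" or "W4A8" when backend is explicitly "trtllm",
    and None otherwise, since trtllm_quant_act is only used on that path.
    """
    if entry.get("stage") not in STAGES:
        raise ValueError("stage is required and must be a known pipeline stage")

    backend = entry.get("backend")  # defaults to None (SM-based selection)
    if backend not in BACKENDS:
        raise ValueError(f"unsupported backend: {backend!r}")

    act = entry.get("trtllm_quant_act", "mxfp8")  # schema default
    if act not in ACT_MODES:
        raise ValueError(f"unsupported trtllm_quant_act: {act!r}")

    return ACT_MODES[act] if backend == "trtllm" else None
```

For example, an entry such as {"stage": "pattern_matcher", "backend": "trtllm", "trtllm_quant_act": "bf16"} would be reported as W4A16, and the same entry without trtllm_quant_act as W4A8.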