Post-Load Fusion Stage#
Post-load fusion applies performance optimizations that need loaded weights, device tensors, or the final post-sharding graph structure. This stage includes kernel fusions for quantized linear layers, MoE, normalization, activation, RoPE, and related inference patterns.
Fuse Gemms Mixed Children#
Transform key: fuse_gemms_mixed_children
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fusion
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fusion.FuseGemmsMixedChildren(
- config: TransformConfig,
Bases:
BaseTransform
Fuse linear projections sharing the same input, even when the parent has non-linear users (e.g., shape access).
This is a relaxed variant of FuseGemms: it does NOT require all children of the parent to be linear ops — only that at least 2 linear children exist. The fused output is split via torch.narrow (zero-copy view).
Handles both non-quantized and quantized (FP8, FP4) linear ops. Nodes are grouped by (parent, quantization scheme) so only linears with the same parent AND the same op target are fused together.
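The grouping rule can be illustrated with a small sketch. It uses a hypothetical `Node` record and illustrative target labels rather than real FX graph nodes, so it only shows the selection logic described above, not the transform's actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a graph node: name, input (parent), op target."""
    name: str
    parent: str
    target: str


def group_fusable_linears(nodes, linear_targets):
    """Group linear nodes by (parent, op target); keep groups with >= 2 members."""
    groups = defaultdict(list)
    for node in nodes:
        # Non-linear users of the parent (e.g. shape access) are ignored here;
        # they do not stop the linear siblings from being grouped.
        if node.target in linear_targets:
            groups[(node.parent, node.target)].append(node.name)
    return {key: names for key, names in groups.items() if len(names) >= 2}


# Illustrative labels only; the real op targets are torch ops.
LINEAR_TARGETS = {"linear", "fp8_linear", "fp4_linear"}

nodes = [
    Node("q_proj", "hidden", "linear"),
    Node("k_proj", "hidden", "linear"),
    Node("v_proj", "hidden", "linear"),
    Node("size", "hidden", "shape_access"),  # non-linear user of the same parent
    Node("gate", "mlp_in", "fp8_linear"),
    Node("up", "mlp_in", "fp8_linear"),
    Node("extra", "mlp_in", "fp4_linear"),  # different scheme, so left unfused
]

fusable = group_fusable_linears(nodes, LINEAR_TARGETS)
print(fusable)
```

In this sketch the lone `fp4_linear` node is dropped because its group has a single member, while the shape access on `hidden` has no effect on the q/k/v group.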
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse Gemms#
Transform key: fuse_gemms
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fusion
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fusion.FuseGemms(
- config: TransformConfig,
Bases:
BaseTransform
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse FP4 Gemms#
Transform key: fuse_fp4_gemms
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fusion
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fusion.FuseFP4Gemms(
- config: TransformConfig,
Bases:
QuantizationFusionMixin, BaseTransform
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse FP8 Gemms#
Transform key: fuse_fp8_gemms
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fusion
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fusion.FuseFP8Gemms(
- config: TransformConfig,
Bases:
QuantizationFusionMixin, BaseTransform
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse FP8 Linear#
Transform key: fuse_fp8_linear
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseFP8Linear(
- config: TransformConfig,
Bases:
BaseTransform
Matches and replaces FP8 fake quantized linear ops with fused torch backend ops.
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseFP8LinearConfig[source]
Bases:
TransformConfig
Configuration for FP8 linear fusion transform.
Show JSON schema
{ "title": "FuseFP8LinearConfig", "description": "Configuration for FP8 linear fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. 
None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "backend": { "default": "torch", "description": "Backend to use for FP8 linear computation (default: 'torch').", "title": "Backend", "type": "string" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
backend (str)
- field backend: str = 'torch'
Backend to use for FP8 linear computation (default: ‘torch’).
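As a rough illustration of what the schema implies for a `fuse_fp8_linear` entry, the sketch below applies the documented defaults to an already-parsed entry and checks the required `stage` field. `resolve_entry` is a hypothetical helper written for this example; in AutoDeploy the validation is performed by the pydantic model itself.

```python
# Stage names and defaults are copied from the JSON schema above.
STAGES = [
    "factory", "export", "post_export", "pattern_matcher", "sharding",
    "weight_load", "post_load_fusion", "cache_init", "visualize", "compile",
]

FP8_LINEAR_DEFAULTS = {
    "run_per_gm": True,
    "enabled": True,
    "skip_on_error": False,
    "run_graph_cleanup": True,
    "run_shape_prop": False,
    "requires_clean_graph": True,
    "requires_shape_prop": False,
    "debug_visualize_dir": None,
    "expect_mem_change": False,
    "backend": "torch",
}


def resolve_entry(entry, defaults):
    """Hypothetical helper: check the required ``stage`` and fill in defaults."""
    if "stage" not in entry:
        raise ValueError("'stage' is required")
    if entry["stage"] not in STAGES:
        raise ValueError(f"unknown stage: {entry['stage']!r}")
    # The schema sets additionalProperties to true, so unknown keys are kept.
    return {**defaults, **entry}


# What a fuse_fp8_linear entry could look like once the YAML has been parsed.
entry = {"stage": "post_load_fusion", "skip_on_error": True}
resolved = resolve_entry(entry, FP8_LINEAR_DEFAULTS)
print(resolved["backend"], resolved["skip_on_error"], resolved["enabled"])

# The stage enum is ordered: post_load_fusion runs after weight_load
# and before cache_init.
is_between = (
    STAGES.index("weight_load")
    < STAGES.index("post_load_fusion")
    < STAGES.index("cache_init")
)
print(is_between)
```

Only `stage` has to be given explicitly; every other field falls back to the default listed in the schema.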
Fuse TRT-LLM Attn Quant FP8#
Transform key: fuse_trtllm_attn_quant_fp8
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_trtllm_attention_quant_fp8
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_trtllm_attention_quant_fp8.FuseTrtllmAttentionQuantFP8(
- config: TransformConfig,
Bases:
BaseTransform
Prepare attention->FP8 linear path so TRTLLM attention can emit FP8 directly.
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse NVFP4 Linear#
Transform key: fuse_nvfp4_linear
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseNVFP4Linear(
- config: TransformConfig,
Bases:
BaseTransform
Matches and replaces NVFP4 fake quantized linear ops with fused TensorRT-LLM ops.
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseNVFP4LinearConfig[source]
Bases:
TransformConfig
Configuration for NVFP4 linear fusion transform.
Show JSON schema
{ "title": "FuseNVFP4LinearConfig", "description": "Configuration for NVFP4 linear fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. 
None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "backend": { "default": "trtllm", "description": "Backend to use for NVFP4 linear computation (default: 'trtllm').", "title": "Backend", "type": "string" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
backend (str)
- field backend: str = 'trtllm'
Backend to use for NVFP4 linear computation (default: ‘trtllm’).
Fuse Relu2 Quant NVFP4#
Transform key: fuse_relu2_quant_nvfp4
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_relu2_quant_nvfp4
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_relu2_quant_nvfp4.FuseRelu2QuantNVFP4(
- config: TransformConfig,
Bases:
BaseTransform
Fuse matcher-supported ReLU² + NVFP4 quantization patterns.
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse NVFP4 SwiGLU#
Transform key: fuse_nvfp4_swiglu
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.FuseNVFP4SwiGLU(
- config: TransformConfig,
Bases:
BaseTransform
Fuses torch_nvfp4_swiglu_mlp ops by concatenating gate and up FP4 weights.
This transform runs in the post_load_fusion stage and replaces torch_nvfp4_swiglu_mlp ops with fused_nvfp4_swiglu_mlp ops that use a single concatenated gate+up weight matrix.
FP4 weight fusion:
- gate+up packed weights are concatenated along dim=0
- gate+up per-block weight scales are concatenated along dim=0
- gate+up input_scale and alpha must match (shared input)
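The effect of concatenating gate and up along dim=0 can be sketched with plain Python floats. This is only an illustration of the idea: the real op works on packed FP4 weights and per-block scales, and the dictionary layout and helper names below are assumptions made for the example.

```python
import math


def matvec(weight, x):
    """Compute weight @ x for a row-major [N, K] weight and a length-K vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(v):
    return v / (1.0 + math.exp(-v))


def fuse_gate_up(gate, up):
    """Concatenate gate and up along dim 0 after checking the shared scalars."""
    if gate["input_scale"] != up["input_scale"] or gate["alpha"] != up["alpha"]:
        raise ValueError("gate and up must share input_scale and alpha")
    return {
        "weight": gate["weight"] + up["weight"],  # [N, K] + [N, K] -> [2N, K]
        "input_scale": gate["input_scale"],
        "alpha": gate["alpha"],
    }


gate = {"weight": [[1.0, 2.0], [0.5, -1.0]], "input_scale": 0.25, "alpha": 2.0}
up = {"weight": [[3.0, 0.0], [1.0, 1.0]], "input_scale": 0.25, "alpha": 2.0}
x = [1.0, -2.0]

# Unfused: two separate projections followed by silu(gate) * up.
unfused = [
    silu(g) * u
    for g, u in zip(matvec(gate["weight"], x), matvec(up["weight"], x))
]

# Fused: one projection over the concatenated weight, then split the output.
fused = fuse_gate_up(gate, up)
out = matvec(fused["weight"], x)
n = len(gate["weight"])
fused_result = [silu(g) * u for g, u in zip(out[:n], out[n:])]

print(fused_result == unfused)
```

Because each output row depends only on its own weight row, stacking the rows changes how many GEMM launches are needed, not the values produced.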
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse Finegrained FP8 SwiGLU#
Transform key: fuse_finegrained_fp8_swiglu
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.FuseFineGrainedFP8SwiGLU(
- config: TransformConfig,
Bases:
BaseTransform
Fuses torch_finegrained_fp8_swiglu_mlp ops by concatenating gate and up FP8 weights.
This transform runs in the post_load_fusion stage and replaces torch_finegrained_fp8_swiglu_mlp ops with one of:
- fused_finegrained_fp8_swiglu_mlp: default FP32 per-block scale path using trtllm_finegrained_fp8_linear internally.
- fused_finegrained_fp8_deepgemm_swiglu_mlp: Blackwell (SM100f) UE8M0 path using trtllm_fp8_deepgemm internally. Selected at compile time when the concatenated gate+up and down weight scales are UE8M0 packed int (set by FineGrainedFP8LinearQuantization.post_load_hook).
FP8 weight fusion:
- gate+up FP8 weights are concatenated along dim=0: [N, K] -> [2N, K]
- gate+up per-block weight scales are concatenated along dim=0: [N/128, K/128] -> [2N/128, K/128]
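The shape bookkeeping for the FP8 weight fusion can be checked with a few lines of arithmetic. The helper name and the example sizes are hypothetical, and the sketch assumes N and K are multiples of the 128 block size.

```python
def fused_swiglu_shapes(n, k, block=128):
    """Return (weight_shape, scale_shape) after concatenating gate and up on dim 0."""
    if n % block or k % block:
        # Assumption of this sketch, not a documented requirement.
        raise ValueError("N and K must be multiples of the block size")
    weight_shape = (2 * n, k)                    # [N, K] -> [2N, K]
    scale_shape = (2 * n // block, k // block)   # [N/128, K/128] -> [2N/128, K/128]
    return weight_shape, scale_shape


weight_shape, scale_shape = fused_swiglu_shapes(n=1536, k=4096)
print(weight_shape, scale_shape)  # (3072, 4096) (24, 32)
```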
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse Finegrained FP8 Linear#
Transform key: fuse_finegrained_fp8_linear
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseFineGrainedFP8Linear(
- config: TransformConfig,
Bases:
BaseTransform
Matches and replaces FineGrained FP8 fake quantized linear ops with TRT-LLM ops.
Two-stage pipeline:
- Pattern matcher rewrites torch_fake_quant_finegrained_fp8_linear (HuggingFace triton kernel) to trtllm_finegrained_fp8_linear (TRT-LLM fp8_block_scaling_gemm with FP32 per-block scales).
- A compile-time dispatch pass further rewrites any nodes whose weight_scale buffer is UE8M0 packed int (produced by FineGrainedFP8LinearQuantization.post_load_hook on SM100f) to the dedicated trtllm_fp8_deepgemm op. Keeping the SM100f/UE8M0 path in a separate op avoids per-call hardware / dtype branching inside the runtime op.
Used for models like MiniMax M2 and DeepSeek that use HuggingFace’s FineGrained FP8 quantization format with 128x128 block sizes.
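The two stages can be pictured as two passes over a list of ops. The sketch below represents each node as a plain dictionary and uses an `"int32"` dtype label as the marker for "UE8M0 packed int"; both are assumptions of this example, and the real transform operates on FX graph nodes and their weight_scale buffers.

```python
FAKE_QUANT_OP = "torch_fake_quant_finegrained_fp8_linear"
FP32_SCALE_OP = "trtllm_finegrained_fp8_linear"
DEEPGEMM_OP = "trtllm_fp8_deepgemm"


def pattern_match(ops):
    """Stage 1: rewrite the fake-quant op to the FP32 per-block scale op."""
    return [
        dict(op, target=FP32_SCALE_OP) if op["target"] == FAKE_QUANT_OP else op
        for op in ops
    ]


def dispatch_ue8m0(ops):
    """Stage 2: route nodes with packed-int weight scales to the DeepGEMM op."""
    rewritten = []
    for op in ops:
        # "int32" is an assumed stand-in for the UE8M0 packed int scale buffer.
        if op["target"] == FP32_SCALE_OP and op["weight_scale_dtype"] == "int32":
            op = dict(op, target=DEEPGEMM_OP)
        rewritten.append(op)
    return rewritten


graph = [
    {"name": "q_proj", "target": FAKE_QUANT_OP, "weight_scale_dtype": "float32"},
    {"name": "o_proj", "target": FAKE_QUANT_OP, "weight_scale_dtype": "int32"},
    {"name": "act", "target": "silu", "weight_scale_dtype": None},
]

rewritten = dispatch_ue8m0(pattern_match(graph))
targets = {op["name"]: op["target"] for op in rewritten}
print(targets)
```

Deciding the op once, at graph-rewrite time, is what lets the runtime op skip any hardware or dtype check on each call.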
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseFineGrainedFP8LinearConfig[source]
Bases:
TransformConfig
Configuration for FineGrained FP8 linear fusion transform.
Show JSON schema
{ "title": "FuseFineGrainedFP8LinearConfig", "description": "Configuration for FineGrained FP8 linear fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. 
None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "backend": { "default": "trtllm", "description": "Backend to use for FineGrained FP8 linear computation (default: 'trtllm').", "title": "Backend", "type": "string" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
backend (str)
- field backend: str = 'trtllm'
Backend to use for FineGrained FP8 linear computation (default: ‘trtllm’).
Fuse MXFP4 MoE#
Transform key: fuse_mxfp4_moe
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4.FuseMXFP4Moe(
- config: TransformConfig,
Bases:
BaseTransform
POST_LOAD_FUSION transform: GPU-side MXFP4 MoE weight prep for the trtllm-gen backend.
Runs after QuantizeMXFP4MOE has registered the raw HF MXFP4 buffers and the EP-slice load hook has populated them. For each trtllm_quant_mxfp4_trtllm_gen_moe_fused node, calls prepare_trtllm_gen_moe_mxfp4_weights() on the loaded GPU tensors to produce the kernel layout, swaps the op args to the prepared params, and deletes the raw buffers.
Skipped when the op already references prepared params (idempotent).
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4.FuseMXFP4MoeConfig[source]
Bases:
TransformConfig
Configuration for fuse_mxfp4_moe (POST_LOAD_FUSION).
Show JSON schema
{ "title": "FuseMXFP4MoeConfig", "description": "Configuration for ``fuse_mxfp4_moe`` (POST_LOAD_FUSION).", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) 
stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
Fuse MoE#
Transform key: fuse_moe
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseMoe(
- config: TransformConfig,
Bases:
BaseTransform
Scan the FX graph and replace all calls to torch.ops.auto_deploy.torch_moe and torch.ops.auto_deploy.torch_moe_fused with torch.ops.auto_deploy.trtllm_moe_fused.
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseMoeConfig[source]
Bases:
TransformConfig
Configuration for MoE fusion transform.
Show JSON schema
{ "title": "FuseMoeConfig", "description": "Configuration for MoE fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "backend": { "default": "auto", "description": "Backend to use for MoE computation ('auto', 'trtllm' or 'triton'. 
default: 'auto').", "title": "Backend", "type": "string" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
backend (str)
- field backend: str = 'auto'
Backend to use for MoE computation (‘auto’, ‘trtllm’, or ‘triton’; default: ‘auto’).
Fuse FP8 MoE#
Transform key: fuse_fp8_moe
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseFP8Moe(
- config: TransformConfig,
Bases:
BaseTransform
Stack per-expert FP8 MoE weights and scales to avoid runtime stacking overhead. This runs after weights are loaded, similar to FuseMoe for unquantized MoE.
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseFP8MoeConfig[source]
Bases:
TransformConfig
Configuration for FP8 MoE fusion transform.
Show JSON schema
{ "title": "FuseFP8MoeConfig", "description": "Configuration for FP8 MoE fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "backend": { "default": "auto", "description": "Backend to use for FP8 MoE computation ('auto', 'trtllm' or 'triton'. 
default: 'auto').", "title": "Backend", "type": "string" }, "allow_different_input_scales": { "default": false, "description": "If False (default), assert that all experts have identical input scales and fail if not. If True, allow different per-expert input scales by using max(input_scale) for quantization. This matches TRT-LLM manual backend behavior but may impact accuracy if scales differ significantly.", "title": "Allow Different Input Scales", "type": "boolean" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
allow_different_input_scales (bool)
backend (str)
- field allow_different_input_scales: bool = False
If False (default), assert that all experts have identical input scales and fail if not. If True, allow different per-expert input scales by using max(input_scale) for quantization. This matches TRT-LLM manual backend behavior but may impact accuracy if scales differ significantly.
- field backend: str = 'auto'
Backend to use for FP8 MoE computation (‘auto’, ‘trtllm’, or ‘triton’; default: ‘auto’).
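The behaviour described for allow_different_input_scales can be sketched as a small reduction over per-expert scales. `collapse_input_scales` is a hypothetical helper written for this example and works on plain floats rather than tensors.

```python
def collapse_input_scales(scales, allow_different_input_scales=False):
    """Reduce per-expert FP8 input scales to the single scale used for quantization."""
    if all(s == scales[0] for s in scales):
        return scales[0]
    if not allow_different_input_scales:
        # Default behaviour: differing scales are treated as an error.
        raise AssertionError("experts have different input scales")
    # With the flag enabled, the documented fallback is max(input_scale).
    return max(scales)


same = collapse_input_scales([0.5, 0.5, 0.5])
relaxed = collapse_input_scales([0.5, 0.75, 0.25], allow_different_input_scales=True)
print(same, relaxed)  # 0.5 0.75
```

Leaving the flag at its default makes a mismatch fail loudly, which is the safer choice when the accuracy impact of a shared scale has not been evaluated.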
Fuse Finegrained FP8 MoE#
Transform key: fuse_finegrained_fp8_moe
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseFineGrainedFP8Moe(
- config: TransformConfig,
Bases:
BaseTransform
Stack per-expert FineGrainedFP8 MoE weights and block scales.
This transform replaces torch_quant_finegrained_fp8_moe ops with the fused trtllm_quant_finegrained_fp8_moe_fused kernel which is cudagraph-compatible.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse NVFP4 MoE#
Transform key: fuse_nvfp4_moe
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseNVFP4Moe(
- config: TransformConfig,
Bases:
BaseTransform
Stack per-expert NVFP4 MoE weights and scales to avoid runtime stacking overhead. This runs after weights are loaded, similar to FuseFP8Moe.
- classmethod get_config_class() Type[TransformConfig][source]#
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseNVFP4MoeConfig[source]
Bases:
TransformConfig
Configuration for NVFP4 MoE fusion transform.
Show JSON schema
{ "title": "FuseNVFP4MoeConfig", "description": "Configuration for NVFP4 MoE fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. 
None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "backend": { "default": "cutlass", "description": "Backend to use for NVFP4 MoE computation ('cutlass' or 'trtllm_gen').", "enum": [ "cutlass", "trtllm_gen" ], "title": "Backend", "type": "string" }, "allow_different_input_scales": { "default": false, "description": "If False (default), assert that all experts have identical input scales and fail if not. If True, allow different per-expert input scales by using min(input_scale) for quantization. Note: NVFP4 uses min() (not max like FP8) because scales are in kernel format (2688/amax): smaller scale = larger amax = larger dynamic range. This may impact accuracy if scales differ significantly.", "title": "Allow Different Input Scales", "type": "boolean" }, "reverse_interleaved_input_scales": { "default": true, "description": "If True, assumes incoming NVFP4 block scales are already interleaved (as produced by quantization load_hook), applies block_scale_interleave_reverse before TRTLLM-Gen shuffle+interleave. Only used when backend='trtllm_gen'.", "title": "Reverse Interleaved Input Scales", "type": "boolean" }, "enable_trtllm_gen_internal_routing": { "default": true, "description": "If True and backend='trtllm_gen', pass router logits and routing bias directly to TRTLLM-Gen MoE when the routing tensors come from trtllm.noaux_tc_op.", "title": "Enable Trtllm Gen Internal Routing", "type": "boolean" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) 
stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
allow_different_input_scales (bool)
backend (Literal['cutlass', 'trtllm_gen'])
enable_trtllm_gen_internal_routing (bool)
reverse_interleaved_input_scales (bool)
- field allow_different_input_scales: bool = False
If False (default), assert that all experts have identical input scales and fail if not. If True, allow different per-expert input scales by using min(input_scale) for quantization. Note: NVFP4 uses min() (not max like FP8) because scales are in kernel format (2688/amax): smaller scale = larger amax = larger dynamic range. This may impact accuracy if scales differ significantly.
- field backend: Literal['cutlass', 'trtllm_gen'] = 'cutlass'
Backend to use for NVFP4 MoE computation (‘cutlass’ or ‘trtllm_gen’).
- field enable_trtllm_gen_internal_routing: bool = True
If True and backend='trtllm_gen', pass router logits and routing bias directly to TRTLLM-Gen MoE when the routing tensors come from trtllm.noaux_tc_op.
- field reverse_interleaved_input_scales: bool = True
If True, assumes incoming NVFP4 block scales are already interleaved (as produced by the quantization load_hook) and applies block_scale_interleave_reverse before the TRTLLM-Gen shuffle+interleave. Only used when backend='trtllm_gen'.
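The reasoning behind min(input_scale) in allow_different_input_scales can be checked with a little arithmetic. The sketch below is plain Python with made-up amax values. The helper names are hypothetical and not part of TensorRT-LLM. It only assumes, as the field description states, that each expert's scale is stored in kernel format as 2688/amax.

```python
# Illustrative only: helper names and amax values are assumptions, not library code.
KERNEL_SCALE_NUMERATOR = 2688.0  # per the field description: scale = 2688 / amax


def kernel_scale(amax: float) -> float:
    """Convert an activation amax into the kernel-format input scale."""
    return KERNEL_SCALE_NUMERATOR / amax


def shared_input_scale(per_expert_scales):
    """Pick one scale for all experts: the smallest scale covers the largest amax."""
    return min(per_expert_scales)


amax_per_expert = [2.0, 4.0, 8.0]
scales = [kernel_scale(a) for a in amax_per_expert]

shared = shared_input_scale(scales)
# Invert the shared scale to see which amax it corresponds to.
covered_amax = KERNEL_SCALE_NUMERATOR / shared
```

Here the shared scale maps back to the largest amax among the experts, so no expert's activations are clipped. Experts with a smaller amax are quantized more coarsely than they would be with their own scale, which is the accuracy impact the field description warns about.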
Fuse Allreduce Residual RMSNorm#
Transform key: fuse_allreduce_residual_rmsnorm
Source module: tensorrt_llm._torch.auto_deploy.transform.library.collectives
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.collectives.FuseAllreduceResidualRMSNorm(
- config: TransformConfig,
Bases: BaseTransform
Fuse (allreduce + residual add + RMSNorm) into one fused op with tuple output.
This transform only applies when TRT-LLM ops are used (MPI mode), as it provides optimized fused kernels. The torch backend (demollm mode) does not benefit from this fusion and uses unfused operations.
Note: This transform expects torch_rmsnorm ops in the graph, which are created by the match_rmsnorm_pattern transform that runs earlier in the pipeline.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse RMSNorm#
Transform key: fuse_rmsnorm
Source module: tensorrt_llm._torch.auto_deploy.transform.library.rms_norm
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.rms_norm.FuseRMSNorm(
- config: TransformConfig,
Bases: BaseTransform
Fuses torch_rmsnorm ops with the selected backend implementation.
This transform runs in the post_load_fusion stage and replaces torch_rmsnorm ops with the specified backend implementation (flashinfer, triton, or torch).
- Parameters:
gm – Input graph module to transform.
rmsnorm_backend – Backend to use for regular RMSNorm computation (“flashinfer”, “triton”, or “torch”).
gated_rmsnorm_backend – Backend to use for gated RMSNorm computation (currently only “triton”).
- Returns:
Transformed graph module with backend-specific RMSNorm operations.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.rms_norm.FuseRMSNormConfig
Bases: TransformConfig
Configuration for the RMSNorm fusion transform.
Show JSON schema
{ "title": "FuseRMSNormConfig", "description": "Configuration for the RMSNorm fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. 
None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "rmsnorm_backend": { "default": "flashinfer", "description": "Backend to use for RMSNorm computation ('flashinfer', 'triton', or 'torch').", "title": "Rmsnorm Backend", "type": "string" }, "gated_rmsnorm_backend": { "default": "triton", "description": "Backend to use for gated RMSNorm computation (currently only 'triton').", "title": "Gated Rmsnorm Backend", "type": "string" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
gated_rmsnorm_backend (str)
rmsnorm_backend (str)
- field gated_rmsnorm_backend: str = 'triton'
Backend to use for gated RMSNorm computation (currently only ‘triton’).
- field rmsnorm_backend: str = 'flashinfer'
Backend to use for RMSNorm computation (‘flashinfer’, ‘triton’, or ‘torch’).
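As a rough illustration of how these two fields combine with their defaults, the sketch below models a fuse_rmsnorm entry as a plain Python dict. The helper check_fuse_rmsnorm_entry is hypothetical and not part of AutoDeploy. The allowed values are taken from the field descriptions above. The JSON schema types both fields as plain strings, so this check is stricter than the schema itself.

```python
# Hypothetical helper for illustration; AutoDeploy validates configs with pydantic.
RMSNORM_BACKENDS = ("flashinfer", "triton", "torch")   # from the field description
GATED_RMSNORM_BACKENDS = ("triton",)                   # "currently only 'triton'"


def check_fuse_rmsnorm_entry(entry: dict) -> dict:
    """Fill in the documented defaults and reject backend names not listed above."""
    resolved = {
        "rmsnorm_backend": "flashinfer",
        "gated_rmsnorm_backend": "triton",
        **entry,  # user-provided values override the defaults
    }
    if resolved["rmsnorm_backend"] not in RMSNORM_BACKENDS:
        raise ValueError(f"unknown rmsnorm_backend: {resolved['rmsnorm_backend']!r}")
    if resolved["gated_rmsnorm_backend"] not in GATED_RMSNORM_BACKENDS:
        raise ValueError(
            f"unknown gated_rmsnorm_backend: {resolved['gated_rmsnorm_backend']!r}"
        )
    return resolved


# Switch regular RMSNorm to Triton and leave the gated backend at its default.
entry = {"stage": "post_load_fusion", "rmsnorm_backend": "triton"}
resolved = check_fuse_rmsnorm_entry(entry)
```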
Fuse RMSNorm Quant NVFP4#
Transform key: fuse_rmsnorm_quant_nvfp4
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_rmsnorm_quant_nvfp4
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_rmsnorm_quant_nvfp4.FuseRMSNormQuantNVFP4(
- config: TransformConfig,
Bases: BaseTransform
Fuse NVFP4 quantization into RMSNorm producers where TRT-LLM kernels exist.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse GDN Gating#
Transform key: fuse_gdn_gating
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_gdn_gating
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_gdn_gating.FuseGdnGating(
- config: TransformConfig,
Bases: BaseTransform
Replaces torch_fused_gdn_gating ops with triton_fused_gdn_gating.
This transform runs in the post_load_fusion stage and swaps the pure-torch source op with a single-kernel Triton implementation, eliminating ~5 kernel launches per GDN layer.
- Parameters:
gm – Input graph module to transform.
- Returns:
Transformed graph module with Triton-fused GDN gating operations.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse L2Norm#
Transform key: fuse_l2norm
Source module: tensorrt_llm._torch.auto_deploy.transform.library.l2_norm
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.l2_norm.FuseL2Norm(
- config: TransformConfig,
Bases: BaseTransform
Fuses torch_l2norm ops with the selected backend implementation.
This transform runs in the post_load_fusion stage and replaces torch_l2norm ops with the specified backend implementation (fla or torch).
- Parameters:
gm – Input graph module to transform.
backend – Backend to use for L2Norm computation (“fla” or “torch”).
- Returns:
Transformed graph module with backend-specific L2Norm operations.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.l2_norm.FuseL2NormConfig
Bases: TransformConfig
Configuration for the L2Norm fusion transform.
Show JSON schema
{ "title": "FuseL2NormConfig", "description": "Configuration for the L2Norm fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. 
None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "backend": { "default": "fla", "description": "Backend to use for L2Norm computation ('fla' or 'torch').", "enum": [ "torch", "fla" ], "title": "Backend", "type": "string" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
backend (Literal['torch', 'fla'])
- field backend: Literal['torch', 'fla'] = 'fla'
Backend to use for L2Norm computation (‘fla’ or ‘torch’).
Fuse SiLU Mul#
Transform key: fuse_silu_mul
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_silu_mul
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_silu_mul.FuseSiluMul(
- config: TransformConfig,
Bases: BaseTransform
Fuse narrow+silu+mul into a single silu_and_mul op after GEMM fusion.
- Detects the pattern:
  gate = narrow(x, -1, 0, size)
  up = narrow(x, -1, size, size)
  hidden = silu(gate) * up
- And replaces it with:
  hidden = silu_and_mul(x)
When backend='trtllm', also detects if the sole consumer is an FP8 linear and fuses quantization into the kernel (eliminating a separate scaleMatrix pass).
This runs as a post_load_fusion pass, after GEMM fusion has combined gate+up projections.
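The equivalence between the detected pattern and its replacement can be sketched in plain Python on a single row. This is a reference for the semantics only, not the fused kernel. It assumes that silu_and_mul treats the first half of the last dimension as the gate and the second half as the up projection, matching the narrow offsets in the pattern. The helper names are illustrative.

```python
import math


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def narrow_last(row, start, length):
    """Stand-in for narrow(x, -1, start, length) on a 1-D row."""
    return row[start:start + length]


def unfused(row, size):
    """The three-op pattern that the transform detects."""
    gate = narrow_last(row, 0, size)
    up = narrow_last(row, size, size)
    return [silu(g) * u for g, u in zip(gate, up)]


def silu_and_mul(row):
    """Assumed reference semantics of the fused op: silu(first half) * second half."""
    size = len(row) // 2
    return [silu(row[i]) * row[size + i] for i in range(size)]


# One row of a fused gate+up GEMM output with size = 3.
x = [0.5, -1.0, 2.0, 3.0, 4.0, -2.0]
unfused_out = unfused(x, 3)
fused_out = silu_and_mul(x)
```

Both paths perform the same arithmetic per element, so the results match exactly. The fused version avoids materializing the two narrowed views and the intermediate silu output.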
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_silu_mul.FuseSiluMulConfig
Bases: TransformConfig
Configuration for the SiLU+Mul fusion transform.
Show JSON schema
{ "title": "FuseSiluMulConfig", "description": "Configuration for the SiLU+Mul fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "backend": { "default": "flashinfer", "description": "Backend for fused SiLU+Mul kernel. 
'flashinfer' (default) or 'trtllm' (faster, supports fused FP8 quant).", "title": "Backend", "type": "string" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
backend (str)
- field backend: str = 'flashinfer'
Backend for fused SiLU+Mul kernel. ‘flashinfer’ (default) or ‘trtllm’ (faster, supports fused FP8 quant).
Fuse SwiGLU#
Transform key: fuse_swiglu
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.FuseSwiGLU(
- config: TransformConfig,
Bases: BaseTransform
Fuses torch_swiglu_mlp ops by concatenating gate and up weights.
This transform runs in the post_load_fusion stage and replaces torch_swiglu_mlp ops with fused_swiglu_mlp ops that use a single concatenated gate+up weight matrix.
This reduces memory traffic and kernel launches: a single matmul reads the input activations once, instead of two separate matmuls for the gate and up projections each reading them.
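The weight concatenation can be illustrated with a small pure-Python example. It assumes the usual linear-layer layout of [out_features, in_features], so the gate and up weights are stacked along the output dimension. The matvec helper and the numbers are illustrative and do not reflect how fused_swiglu_mlp is implemented.

```python
def matvec(weight, x):
    """Multiply an [out_features, in_features] weight (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


gate_w = [[1.0, 0.0], [0.0, 2.0]]
up_w = [[0.5, 0.5], [1.0, -1.0]]
x = [3.0, 4.0]

# Unfused: two matmuls, each reading x.
gate_sep = matvec(gate_w, x)
up_sep = matvec(up_w, x)

# Fused: stack the rows once (at load time), then run a single matmul.
fused_w = gate_w + up_w
fused_out = matvec(fused_w, x)

# Splitting the fused output recovers the two projections.
n = len(gate_w)
gate_fused, up_fused = fused_out[:n], fused_out[n:]
```

Because the concatenation happens on loaded weights, it only needs to be done once, which is why this transform belongs to the post_load_fusion stage.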
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.FuseSwiGLUConfig
Bases: TransformConfig
Configuration for the SwiGLU fusion transform.
Show JSON schema
{ "title": "FuseSwiGLUConfig", "description": "Configuration for the SwiGLU fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable SwiGLU fusion.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) 
stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
enabled (bool)
- field enabled: bool = True
Whether to enable SwiGLU fusion.
Fuse Add Rms Norm#
Transform key: fuse_add_rms_norm
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_add_rms_norm
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fused_add_rms_norm.FuseAddRMSNorm(
- config: TransformConfig,
Bases: BaseTransform
Fuse (add + optional cast + RMSNorm) into one fused op.
Uses direct FX graph manipulation instead of the inductor pattern matcher to correctly handle patterns where intermediate nodes (add, rms_norm) have multiple users in the graph.
- Pattern 1 (without cast):
  %add = aten.add(%x, %residual)
  %norm = flashinfer_rms_norm(%add, %weight, eps)
- Pattern 2 (with cast):
  %add = aten.add(%x, %residual)
  %cast = aten.to.dtype(%add, bfloat16)
  %norm = flashinfer_rms_norm(%cast, %weight, eps)
- Both are replaced with:
  %fused = flashinfer_fused_add_rms_norm(%x, %residual, %weight, eps)
  %norm_out = getitem(%fused, 0)  # norm result (replaces %norm)
  %add_out = getitem(%fused, 1)  # add result (replaces %add)
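The tuple output matters because the add result usually has a second user: it becomes the residual for the next layer. A minimal pure-Python reference, assuming the (norm, add) ordering shown in the replacement pattern and ignoring the optional dtype cast, might look like the following. The function bodies are a sketch of the semantics and not the FlashInfer kernel.

```python
import math


def rms_norm(x, weight, eps):
    """Reference RMSNorm on a single vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def fused_add_rms_norm(x, residual, weight, eps):
    """Reference for the fused op: returns (norm result, add result)."""
    added = [a + b for a, b in zip(x, residual)]
    return rms_norm(added, weight, eps), added


x = [1.0, 2.0, 3.0, 4.0]
residual = [0.5, -0.5, 1.0, -1.0]
weight = [1.0, 1.0, 2.0, 2.0]
eps = 1e-6

# Unfused graph: %add feeds both the norm and the next residual connection.
add = [a + b for a, b in zip(x, residual)]
norm = rms_norm(add, weight, eps)

# Fused graph: both users are rewired to getitem outputs of the fused node.
fused = fused_add_rms_norm(x, residual, weight, eps)
norm_out, add_out = fused[0], fused[1]
```

A rewrite that replaced only the norm node would leave the original add alive for its other users, which is the multi-user situation the direct FX graph manipulation is meant to handle.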
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse RMSNorm Quant FP8#
Transform key: fuse_rmsnorm_quant_fp8
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_rmsnorm_quant_fp8
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_rmsnorm_quant_fp8.FuseRMSNormQuantFP8(
- config: TransformConfig,
Bases: BaseTransform
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Gather Logits Before Lm Head#
Transform key: gather_logits_before_lm_head
Source module: tensorrt_llm._torch.auto_deploy.transform.library.gather_logits_before_lm_head
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.gather_logits_before_lm_head.GatherLogitsBeforeLmHeadTransform(
- config: TransformConfig,
Bases: BaseTransform
Transform to gather hidden states before LM head using logits_gather_mask.
This transform inserts a gather operation before the LM head linear layer to select only the hidden states that need logits computed. In decode-only batches, the output is always [b, hidden_size] for CUDA graph compatibility.
Benefits:
- Reduces computation by only computing logits for needed tokens
- Eliminates Python loop overhead
- Enables CUDA graph capture of the gather
- Moves gather into the graph for better optimization
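The computational saving can be sketched in plain Python: gathering first and then applying the LM head gives the same logits as applying the LM head to every token and selecting afterwards, but with fewer rows multiplied. The boolean-list mask and the helper names are simplifications for illustration. The actual layout of logits_gather_mask and the fixed-shape handling for CUDA graphs are not modeled here.

```python
def lm_head(hidden, weight):
    """Project one hidden state ([hidden_size]) with a [vocab, hidden_size] weight."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def gather(rows, mask):
    """Keep only the rows whose mask entry is True."""
    return [row for row, keep in zip(rows, mask) if keep]


hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]  # 4 tokens
gather_mask = [False, True, False, True]                            # 2 need logits
lm_head_weight = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]]              # vocab of 3

# Gather before the LM head: only 2 rows go through the projection.
logits_gather_first = [
    lm_head(h, lm_head_weight) for h in gather(hidden_states, gather_mask)
]

# Baseline: project all 4 rows, then discard the ones that are not needed.
logits_all = [lm_head(h, lm_head_weight) for h in hidden_states]
logits_select_after = gather(logits_all, gather_mask)
```

With a realistic vocabulary size the LM head is one of the largest matmuls in the model, so skipping unneeded rows during prefill is where most of the saving comes from.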
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Fuse RoPE Into TRT-LLM Attention#
Transform key: fuse_rope_into_trtllm_attention
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_rope_into_trtllm_attention
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_rope_into_trtllm_attention.FuseRopeIntoTrtllmAttention(
- config: TransformConfig,
Bases: BaseTransform
Fuse RoPE into trtllm attention by rewiring Q/K and storing rope metadata.
Runs at post_load_fusion before optimize_rope, matching the backend-agnostic torch_rope_* IR ops directly with real (non-meta) weights. DCE after this transform removes dead rope nodes; optimize_rope then handles remaining torch_rope_* for non-trtllm backends.
Stores the thop-format cos_sin tensor in attn_node.meta. TrtllmAttention.prepare_node_for_cache_insertion at cache_init materializes it as a graph node.
Disabled by default; enable in model configs that use attn_backend: trtllm.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_rope_into_trtllm_attention.FuseRopeIntoTrtllmAttentionConfig
Bases: TransformConfig
Configuration for fuse_rope_into_trtllm_attention.
Show JSON schema
{ "title": "FuseRopeIntoTrtllmAttentionConfig", "description": "Configuration for ``fuse_rope_into_trtllm_attention``.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. 
None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "fuse_qkv_passthrough": { "default": true, "description": "When the pre-RoPE Q/K/V trace back to a single fused QKV GEMM, rewire all three to that flat tensor so ``trtllm_mha_with_cache`` can skip the per-layer split \u2192 reshape \u2192 cat path.", "title": "Fuse Qkv Passthrough", "type": "boolean" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
fuse_qkv_passthrough (bool)
- field fuse_qkv_passthrough: bool = True
When the pre-RoPE Q/K/V trace back to a single fused QKV GEMM, rewire all three to that flat tensor so trtllm_mha_with_cache can skip the per-layer split → reshape → cat path.
Fuse RoPE Into TRT-LLM MLA#
Transform key: fuse_rope_into_trtllm_mla
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_rope_mla
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_rope_mla.FuseRopeIntoTrtllmMLA(
- config: TransformConfig,
Bases: BaseTransform
Fuse RoPE into TRT-LLM MLA attention for decode performance.
Runs at post_load_fusion before optimize_rope, matching the backend-agnostic torch_rope_* IR ops on torch_mla source nodes. Rewires q_pe/kpe to pre-RoPE inputs and stashes the rotary_cos_sin tensor in node.meta for later materialization at cache_init by TrtllmMLAAttention.prepare_node_for_cache_insertion.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Optimize RoPE#
Transform key: optimize_rope
Source module: tensorrt_llm._torch.auto_deploy.transform.library.rope
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.rope.OptimizeRope(
- config: TransformConfig,
Bases: BaseTransform
Scan the FX graph and replace calls to the torch-reference RoPE ops with optimized kernels:
- torch_rope_with_explicit_cos_sin → flashinfer_rope
- torch_rope_with_complex_freqs → flashinfer_rope
- torch_rope_with_qk_interleaving → triton_rope_on_interleaved_qk_inputs
Precomputes positional IDs and the fused cosine-sine cache as explicit nodes, and reuses those nodes when possible.
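The idea of precomputing a cosine-sine cache once and reusing it can be sketched in plain Python. The example below uses the rotate-half (non-interleaved) RoPE formulation with the conventional base of 10000. The function names, the cache layout, and the base are assumptions for illustration and do not describe the layout that flashinfer_rope expects.

```python
import math


def build_cos_sin_cache(max_pos, head_dim, base=10000.0):
    """Precompute (cos, sin) lists per position; built once and shared by callers."""
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    cache = []
    for pos in range(max_pos):
        angles = [pos * f for f in inv_freq]
        cache.append(([math.cos(a) for a in angles], [math.sin(a) for a in angles]))
    return cache


def apply_rope(vec, cos, sin):
    """Rotate-half RoPE on one head vector: x * cos + rotate_half(x) * sin."""
    half = len(vec) // 2
    x1, x2 = vec[:half], vec[half:]
    first = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return first + second


cache = build_cos_sin_cache(max_pos=8, head_dim=4)

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -0.5, 1.5, -1.5]

# The same cache entry serves both q and k at position 3, in every layer.
cos3, sin3 = cache[3]
q_rot = apply_rope(q, cos3, sin3)
k_rot = apply_rope(k, cos3, sin3)
```

Since the cache depends only on position and head dimension, one explicit node can serve every attention layer, which is the reuse the class description refers to.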
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
MLIR Elementwise Fusion#
Transform key: mlir_elementwise_fusion
Source module: tensorrt_llm._torch.auto_deploy.transform.library.mlir_elementwise_fusion
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.mlir_elementwise_fusion.MLIRElementwiseFusion(
- config: TransformConfig,
Bases: BaseTransform
Unified MLIR elementwise fusion: decompose + discover + codegen + replace.
This transform:
1. Converts the FX graph to MLIR (xDSL) using the ad dialect
2. Decomposes high-level ops into elementwise primitives
3. Discovers maximal fusible subgraphs
4. Generates Triton kernels for each discovered subgraph
5. Replaces subgraph ops in MLIR with fused opaque ops
6. Converts MLIR back to FX with generated kernel calls
Requires pip install xdsl. Skipped gracefully if xDSL is not installed.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
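The discovery step in the class description (finding maximal fusible subgraphs) can be illustrated with a deliberately simplified sketch. It works on a straight-line list of op names rather than a dataflow graph in MLIR, and the set of elementwise ops is made up for the example. None of the names below come from the transform's source.

```python
# Illustrative set; the real transform decides fusibility on decomposed MLIR ops.
ELEMENTWISE = {"add", "mul", "sub", "neg", "sigmoid", "exp"}


def find_fusible_runs(ops, min_len=2):
    """Return index lists of maximal consecutive elementwise runs of at least min_len."""
    runs, current = [], []
    for idx, op in enumerate(ops):
        if op in ELEMENTWISE:
            current.append(idx)  # extend the current run as far as possible
            continue
        # A non-elementwise op ends the run; keep it only if worth fusing.
        if len(current) >= min_len:
            runs.append(current)
        current = []
    if len(current) >= min_len:
        runs.append(current)
    return runs


ops = ["matmul", "add", "sigmoid", "mul", "matmul", "neg", "softmax", "exp", "add"]
runs = find_fusible_runs(ops)
```

Each run found this way would correspond to one generated kernel. The lone neg between matmul and softmax is skipped because fusing a single op saves no kernel launch.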
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.mlir_elementwise_fusion.MLIRElementwiseFusionConfig
Bases: TransformConfig
Configuration for the MLIR elementwise fusion transform.
Show JSON schema
{ "title": "MLIRElementwiseFusionConfig", "description": "Configuration for the MLIR elementwise fusion transform.", "type": "object", "properties": { "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." }, "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" }, "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" }, "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" }, "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" }, "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" }, "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" }, "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" }, "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. 
None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" }, "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" }, "bypass_ops": { "description": "Op names to skip during decomposition (reserved for future use).", "items": { "type": "string" }, "title": "Bypass Ops", "type": "array" } }, "$defs": { "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" } }, "additionalProperties": true, "required": [ "stage" ] }
- Config:
extra: str = allow
- Fields:
bypass_ops (List[str])
- field bypass_ops: List[str] [Optional]
Op names to skip during decomposition (reserved for future use).