Post-Load Fusion Stage#

Post-load fusion applies performance optimizations that need loaded weights, device tensors, or the final post-sharding graph structure. This stage includes kernel fusions for quantized linear layers, MoE, normalization, activation, RoPE, and related inference patterns.

Fuse Gemms Mixed Children#

Transform key: fuse_gemms_mixed_children

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fusion

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fusion.FuseGemmsMixedChildren(config: TransformConfig)

Bases: BaseTransform

Fuse linear projections sharing the same input, even when the parent has non-linear users (e.g., shape access).

This is a relaxed variant of FuseGemms: it does NOT require all children of the parent to be linear ops — only that at least 2 linear children exist. The fused output is split via torch.narrow (zero-copy view).

Handles both non-quantized and quantized (FP8, FP4) linear ops. Nodes are grouped by (parent, quantization scheme) so only linears with the same parent AND the same op target are fused together.
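
The grouping rule and the split of the fused output can be illustrated with a dependency-free sketch. It is not the transform's implementation: `LinearNode`, `group_fusable`, `linear`, and `fused_linear` are hypothetical names, plain Python lists stand in for tensors, and list slicing copies data, whereas torch.narrow returns a view.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearNode:
    """Hypothetical stand-in for an FX linear node (not an AutoDeploy type)."""
    name: str
    parent: str      # name of the node that produces the shared input
    op_target: str   # e.g. "linear", "fp8_linear", "fp4_linear"
    weight: tuple    # rows of an [out_features, in_features] matrix


def group_fusable(nodes):
    """Group by (parent, op_target) and keep groups with at least 2 linears."""
    groups = defaultdict(list)
    for node in nodes:
        groups[(node.parent, node.op_target)].append(node)
    return {key: members for key, members in groups.items() if len(members) >= 2}


def linear(x, weight):
    """Compute y = x @ weight.T for a single input vector x."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


def fused_linear(x, members):
    """Run one GEMM on the row-concatenated weights, then slice per child."""
    fused_weight = [row for member in members for row in member.weight]
    fused_out = linear(x, fused_weight)
    outputs, start = {}, 0
    for member in members:
        length = len(member.weight)
        # Plays the role of torch.narrow(fused_out, -1, start, length),
        # except that a Python slice copies instead of returning a view.
        outputs[member.name] = fused_out[start:start + length]
        start += length
    return outputs


nodes = [
    LinearNode("q_proj", "hidden", "linear", ((1, 0), (0, 1))),
    LinearNode("k_proj", "hidden", "linear", ((2, 2),)),
    LinearNode("gate_fp8", "hidden", "fp8_linear", ((3, 3),)),  # lone FP8 child
]
groups = group_fusable(nodes)
x = [5, 7]
fused = fused_linear(x, groups[("hidden", "linear")])
```

In this sketch the lone FP8 child is left alone because its (parent, op target) group has only one member, while the two unquantized projections share a single fused GEMM.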

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse Gemms#

Transform key: fuse_gemms

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fusion

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fusion.FuseGemms(config: TransformConfig)

Bases: BaseTransform

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse FP4 Gemms#

Transform key: fuse_fp4_gemms

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fusion

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fusion.FuseFP4Gemms(config: TransformConfig)

Bases: QuantizationFusionMixin, BaseTransform

build_custom_args_for_linear(scale_getattrs: Dict[str, Node]) → Tuple[object, ...]

Return the positional tail after bias for the fused call.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse FP8 Gemms#

Transform key: fuse_fp8_gemms

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fusion

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fusion.FuseFP8Gemms(config: TransformConfig)

Bases: QuantizationFusionMixin, BaseTransform

build_custom_args_for_linear(scale_getattrs: Dict[str, Node]) → Tuple[object, ...]

Return the positional tail after bias for the fused call.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse FP8 Linear#

Transform key: fuse_fp8_linear

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseFP8Linear(config: TransformConfig)

Bases: BaseTransform

Matches and replaces FP8 fake quantized linear ops with fused torch backend ops.

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseFP8LinearConfig

Bases: TransformConfig

Configuration for FP8 linear fusion transform.

JSON schema:
{
   "title": "FuseFP8LinearConfig",
   "description": "Configuration for FP8 linear fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "default": "torch",
         "description": "Backend to use for FP8 linear computation (default: 'torch').",
         "title": "Backend",
         "type": "string"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • backend (str)

field backend: str = 'torch'

Backend to use for FP8 linear computation (default: ‘torch’).
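
As a rough illustration of how the required stage field and the ordered Stages enum could be used to pre-order transforms, the following sketch validates a few entries and sorts them by stage. The `Stages` values and their order are taken from the schema above; the `transforms` mapping, the two `example_*` keys, and `order_by_stage` are hypothetical and do not reproduce the AutoDeploy config loader.

```python
from enum import Enum


class Stages(Enum):
    """Pipeline stages, declared in the order listed in the JSON schema."""
    FACTORY = "factory"
    EXPORT = "export"
    POST_EXPORT = "post_export"
    PATTERN_MATCHER = "pattern_matcher"
    SHARDING = "sharding"
    WEIGHT_LOAD = "weight_load"
    POST_LOAD_FUSION = "post_load_fusion"
    CACHE_INIT = "cache_init"
    VISUALIZE = "visualize"
    COMPILE = "compile"


# Enum members iterate in definition order, which gives each stage its rank.
STAGE_ORDER = {stage.value: rank for rank, stage in enumerate(Stages)}

# Hypothetical parsed config: transform key -> fields set for that entry.
transforms = {
    "example_compile": {"stage": "compile"},
    "fuse_fp8_linear": {"stage": "post_load_fusion", "backend": "torch"},
    "example_sharding": {"stage": "sharding"},
}


def order_by_stage(entries):
    """Check the required 'stage' field, then sort transform keys by stage rank."""
    for key, fields in entries.items():
        if "stage" not in fields:
            raise ValueError(f"{key}: 'stage' is required")
        if fields["stage"] not in STAGE_ORDER:
            raise ValueError(f"{key}: unknown stage {fields['stage']!r}")
    return sorted(entries, key=lambda key: STAGE_ORDER[entries[key]["stage"]])


ordered = order_by_stage(transforms)
```

Under these assumptions, fuse_fp8_linear is placed after the sharding-stage entry and before the compile-stage entry, regardless of the order in which the entries were written.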

Fuse TRT-LLM Attn Quant FP8#

Transform key: fuse_trtllm_attn_quant_fp8

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_trtllm_attention_quant_fp8

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_trtllm_attention_quant_fp8.FuseTrtllmAttentionQuantFP8(config: TransformConfig)

Bases: BaseTransform

Prepares the attention -> FP8 linear path so that TRT-LLM attention can emit FP8 directly.

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse NVFP4 Linear#

Transform key: fuse_nvfp4_linear

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseNVFP4Linear(config: TransformConfig)

Bases: BaseTransform

Matches and replaces NVFP4 fake quantized linear ops with fused TensorRT-LLM ops.

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseNVFP4LinearConfig

Bases: TransformConfig

Configuration for NVFP4 linear fusion transform.

JSON schema:
{
   "title": "FuseNVFP4LinearConfig",
   "description": "Configuration for NVFP4 linear fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "default": "trtllm",
         "description": "Backend to use for NVFP4 linear computation (default: 'trtllm').",
         "title": "Backend",
         "type": "string"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • backend (str)

field backend: str = 'trtllm'

Backend to use for NVFP4 linear computation (default: ‘trtllm’).

Fuse Relu2 Quant NVFP4#

Transform key: fuse_relu2_quant_nvfp4

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_relu2_quant_nvfp4

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_relu2_quant_nvfp4.FuseRelu2QuantNVFP4(config: TransformConfig)

Bases: BaseTransform

Fuse matcher-supported ReLU² + NVFP4 quantization patterns.

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse NVFP4 SwiGLU#

Transform key: fuse_nvfp4_swiglu

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.FuseNVFP4SwiGLU(config: TransformConfig)

Bases: BaseTransform

Fuses torch_nvfp4_swiglu_mlp ops by concatenating gate and up FP4 weights.

This transform runs in the post_load_fusion stage and replaces torch_nvfp4_swiglu_mlp ops with fused_nvfp4_swiglu_mlp ops that use a single concatenated gate+up weight matrix.

FP4 weight fusion:
  • gate+up packed weights are concatenated along dim=0
  • gate+up per-block weight scales are concatenated along dim=0
  • gate+up input_scale and alpha must match (shared input)
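
The fusion rules above can be sketched without any tensor library. This is an illustrative sketch only: `fuse_gate_up` and the dictionary layout are hypothetical, nested lists stand in for the packed FP4 tensors, and list concatenation plays the role of concatenation along dim=0.

```python
def fuse_gate_up(gate, up):
    """Concatenate gate and up along dim 0 after checking the shared scalars.

    `gate` and `up` are hypothetical dicts with keys "weight" and
    "weight_scale" (lists of rows) plus "input_scale" and "alpha" (floats).
    """
    for key in ("input_scale", "alpha"):
        if gate[key] != up[key]:
            raise ValueError(f"gate and up {key} differ: {gate[key]} vs {up[key]}")
    return {
        # Row-wise concatenation: gate rows first, then up rows.
        "weight": gate["weight"] + up["weight"],
        "weight_scale": gate["weight_scale"] + up["weight_scale"],
        # Both projections read the same input, so one copy is kept.
        "input_scale": gate["input_scale"],
        "alpha": gate["alpha"],
    }


gate = {"weight": [[1, 2], [3, 4]], "weight_scale": [[0.5]], "input_scale": 0.1, "alpha": 2.0}
up = {"weight": [[5, 6], [7, 8]], "weight_scale": [[0.25]], "input_scale": 0.1, "alpha": 2.0}
fused = fuse_gate_up(gate, up)
```

Raising an error on mismatched input_scale or alpha is a choice made for this sketch; how the real transform reacts to a mismatch is not stated here.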

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse Finegrained FP8 SwiGLU#

Transform key: fuse_finegrained_fp8_swiglu

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.FuseFineGrainedFP8SwiGLU(config: TransformConfig)

Bases: BaseTransform

Fuses torch_finegrained_fp8_swiglu_mlp ops by concatenating gate and up FP8 weights.

This transform runs in the post_load_fusion stage and replaces torch_finegrained_fp8_swiglu_mlp ops with one of:

  • fused_finegrained_fp8_swiglu_mlp — default FP32 per-block scale path using trtllm_finegrained_fp8_linear internally.

  • fused_finegrained_fp8_deepgemm_swiglu_mlp — Blackwell (SM100f) UE8M0 path using trtllm_fp8_deepgemm internally. Selected at compile time when the concatenated gate+up and down weight scales are UE8M0 packed int (set by FineGrainedFP8LinearQuantization.post_load_hook).

FP8 weight fusion:
  • gate+up FP8 weights are concatenated along dim=0: [N, K] -> [2N, K]
  • gate+up per-block weight scales are concatenated along dim=0: [N/128, K/128] -> [2N/128, K/128]
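
The shape arithmetic above can be checked with a small helper. This is a sketch under one assumption that is not stated in the source: N and K are exact multiples of the 128 block size. `fused_shapes` is a hypothetical name.

```python
BLOCK = 128  # per-block scale granularity used in the shapes above


def fused_shapes(n, k, block=BLOCK):
    """Return (weight_shape, scale_shape) after gate+up concatenation on dim 0."""
    if n % block or k % block:
        # Assumption of this sketch: dimensions divide evenly into blocks.
        raise ValueError("N and K must be multiples of the block size")
    weight_shape = (2 * n, k)                    # [N, K] -> [2N, K]
    scale_shape = (2 * n // block, k // block)   # [N/128, K/128] -> [2N/128, K/128]
    return weight_shape, scale_shape


weight_shape, scale_shape = fused_shapes(1024, 4096)
```

For N = 1024 and K = 4096 the fused weight is [2048, 4096] and the fused scale tensor is [16, 32].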

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse Finegrained FP8 Linear#

Transform key: fuse_finegrained_fp8_linear

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseFineGrainedFP8Linear(config: TransformConfig)

Bases: BaseTransform

Matches and replaces FineGrained FP8 fake quantized linear ops with TRT-LLM ops.

Two-stage pipeline:
  1. Pattern matcher rewrites torch_fake_quant_finegrained_fp8_linear (HuggingFace triton kernel) to trtllm_finegrained_fp8_linear (TRT-LLM fp8_block_scaling_gemm with FP32 per-block scales).

  2. A compile-time dispatch pass further rewrites any nodes whose weight_scale buffer is UE8M0 packed int (produced by FineGrainedFP8LinearQuantization.post_load_hook on SM100f) to the dedicated trtllm_fp8_deepgemm op. Keeping the SM100f/UE8M0 path in a separate op avoids per-call hardware / dtype branching inside the runtime op.

Used for models like MiniMax M2 and DeepSeek that use HuggingFace’s FineGrained FP8 quantization format with 128x128 block sizes.
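
The two-stage pipeline can be pictured as two passes over a list of nodes. The op names below come from the description above; everything else is a simplification for illustration: `Fp8LinearNode`, `rewrite_pattern`, and `dispatch_ue8m0` are hypothetical, and a boolean flag stands in for inspecting the dtype of the real weight_scale buffer.

```python
from dataclasses import dataclass, replace

FAKE_QUANT_OP = "torch_fake_quant_finegrained_fp8_linear"
BLOCK_SCALE_OP = "trtllm_finegrained_fp8_linear"
DEEPGEMM_OP = "trtllm_fp8_deepgemm"


@dataclass(frozen=True)
class Fp8LinearNode:
    """Hypothetical stand-in for a graph node."""
    op: str
    scale_is_ue8m0_packed: bool = False  # stand-in for a weight_scale dtype check


def rewrite_pattern(nodes):
    """Stage 1: rewrite the fake-quant op to the FP32 per-block scale op."""
    return [
        replace(node, op=BLOCK_SCALE_OP) if node.op == FAKE_QUANT_OP else node
        for node in nodes
    ]


def dispatch_ue8m0(nodes):
    """Stage 2: move nodes with UE8M0 packed scales to the dedicated op."""
    return [
        replace(node, op=DEEPGEMM_OP)
        if node.op == BLOCK_SCALE_OP and node.scale_is_ue8m0_packed
        else node
        for node in nodes
    ]


graph = [
    Fp8LinearNode(FAKE_QUANT_OP),
    Fp8LinearNode(FAKE_QUANT_OP, scale_is_ue8m0_packed=True),
    Fp8LinearNode("aten.add"),
]
result = dispatch_ue8m0(rewrite_pattern(graph))
```

Because the choice is made once per node while the graph is rewritten, the selected op does not need to re-check hardware or dtype on every call, which matches the motivation given for keeping the two ops separate.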

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_quant.FuseFineGrainedFP8LinearConfig

Bases: TransformConfig

Configuration for FineGrained FP8 linear fusion transform.

JSON schema:
{
   "title": "FuseFineGrainedFP8LinearConfig",
   "description": "Configuration for FineGrained FP8 linear fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "default": "trtllm",
         "description": "Backend to use for FineGrained FP8 linear computation (default: 'trtllm').",
         "title": "Backend",
         "type": "string"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • backend (str)

field backend: str = 'trtllm'

Backend to use for FineGrained FP8 linear computation (default: ‘trtllm’).

Fuse MXFP4 MoE#

Transform key: fuse_mxfp4_moe

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4.FuseMXFP4Moe(config: TransformConfig)

Bases: BaseTransform

POST_LOAD_FUSION transform: GPU-side MXFP4 MoE weight prep for the trtllm-gen backend.

Runs after QuantizeMXFP4MOE registered raw HF MXFP4 buffers and the EP-slice load hook populated them. For each trtllm_quant_mxfp4_trtllm_gen_moe_fused node, calls prepare_trtllm_gen_moe_mxfp4_weights() on the loaded GPU tensors to produce the kernel layout, swaps the op args to the prepared params, and deletes the raw buffers.

Skipped when the op already references prepared params (idempotent).
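
The swap-and-delete behavior and its idempotence can be sketched as follows. This is a toy model, not the transform: `fuse_node`, `prepare_layout`, and the `_prepared` suffix are hypothetical, and reversing a list merely stands in for the real kernel-layout preparation done by prepare_trtllm_gen_moe_mxfp4_weights().

```python
SUFFIX = "_prepared"  # hypothetical naming scheme for prepared params


def prepare_layout(raw):
    """Hypothetical stand-in for the real weight preparation; reverses the data."""
    return list(reversed(raw))


def fuse_node(arg_names, buffers):
    """Return (new_arg_names, changed); calling it again on the result is a no-op."""
    new_args, changed = [], False
    for name in arg_names:
        if name.endswith(SUFFIX):
            # Already points at a prepared param, so there is nothing to do.
            new_args.append(name)
            continue
        prepared_name = name + SUFFIX
        # pop() removes the raw buffer once its prepared copy exists.
        buffers[prepared_name] = prepare_layout(buffers.pop(name))
        new_args.append(prepared_name)
        changed = True
    return new_args, changed


buffers = {"w13": [1, 2, 3], "w2": [4, 5]}
args, changed = fuse_node(["w13", "w2"], buffers)
args_again, changed_again = fuse_node(args, buffers)
```

The second call leaves both the argument list and the buffers untouched, which is the property that makes re-running the pass safe.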

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4.FuseMXFP4MoeConfig

Bases: TransformConfig

Configuration for fuse_mxfp4_moe (POST_LOAD_FUSION).

JSON schema:
{
   "title": "FuseMXFP4MoeConfig",
   "description": "Configuration for ``fuse_mxfp4_moe`` (POST_LOAD_FUSION).",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fuse MoE#

Transform key: fuse_moe

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseMoe(config: TransformConfig)

Bases: BaseTransform

Scan the FX graph and replace all calls to torch.ops.auto_deploy.torch_moe and torch.ops.auto_deploy.torch_moe_fused with torch.ops.auto_deploy.trtllm_moe_fused.

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseMoeConfig

Bases: TransformConfig

Configuration for MoE fusion transform.

JSON schema:
{
   "title": "FuseMoeConfig",
   "description": "Configuration for MoE fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "default": "auto",
         "description": "Backend to use for MoE computation ('auto', 'trtllm' or 'triton'. default: 'auto').",
         "title": "Backend",
         "type": "string"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • backend (str)

field backend: str = 'auto'

Backend to use for MoE computation (‘auto’, ‘trtllm’ or ‘triton’. default: ‘auto’).

Fuse FP8 MoE#

Transform key: fuse_fp8_moe

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseFP8Moe(config: TransformConfig)

Bases: BaseTransform

Stack per-expert FP8 MoE weights and scales to avoid runtime stacking overhead. This runs after weights are loaded, similar to FuseMoe for unquantized MoE.

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseFP8MoeConfig

Bases: TransformConfig

Configuration for FP8 MoE fusion transform.

JSON schema:
{
   "title": "FuseFP8MoeConfig",
   "description": "Configuration for FP8 MoE fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "default": "auto",
         "description": "Backend to use for FP8 MoE computation ('auto', 'trtllm' or 'triton'. default: 'auto').",
         "title": "Backend",
         "type": "string"
      },
      "allow_different_input_scales": {
         "default": false,
         "description": "If False (default), assert that all experts have identical input scales and fail if not. If True, allow different per-expert input scales by using max(input_scale) for quantization. This matches TRT-LLM manual backend behavior but may impact accuracy if scales differ significantly.",
         "title": "Allow Different Input Scales",
         "type": "boolean"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • allow_different_input_scales (bool)

  • backend (str)

field allow_different_input_scales: bool = False

If False (default), assert that all experts have identical input scales and fail if not. If True, allow different per-expert input scales by using max(input_scale) for quantization. This matches TRT-LLM manual backend behavior but may impact accuracy if scales differ significantly.

field backend: str = 'auto'

Backend to use for FP8 MoE computation ('auto', 'trtllm', or 'triton'; default: 'auto').
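The behaviour of allow_different_input_scales described above can be sketched in plain Python. This is only an illustration of the rule, not the transform's code: the helper name resolve_fp8_input_scale is hypothetical, scales are modelled as Python floats rather than tensors, and a ValueError stands in for the assertion mentioned in the field description.

```python
from typing import Sequence


def resolve_fp8_input_scale(
    input_scales: Sequence[float],
    allow_different_input_scales: bool = False,
) -> float:
    """Pick one shared input scale for all experts (hypothetical helper)."""
    if not input_scales:
        raise ValueError("expected at least one per-expert input scale")
    first = input_scales[0]
    if all(scale == first for scale in input_scales):
        # All experts agree, so there is nothing to reconcile.
        return first
    if not allow_different_input_scales:
        # Default behaviour per the field description: fail on mismatch.
        raise ValueError("experts have different input scales")
    # Per the field description, the FP8 path falls back to max(input_scale).
    return max(input_scales)


identical = resolve_fp8_input_scale([0.5, 0.5, 0.5])
relaxed = resolve_fp8_input_scale([0.25, 0.5, 0.125], allow_different_input_scales=True)
print(identical, relaxed)
```

Quantizing every expert's input with the maximum scale is what the field description warns about: experts whose own scale is much smaller lose resolution, which is why the option is off by default.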

Fuse Finegrained FP8 MoE#

Transform key: fuse_finegrained_fp8_moe

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseFineGrainedFP8Moe(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Stack per-expert FineGrainedFP8 MoE weights and block scales.

This transform replaces torch_quant_finegrained_fp8_moe ops with the fused trtllm_quant_finegrained_fp8_moe_fused kernel which is cudagraph-compatible.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse NVFP4 MoE#

Transform key: fuse_nvfp4_moe

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseNVFP4Moe(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Stack per-expert NVFP4 MoE weights and scales to avoid runtime stacking overhead. This runs after weights are loaded, similar to FuseFP8Moe.

classmethod get_config_class() Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.FuseNVFP4MoeConfig[source]

Bases: TransformConfig

Configuration for NVFP4 MoE fusion transform.

Show JSON schema
{
   "title": "FuseNVFP4MoeConfig",
   "description": "Configuration for NVFP4 MoE fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "default": "cutlass",
         "description": "Backend to use for NVFP4 MoE computation ('cutlass' or 'trtllm_gen').",
         "enum": [
            "cutlass",
            "trtllm_gen"
         ],
         "title": "Backend",
         "type": "string"
      },
      "allow_different_input_scales": {
         "default": false,
         "description": "If False (default), assert that all experts have identical input scales and fail if not. If True, allow different per-expert input scales by using min(input_scale) for quantization. Note: NVFP4 uses min() (not max like FP8) because scales are in kernel format (2688/amax): smaller scale = larger amax = larger dynamic range. This may impact accuracy if scales differ significantly.",
         "title": "Allow Different Input Scales",
         "type": "boolean"
      },
      "reverse_interleaved_input_scales": {
         "default": true,
         "description": "If True, assumes incoming NVFP4 block scales are already interleaved (as produced by quantization load_hook), applies block_scale_interleave_reverse before TRTLLM-Gen shuffle+interleave. Only used when backend='trtllm_gen'.",
         "title": "Reverse Interleaved Input Scales",
         "type": "boolean"
      },
      "enable_trtllm_gen_internal_routing": {
         "default": true,
         "description": "If True and backend='trtllm_gen', pass router logits and routing bias directly to TRTLLM-Gen MoE when the routing tensors come from trtllm.noaux_tc_op.",
         "title": "Enable Trtllm Gen Internal Routing",
         "type": "boolean"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • allow_different_input_scales (bool)

  • backend (Literal['cutlass', 'trtllm_gen'])

  • enable_trtllm_gen_internal_routing (bool)

  • reverse_interleaved_input_scales (bool)

field allow_different_input_scales: bool = False

If False (default), assert that all experts have identical input scales and fail if not. If True, allow different per-expert input scales by using min(input_scale) for quantization. Note: NVFP4 uses min() (not max like FP8) because scales are in kernel format (2688/amax): smaller scale = larger amax = larger dynamic range. This may impact accuracy if scales differ significantly.

field backend: Literal['cutlass', 'trtllm_gen'] = 'cutlass'

Backend to use for NVFP4 MoE computation (‘cutlass’ or ‘trtllm_gen’).

field enable_trtllm_gen_internal_routing: bool = True

If True and backend='trtllm_gen', pass router logits and routing bias directly to TRTLLM-Gen MoE when the routing tensors come from trtllm.noaux_tc_op.

field reverse_interleaved_input_scales: bool = True

If True, assume incoming NVFP4 block scales are already interleaved (as produced by the quantization load_hook) and apply block_scale_interleave_reverse before the TRTLLM-Gen shuffle+interleave. Only used when backend='trtllm_gen'.
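The reason NVFP4 uses min() rather than max() can be checked with a few lines of arithmetic. The sketch below takes the 2688/amax kernel format from the field description. The helper names are hypothetical, the amax values are made up for illustration, and a ValueError stands in for the assertion.

```python
from typing import Sequence

# From the field description: kernel-format scale = 2688 / amax.
NVFP4_KERNEL_SCALE_NUMERATOR = 2688.0


def amax_to_kernel_scale(amax: float) -> float:
    """Convert an activation amax to the kernel-format scale (hypothetical helper)."""
    return NVFP4_KERNEL_SCALE_NUMERATOR / amax


def resolve_nvfp4_input_scale(
    kernel_scales: Sequence[float],
    allow_different_input_scales: bool = False,
) -> float:
    """Pick one shared kernel-format input scale for all experts."""
    if not kernel_scales:
        raise ValueError("expected at least one per-expert input scale")
    first = kernel_scales[0]
    if all(scale == first for scale in kernel_scales):
        return first
    if not allow_different_input_scales:
        raise ValueError("experts have different input scales")
    # Smaller kernel scale = larger amax = larger dynamic range.
    return min(kernel_scales)


amaxes = [2.0, 8.0, 4.0]  # illustrative per-expert amax values
scales = [amax_to_kernel_scale(a) for a in amaxes]
shared = resolve_nvfp4_input_scale(scales, allow_different_input_scales=True)
print(scales, shared)
```

The shared scale is the one derived from the largest amax, so no expert's activations are clipped. The cost is coarser quantization for experts with a smaller range.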

Fuse Allreduce Residual RMSNorm#

Transform key: fuse_allreduce_residual_rmsnorm

Source module: tensorrt_llm._torch.auto_deploy.transform.library.collectives

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.collectives.FuseAllreduceResidualRMSNorm(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Fuse (allreduce + residual add + RMSNorm) into one fused op with tuple output.

This transform only applies when TRT-LLM ops are used (MPI mode), as it provides optimized fused kernels. The torch backend (demollm mode) does not benefit from this fusion and uses unfused operations.

Note: This transform expects torch_rmsnorm ops in the graph, which are created by the match_rmsnorm_pattern transform that runs earlier in the pipeline.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse RMSNorm#

Transform key: fuse_rmsnorm

Source module: tensorrt_llm._torch.auto_deploy.transform.library.rms_norm

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.rms_norm.FuseRMSNorm(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Fuses torch_rmsnorm ops with the selected backend implementation.

This transform runs in the post_load_fusion stage and replaces torch_rmsnorm ops with the specified backend implementation (flashinfer, triton, or torch).

Parameters:
  • gm – Input graph module to transform.

  • rmsnorm_backend – Backend to use for regular RMSNorm computation (“flashinfer”, “triton”, or “torch”).

  • gated_rmsnorm_backend – Backend to use for gated RMSNorm computation (currently only “triton”).

Returns:

Transformed graph module with backend-specific RMSNorm operations.

classmethod get_config_class() Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.rms_norm.FuseRMSNormConfig[source]

Bases: TransformConfig

Configuration for the RMSNorm fusion transform.

Show JSON schema
{
   "title": "FuseRMSNormConfig",
   "description": "Configuration for the RMSNorm fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "rmsnorm_backend": {
         "default": "flashinfer",
         "description": "Backend to use for RMSNorm computation ('flashinfer', 'triton', or 'torch').",
         "title": "Rmsnorm Backend",
         "type": "string"
      },
      "gated_rmsnorm_backend": {
         "default": "triton",
         "description": "Backend to use for gated RMSNorm computation (currently only 'triton').",
         "title": "Gated Rmsnorm Backend",
         "type": "string"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • gated_rmsnorm_backend (str)

  • rmsnorm_backend (str)

field gated_rmsnorm_backend: str = 'triton'

Backend to use for gated RMSNorm computation (currently only ‘triton’).

field rmsnorm_backend: str = 'flashinfer'

Backend to use for RMSNorm computation (‘flashinfer’, ‘triton’, or ‘torch’).
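For orientation, the computation that every RMSNorm backend has to reproduce is the standard RMSNorm formula, x / sqrt(mean(x^2) + eps) * weight. The following is a minimal pure-Python reference over a single hidden vector. The function name is hypothetical and the code does not reflect how any of the listed backends is implemented.

```python
import math
from typing import List, Sequence


def rms_norm_reference(
    x: Sequence[float],
    weight: Sequence[float],
    eps: float = 1e-6,
) -> List[float]:
    """Textbook RMSNorm over one hidden vector (illustrative reference only)."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    # Normalize by the root mean square, then apply the learned per-channel weight.
    return [v * inv_rms * w for v, w in zip(x, weight)]


hidden = [3.0, 4.0]
normalized = rms_norm_reference(hidden, weight=[1.0, 1.0], eps=0.0)
print(normalized)
```

With unit weights and eps set to zero, the mean of the squared outputs is 1, which makes a convenient sanity check when comparing backends.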

Fuse RMSNorm Quant NVFP4#

Transform key: fuse_rmsnorm_quant_nvfp4

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_rmsnorm_quant_nvfp4

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_rmsnorm_quant_nvfp4.FuseRMSNormQuantNVFP4(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Fuse NVFP4 quantization into RMSNorm producers where TRT-LLM kernels exist.

classmethod get_config_class() Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse GDN Gating#

Transform key: fuse_gdn_gating

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_gdn_gating

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_gdn_gating.FuseGdnGating(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Replaces torch_fused_gdn_gating ops with triton_fused_gdn_gating.

This transform runs in the post_load_fusion stage and swaps the pure-torch source op with a single-kernel Triton implementation, eliminating ~5 kernel launches per GDN layer.

Parameters:

gm – Input graph module to transform.

Returns:

Transformed graph module with Triton-fused GDN gating operations.

classmethod get_config_class() Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse L2Norm#

Transform key: fuse_l2norm

Source module: tensorrt_llm._torch.auto_deploy.transform.library.l2_norm

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.l2_norm.FuseL2Norm(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Fuses torch_l2norm ops with the selected backend implementation.

This transform runs in the post_load_fusion stage and replaces torch_l2norm ops with the specified backend implementation (fla or torch).

Parameters:
  • gm – Input graph module to transform.

  • backend – Backend to use for L2Norm computation (“fla” or “torch”).

Returns:

Transformed graph module with backend-specific L2Norm operations.

classmethod get_config_class() Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.l2_norm.FuseL2NormConfig[source]

Bases: TransformConfig

Configuration for the L2Norm fusion transform.

Show JSON schema
{
   "title": "FuseL2NormConfig",
   "description": "Configuration for the L2Norm fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "default": "fla",
         "description": "Backend to use for L2Norm computation ('fla' or 'torch').",
         "enum": [
            "torch",
            "fla"
         ],
         "title": "Backend",
         "type": "string"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • backend (Literal['torch', 'fla'])

field backend: Literal['torch', 'fla'] = 'fla'

Backend to use for L2Norm computation (‘fla’ or ‘torch’).

Fuse SiLU Mul#

Transform key: fuse_silu_mul

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_silu_mul

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_silu_mul.FuseSiluMul(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Fuse narrow+silu+mul into a single silu_and_mul op after GEMM fusion.

Detects the pattern:

    gate = narrow(x, -1, 0, size)
    up = narrow(x, -1, size, size)
    hidden = silu(gate) * up

And replaces it with:

hidden = silu_and_mul(x)

When backend='trtllm', also detects if the sole consumer is an FP8 linear and fuses quantization into the kernel (eliminating a separate scaleMatrix pass).

This runs as a post_load_fusion pass, after GEMM fusion has combined gate+up projections.
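The equivalence between the matched pattern and its replacement can be illustrated on a single 1-D row. This sketch uses plain Python lists in place of tensors. Here narrow is a stand-in for torch.narrow along the last dimension, and silu_and_mul is a hypothetical reference function, not the flashinfer or trtllm kernel.

```python
import math
from typing import List, Sequence


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def narrow(x: Sequence[float], start: int, length: int) -> List[float]:
    """1-D stand-in for torch.narrow(x, -1, start, length)."""
    return list(x[start:start + length])


def unfused_silu_mul(x: Sequence[float]) -> List[float]:
    """The pattern the transform detects: narrow + silu + mul."""
    size = len(x) // 2
    gate = narrow(x, 0, size)
    up = narrow(x, size, size)
    return [silu(g) * u for g, u in zip(gate, up)]


def silu_and_mul(x: Sequence[float]) -> List[float]:
    """Single-pass reference for the fused op: first half is gate, second half is up."""
    if len(x) % 2 != 0:
        raise ValueError("last dimension must be even (gate and up halves)")
    size = len(x) // 2
    return [silu(x[i]) * x[size + i] for i in range(size)]


fused_gemm_output = [1.0, -2.0, 0.5, 3.0]  # [gate_0, gate_1, up_0, up_1]
unfused = unfused_silu_mul(fused_gemm_output)
fused = silu_and_mul(fused_gemm_output)
print(unfused == fused)
```

The fused form avoids materializing the two narrowed views and the intermediate silu(gate) result, which is where the kernel-launch savings come from.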

classmethod get_config_class() Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_silu_mul.FuseSiluMulConfig[source]

Bases: TransformConfig

Configuration for the SiLU+Mul fusion transform.

Show JSON schema
{
   "title": "FuseSiluMulConfig",
   "description": "Configuration for the SiLU+Mul fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "default": "flashinfer",
         "description": "Backend for fused SiLU+Mul kernel. 'flashinfer' (default) or 'trtllm' (faster, supports fused FP8 quant).",
         "title": "Backend",
         "type": "string"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • backend (str)

field backend: str = 'flashinfer'

Backend for fused SiLU+Mul kernel. ‘flashinfer’ (default) or ‘trtllm’ (faster, supports fused FP8 quant).

Fuse SwiGLU#

Transform key: fuse_swiglu

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.FuseSwiGLU(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Fuses torch_swiglu_mlp ops by concatenating gate and up weights.

This transform runs in the post_load_fusion stage and replaces torch_swiglu_mlp ops with fused_swiglu_mlp ops that use a single concatenated gate+up weight matrix.

This reduces memory bandwidth by performing a single matmul instead of two separate matmuls for gate and up projections.
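The weight concatenation can be sketched with small Python lists. The sketch assumes weights stored as [out_features, in_features] rows and a gate-then-up concatenation order. Both are assumptions made for illustration, and the function names are hypothetical stand-ins for torch_swiglu_mlp and fused_swiglu_mlp.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def matvec(weight: Matrix, x: Sequence[float]) -> List[float]:
    """Apply a linear layer whose weight has shape [out_features, in_features]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swiglu_mlp_unfused(x, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> List[float]:
    """Two separate projections read the input twice."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def fuse_gate_up_weights(w_gate: Matrix, w_up: Matrix) -> Matrix:
    """Concatenate along the output dimension (assumed order: gate, then up)."""
    return [list(row) for row in w_gate] + [list(row) for row in w_up]


def swiglu_mlp_fused(x, w_gate_up: Matrix, w_down: Matrix) -> List[float]:
    """One projection, then split the result into gate and up halves."""
    gate_up = matvec(w_gate_up, x)
    size = len(gate_up) // 2
    hidden = [silu(gate_up[i]) * gate_up[size + i] for i in range(size)]
    return matvec(w_down, hidden)


x = [1.0, 2.0]
w_gate = [[0.5, -1.0], [1.0, 0.0]]
w_up = [[2.0, 0.5], [0.0, 1.0]]
w_down = [[1.0, 1.0]]

reference = swiglu_mlp_unfused(x, w_gate, w_up, w_down)
fused = swiglu_mlp_fused(x, fuse_gate_up_weights(w_gate, w_up), w_down)
print(reference == fused)
```

Because the concatenation happens once, after the weights are loaded, the runtime graph only ever sees the single stacked weight.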

classmethod get_config_class() Type[FuseSwiGLUConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.FuseSwiGLUConfig[source]

Bases: TransformConfig

Configuration for the SwiGLU fusion transform.

Show JSON schema
{
   "title": "FuseSwiGLUConfig",
   "description": "Configuration for the SwiGLU fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable SwiGLU fusion.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • enabled (bool)

field enabled: bool = True

Whether to enable SwiGLU fusion.

Fuse Add Rms Norm#

Transform key: fuse_add_rms_norm

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_add_rms_norm

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_add_rms_norm.FuseAddRMSNorm(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Fuse (add + optional cast + RMSNorm) into one fused op.

Uses direct FX graph manipulation instead of the inductor pattern matcher to correctly handle patterns where intermediate nodes (add, rms_norm) have multiple users in the graph.

Pattern 1 (without cast):

    %add = aten.add(%x, %residual)
    %norm = flashinfer_rms_norm(%add, %weight, eps)

Pattern 2 (with cast):

    %add = aten.add(%x, %residual)
    %cast = aten.to.dtype(%add, bfloat16)
    %norm = flashinfer_rms_norm(%cast, %weight, eps)

Both are replaced with:

    %fused = flashinfer_fused_add_rms_norm(%x, %residual, %weight, eps)
    %norm_out = getitem(%fused, 0)  # norm result (replaces %norm)
    %add_out = getitem(%fused, 1)  # add result (replaces %add)
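The two-output contract of the fused op can be illustrated in plain Python. This is a sketch of the semantics only: fused_add_rms_norm is a hypothetical reference function operating on lists, the optional cast is omitted, and it says nothing about how the flashinfer kernel is implemented.

```python
import math
from typing import List, Sequence, Tuple


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float) -> List[float]:
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def fused_add_rms_norm(
    x: Sequence[float],
    residual: Sequence[float],
    weight: Sequence[float],
    eps: float,
) -> Tuple[List[float], List[float]]:
    """Return (norm result, add result), mirroring getitem indices 0 and 1."""
    added: List[float] = []
    sum_squares = 0.0
    for a, b in zip(x, residual):
        s = a + b
        added.append(s)
        sum_squares += s * s  # accumulate while the sum is still at hand
    inv_rms = 1.0 / math.sqrt(sum_squares / len(added) + eps)
    norm = [s * inv_rms * w for s, w in zip(added, weight)]
    return norm, added


x = [1.0, 2.0, 3.0]
residual = [0.5, -1.0, 1.0]
weight = [1.0, 2.0, 0.5]

norm_out, add_out = fused_add_rms_norm(x, residual, weight, eps=1e-6)
expected_norm = rms_norm([a + b for a, b in zip(x, residual)], weight, eps=1e-6)
print(add_out, norm_out)
```

Returning the add result alongside the norm result is what lets the rewrite handle graphs where %add has other users, such as the next layer's residual connection.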

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse RMSNorm Quant FP8#

Transform key: fuse_rmsnorm_quant_fp8

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_rmsnorm_quant_fp8

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_rmsnorm_quant_fp8.FuseRMSNormQuantFP8(
config: TransformConfig,
)[source]#

Bases: BaseTransform

classmethod get_config_class() Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Gather Logits Before Lm Head#

Transform key: gather_logits_before_lm_head

Source module: tensorrt_llm._torch.auto_deploy.transform.library.gather_logits_before_lm_head

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.gather_logits_before_lm_head.GatherLogitsBeforeLmHeadTransform(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Transform to gather hidden states before LM head using logits_gather_mask.

This transform inserts a gather operation before the LM head linear layer to select only the hidden states that need logits computed. In decode-only mode the output always has shape [b, hidden_size], which keeps it compatible with CUDA graphs.

Benefits:
  • Reduces computation by only computing logits for needed tokens
  • Eliminates Python loop overhead
  • Enables CUDA graph capture of the gather
  • Moves gather into the graph for better optimization
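The computational saving comes from the LM head being a row-wise projection: selecting rows before the projection gives the same logits as projecting every row and selecting afterwards. The sketch below demonstrates that equivalence with plain Python lists. The helper names and the boolean form of logits_gather_mask are assumptions for illustration; the real transform operates on tensors inside the FX graph.

```python
from typing import List

Matrix = List[List[float]]


def linear(hidden: Matrix, weight: Matrix) -> Matrix:
    """Toy LM head: logits[t][v] = dot(hidden[t], weight[v])."""
    return [
        [sum(h * w for h, w in zip(row, w_row)) for w_row in weight]
        for row in hidden
    ]


def gather_rows(rows: Matrix, mask: List[bool]) -> Matrix:
    """Keep only the rows whose mask entry is True."""
    return [row for row, keep in zip(rows, mask) if keep]


# Four token positions with hidden_size = 2, vocabulary of 3 entries.
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0], [1.0, 1.0]]
lm_head_weight = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.0]]
logits_gather_mask = [False, False, False, True]  # illustrative: last token only

# Gather first, then project: one row goes through the LM head.
gathered_logits = linear(gather_rows(hidden, logits_gather_mask), lm_head_weight)

# Project everything, then select: four rows go through the LM head.
full_then_select = gather_rows(linear(hidden, lm_head_weight), logits_gather_mask)
```

Both paths yield identical logits, but the first one multiplies a single row by the vocabulary matrix instead of all four.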

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Fuse RoPE Into TRT-LLM Attention#

Transform key: fuse_rope_into_trtllm_attention

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_rope_into_trtllm_attention

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_rope_into_trtllm_attention.FuseRopeIntoTrtllmAttention(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Fuse RoPE into trtllm attention by rewiring Q/K and storing rope metadata.

Runs at post_load_fusion before optimize_rope, matching the backend-agnostic torch_rope_* IR ops directly with real (non-meta) weights. DCE after this transform removes dead rope nodes; optimize_rope then handles remaining torch_rope_* for non-trtllm backends.

Stores the thop-format cos_sin tensor in attn_node.meta. TrtllmAttention.prepare_node_for_cache_insertion at cache_init materializes it as a graph node.

Disabled by default; enable in model configs that use attn_backend: trtllm.

classmethod get_config_class() Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fuse_rope_into_trtllm_attention.FuseRopeIntoTrtllmAttentionConfig[source]

Bases: TransformConfig

Configuration for fuse_rope_into_trtllm_attention.

Show JSON schema
{
   "title": "FuseRopeIntoTrtllmAttentionConfig",
   "description": "Configuration for ``fuse_rope_into_trtllm_attention``.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "fuse_qkv_passthrough": {
         "default": true,
         "description": "When the pre-RoPE Q/K/V trace back to a single fused QKV GEMM, rewire all three to that flat tensor so ``trtllm_mha_with_cache`` can skip the per-layer split \u2192 reshape \u2192 cat path.",
         "title": "Fuse Qkv Passthrough",
         "type": "boolean"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • fuse_qkv_passthrough (bool)

field fuse_qkv_passthrough: bool = True

When the pre-RoPE Q/K/V trace back to a single fused QKV GEMM, rewire all three to that flat tensor so trtllm_mha_with_cache can skip the per-layer split → reshape → cat path.
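As a rough illustration of how the schema above constrains a config entry, the following sketch checks a dictionary against the documented rules: stage is required and must be one of the enumerated stages, the boolean fields fall back to their schema defaults, and unknown keys are accepted because additionalProperties is true. The helper resolve_entry is hypothetical and is not how AutoDeploy validates configs (that is done by the pydantic model itself).

```python
from typing import Any, Dict

# Ordered stage names copied from the schema's "Stages" enum.
STAGES = [
    "factory", "export", "post_export", "pattern_matcher", "sharding",
    "weight_load", "post_load_fusion", "cache_init", "visualize", "compile",
]

# Boolean fields and defaults copied from the schema's "properties".
BOOL_DEFAULTS = {
    "run_per_gm": True,
    "enabled": True,
    "skip_on_error": False,
    "run_graph_cleanup": True,
    "run_shape_prop": False,
    "requires_clean_graph": True,
    "requires_shape_prop": False,
    "expect_mem_change": False,
    "fuse_qkv_passthrough": True,
}


def resolve_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: apply schema defaults and basic checks."""
    if "stage" not in entry:
        raise ValueError("'stage' is required")
    if entry["stage"] not in STAGES:
        raise ValueError(f"unknown stage: {entry['stage']!r}")
    resolved: Dict[str, Any] = dict(BOOL_DEFAULTS)
    resolved["debug_visualize_dir"] = None
    for key, value in entry.items():
        if key in BOOL_DEFAULTS and not isinstance(value, bool):
            raise TypeError(f"{key} must be a boolean")
        resolved[key] = value  # extra keys are allowed by the schema
    return resolved


# Example entry: enable the transform explicitly and turn off the
# QKV passthrough rewiring.
entry = {"stage": "post_load_fusion", "enabled": True, "fuse_qkv_passthrough": False}
resolved = resolve_entry(entry)

# The stage enum is ordered, so post_load_fusion precedes cache_init.
runs_before_cache_init = STAGES.index("post_load_fusion") < STAGES.index("cache_init")
```

The ordering check mirrors the description above: the transform stores rope metadata at post_load_fusion, and the cache insertion step consumes it later at cache_init.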

Fuse RoPE Into TRT-LLM MLA#

Transform key: fuse_rope_into_trtllm_mla

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_rope_mla

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_rope_mla.FuseRopeIntoTrtllmMLA(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Fuse RoPE into TRT-LLM MLA attention for decode performance.

Runs at post_load_fusion before optimize_rope, matching the backend-agnostic torch_rope_* IR ops on torch_mla source nodes. Rewires q_pe/kpe to pre-RoPE inputs and stashes the rotary_cos_sin tensor in node.meta for later materialization at cache_init by TrtllmMLAAttention.prepare_node_for_cache_insertion.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Optimize RoPE#

Transform key: optimize_rope

Source module: tensorrt_llm._torch.auto_deploy.transform.library.rope

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.rope.OptimizeRope(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Scan the FX graph and replace calls to the torch-reference RoPE ops with optimized kernels:
  • torch_rope_with_explicit_cos_sin → flashinfer_rope
  • torch_rope_with_complex_freqs → flashinfer_rope
  • torch_rope_with_qk_interleaving → triton_rope_on_interleaved_qk_inputs

Precomputes positional IDs and the fused cosine-sine cache as explicit nodes, and reuses those nodes when possible.
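The replacement and node-reuse behavior can be pictured with a toy model in which the graph is just an ordered list of op names. The mapping below uses the op names from the description above; everything else is an assumption for illustration, including the node names position_ids and cos_sin_cache, the list-based graph, and the choice that only the flashinfer_rope path consumes the precomputed nodes.

```python
from typing import List

# Reference op -> optimized kernel, as listed in the description above.
ROPE_REPLACEMENTS = {
    "torch_rope_with_explicit_cos_sin": "flashinfer_rope",
    "torch_rope_with_complex_freqs": "flashinfer_rope",
    "torch_rope_with_qk_interleaving": "triton_rope_on_interleaved_qk_inputs",
}


def optimize_rope(ops: List[str]) -> List[str]:
    """Toy rewrite over a flat op list (the real transform walks an FX graph)."""
    out: List[str] = []
    cache_emitted = False
    for op in ops:
        target = ROPE_REPLACEMENTS.get(op)
        if target is None:
            out.append(op)  # not a RoPE reference op, keep as-is
            continue
        # Assumption: the shared nodes are created once, on first use,
        # and every later flashinfer_rope call reuses them.
        if target == "flashinfer_rope" and not cache_emitted:
            out.append("position_ids")   # hypothetical node name
            out.append("cos_sin_cache")  # hypothetical node name
            cache_emitted = True
        out.append(target)
    return out


ops = [
    "linear",
    "torch_rope_with_explicit_cos_sin",
    "attention",
    "torch_rope_with_complex_freqs",
    "torch_rope_with_qk_interleaving",
]
rewritten = optimize_rope(ops)
```

In this sketch the second flashinfer_rope call adds no new precompute nodes, which is the reuse the description refers to.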

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

MLIR Elementwise Fusion#

Transform key: mlir_elementwise_fusion

Source module: tensorrt_llm._torch.auto_deploy.transform.library.mlir_elementwise_fusion

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.mlir_elementwise_fusion.MLIRElementwiseFusion(
config: TransformConfig,
)[source]#

Bases: BaseTransform

Unified MLIR elementwise fusion: decompose + discover + codegen + replace.

This transform:
  1. Converts the FX graph to MLIR (xDSL) using the ad dialect
  2. Decomposes high-level ops into elementwise primitives
  3. Discovers maximal fusible subgraphs
  4. Generates Triton kernels for each discovered subgraph
  5. Replaces subgraph ops in MLIR with fused opaque ops
  6. Converts MLIR back to FX with generated kernel calls

Requires pip install xdsl. Skipped gracefully if xDSL is not installed.
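The discovery and replacement steps described above can be illustrated on a heavily simplified model: a linear chain of op names in which each maximal run of consecutive elementwise ops is collapsed into one fused op. The elementwise set, the min_size threshold, and the fused op naming are assumptions for illustration; the real transform works on a dataflow graph in MLIR and generates Triton kernels, neither of which is modeled here.

```python
from typing import List

# Illustrative set of elementwise primitives (an assumption, not the ad dialect).
ELEMENTWISE = {"add", "sub", "mul", "neg", "exp", "sigmoid"}


def fuse_elementwise_runs(ops: List[str], min_size: int = 2) -> List[str]:
    """Collapse each maximal run of elementwise ops into one fused op name."""
    out: List[str] = []
    run: List[str] = []

    def flush() -> None:
        # A run is only worth fusing if it contains at least min_size ops.
        if len(run) >= min_size:
            out.append("fused_" + "_".join(run))
        else:
            out.extend(run)
        run.clear()

    for op in ops:
        if op in ELEMENTWISE:
            run.append(op)  # extend the current run as far as possible
        else:
            flush()         # a non-elementwise op ends the run
            out.append(op)
    flush()
    return out


ops = ["matmul", "add", "sigmoid", "mul", "matmul", "neg", "softmax"]
fused_ops = fuse_elementwise_runs(ops)
```

Growing each run until a non-elementwise op is reached is what makes the discovered groups maximal in this toy setting; a lone elementwise op such as neg is left untouched because fusing it alone gains nothing.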

classmethod get_config_class() Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.mlir_elementwise_fusion.MLIRElementwiseFusionConfig[source]

Bases: TransformConfig

Configuration for the MLIR elementwise fusion transform.

Show JSON schema
{
   "title": "MLIRElementwiseFusionConfig",
   "description": "Configuration for the MLIR elementwise fusion transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "bypass_ops": {
         "description": "Op names to skip during decomposition (reserved for future use).",
         "items": {
            "type": "string"
         },
         "title": "Bypass Ops",
         "type": "array"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

Fields:
  • bypass_ops (List[str])

field bypass_ops: List[str] [Optional]

Op names to skip during decomposition (reserved for future use).