Pattern Matching Stage#

Pattern matching canonicalizes model-specific PyTorch graphs into AutoDeploy’s standard graph representation. These transforms identify attention, MoE, normalization, quantization, activation, and layout patterns before sharding and post-load fusion run.

Match MoE Pattern#

Transform key: match_moe_pattern

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.MatchSimpleMoePattern( config: TransformConfig, )[source]#

Bases: MatchMoePattern

Match and fuse simple (unquantized) MoE subgraph.

scale_arg_indices() → Dict[str, int][source]#: Map scale names -> arg index in the matched linear op.

scale_keys() → List[str][source]#: Order of scale keys to emit into fused MoE op (e.g., ['input_scale', 'weight_scale', …]).

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match Dense MoE Pattern#

Transform key: match_dense_moe_pattern

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4.MatchMXFP4MoePattern( config: TransformConfig, )[source]#

Bases: BaseTransform

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match Bmm MoE Pattern#

Transform key: match_bmm_moe_pattern

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.MatchBmmMoePattern( config: TransformConfig, )[source]#

Bases: BaseTransform

Match and fuse Llama4 MoE pattern with pre-stacked weight tensors.

This pattern uses batch matrix multiply (BMM) operations for parallel expert computation with weights already stacked across the expert dimension.

Only matches patterns where topk uses k=1 (single expert per token).

classmethod get_config_class()[source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.MatchBmmMoePatternConfig[source]

Bases: TransformConfig

Configuration for MatchBmmMoePattern transform.

JSON schema:

{
   "title": "MatchBmmMoePatternConfig",
   "description": "Configuration for MatchBmmMoePattern transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:

extra: str = allow

Fields:

debug_visualize_dir (str | None)
enabled (bool)
expect_mem_change (bool)
requires_clean_graph (bool)
requires_shape_prop (bool)
run_graph_cleanup (bool)
run_per_gm (bool)
run_shape_prop (bool)
skip_on_error (bool)
stage (tensorrt_llm._torch.auto_deploy.transform.interface.Stages)

field debug_visualize_dir: str | None = None: Debug visualization directory. None to disable visualization, or a path string to specify the output directory.

field enabled: bool = True: Whether to enable this transform.

field expect_mem_change: bool = False: Whether this transform is expected to cause changes in CUDA memory stats.

field requires_clean_graph: bool = True: Whether this transform requires the graph to be clean before it is applied.

field requires_shape_prop: bool = False: Whether this transform requires shape propagation before it is applied.

field run_graph_cleanup: bool = True: Whether to run graph cleanup/canonicalization after this transform.

field run_per_gm: bool = True: Whether to run the transform per graph (sub)module or on whole module.

field run_shape_prop: bool = False: Whether to run shape propagation after this transform.

field skip_on_error: bool = False: Whether to skip the transform if an error occurs.

field stage: Stages [Required]: The stage of the transformation pipeline where this transform should run.

Split MoE Fused For Sharding#

Transform key: split_moe_fused_for_sharding

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.SplitMoeFusedForSharding( config: TransformConfig, )[source]#

Bases: BaseTransform

Convert torch_moe_fused nodes to list-based torch_moe nodes before sharding.

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe.SplitMoeFusedForShardingConfig[source]

Bases: TransformConfig

Configuration for converting torch_moe_fused to torch_moe pre-sharding.

JSON schema:

{
   "title": "SplitMoeFusedForShardingConfig",
   "description": "Configuration for converting torch_moe_fused to torch_moe pre-sharding.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:

extra: str = allow

Fields:

debug_visualize_dir (str | None)
enabled (bool)
expect_mem_change (bool)
requires_clean_graph (bool)
requires_shape_prop (bool)
run_graph_cleanup (bool)
run_per_gm (bool)
run_shape_prop (bool)
skip_on_error (bool)
stage (tensorrt_llm._torch.auto_deploy.transform.interface.Stages)

field debug_visualize_dir: str | None = None: Debug visualization directory. None to disable visualization, or a path string to specify the output directory.

field enabled: bool = True: Whether to enable this transform.

field expect_mem_change: bool = False: Whether this transform is expected to cause changes in CUDA memory stats.

field requires_clean_graph: bool = True: Whether this transform requires the graph to be clean before it is applied.

field requires_shape_prop: bool = False: Whether this transform requires shape propagation before it is applied.

field run_graph_cleanup: bool = True: Whether to run graph cleanup/canonicalization after this transform.

field run_per_gm: bool = True: Whether to run the transform per graph (sub)module or on whole module.

field run_shape_prop: bool = False: Whether to run shape propagation after this transform.

field skip_on_error: bool = False: Whether to skip the transform if an error occurs.

field stage: Stages [Required]: The stage of the transformation pipeline where this transform should run.

Match Repeat KV#

Transform key: match_repeat_kv

Source module: tensorrt_llm._torch.auto_deploy.transform.library.attention

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchRepeatKV( config: TransformConfig, )[source]#

Bases: BaseTransform

Match and replace the repeat_kv pattern with torch.ops.auto_deploy.torch_attention_repeat_kv.
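As a rough illustration of the pattern being matched, the sketch below mimics the usual repeat_kv helper on plain Python lists: each key/value head is repeated n_rep times, with the copies kept adjacent, so that the head count lines up with the query heads. The name `repeat_kv_heads` and the list-of-heads representation are assumptions made for this example; the real pattern operates on tensors inside the exported graph.

```python
from typing import List, TypeVar

T = TypeVar("T")


def repeat_kv_heads(kv_heads: List[T], n_rep: int) -> List[T]:
    """Repeat every KV head n_rep times, keeping the copies adjacent.

    Hypothetical helper: heads are opaque list items here, not tensors.
    """
    if n_rep < 1:
        raise ValueError("n_rep must be >= 1")
    return [head for head in kv_heads for _ in range(n_rep)]


# Two KV heads expanded to serve six query heads (n_rep = 3).
expanded = repeat_kv_heads(["kv0", "kv1"], n_rep=3)
print(expanded)
```

With n_rep=1 the list is returned unchanged, which corresponds to plain multi-head attention where no repetition is needed.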

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match Eager Attention#

Transform key: match_eager_attention

Source module: tensorrt_llm._torch.auto_deploy.transform.library.attention

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchEagerAttention( config: TransformConfig, )[source]#

Bases: BaseTransform

Match and replace the eager attention pattern with torch.ops.auto_deploy.torch_attention_sdpa.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match Sdpa To Torch Attention#

Transform key: match_sdpa_to_torch_attention

Source module: tensorrt_llm._torch.auto_deploy.transform.library.attention

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchSDPAToTorchAttention( config: TransformConfig, )[source]#

Bases: BaseTransform

Match and replace SDPA patterns to torch.ops.auto_deploy.torch_attention.

This handles:

- sdpa --> torch_attention
- repeat_kv + sdpa --> torch_attention

This transform should run BEFORE MatchRepeatKVWithTorchAttention (transform key match_grouped_attention) to ensure SDPA calls are converted first.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match Grouped Attention#

Transform key: match_grouped_attention

Source module: tensorrt_llm._torch.auto_deploy.transform.library.attention

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchRepeatKVWithTorchAttention( config: TransformConfig, )[source]#

Bases: BaseTransform

Match and replace repeat_kv + torch_attention patterns to torch_attention.

This handles:

- repeat_kv + torch_attention --> torch_attention (removes redundant repeat_kv)
- torch_attention --> torch_attention (identity, catches any remaining patterns)

This transform should run AFTER match_sdpa_to_torch_attention to ensure we match the repeat_kv + torch_attention pattern correctly.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match Attention Layout#

Transform key: match_attention_layout

Source module: tensorrt_llm._torch.auto_deploy.transform.library.attention

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchAttentionLayout( config: TransformConfig, )[source]#

Bases: BaseTransform

Convert unified torch_attention calls that use layout='bnsd' (whether passed explicitly, positionally, or by default) into calls with layout='bsnd', inserting the required Q/K/V transposes and transposing the output back to bnsd.
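To make the layout change concrete, here is a minimal sketch of the bnsd to bsnd conversion on nested Python lists. Swapping axes 1 and 2 (heads and sequence) is its own inverse, which is why the same operation can be applied to Q/K/V before the op and to the output afterwards. The helper name `swap_head_and_seq` and the nested-list tensor are illustrative assumptions, not part of the AutoDeploy API.

```python
from typing import List

Tensor4D = List[List[List[List[float]]]]


def swap_head_and_seq(x: Tensor4D) -> Tensor4D:
    """Swap axes 1 and 2 of a nested-list tensor (bnsd <-> bsnd)."""
    return [
        [[x_b[i][j] for i in range(len(x_b))] for j in range(len(x_b[0]))]
        for x_b in x
    ]


# One batch, 2 heads, 3 positions, head_dim 1: q_bnsd[b][n][s] == [10 * n + s].
q_bnsd = [[[[float(10 * n + s)] for s in range(3)] for n in range(2)]]

q_bsnd = swap_head_and_seq(q_bnsd)    # transpose applied to the op's inputs
restored = swap_head_and_seq(q_bsnd)  # transpose applied to the op's output
```

After the swap, the element for head 1 at position 2 is found at `q_bsnd[0][2][1]`, and applying the swap twice returns the original tensor.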

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.attention.MatchAttentionLayoutConfig[source]

Bases: TransformConfig

Configuration for the match attention layout transform.

JSON schema:

{
   "title": "MatchAttentionLayoutConfig",
   "description": "Configuration for the match attention layout transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "attn_layout": {
         "description": "Layout expected by the attention backend.",
         "enum": [
            "bsnd",
            "bnsd"
         ],
         "title": "Attn Layout",
         "type": "string"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage",
      "attn_layout"
   ]
}

Config:

extra: str = allow

Fields:

attn_layout (Literal['bsnd', 'bnsd'])

field attn_layout: Literal['bsnd', 'bnsd'] [Required]: Layout expected by the attention backend.
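The constraints in the schema above can be restated as a small hand-written check. This is only a sketch of what the schema requires (both `stage` and `attn_layout` present, each drawn from its enum); it is not the pydantic validation that AutoDeploy actually performs, and the function name `check_match_attention_layout_entry` is hypothetical.

```python
from typing import Any, Dict, List

# Values copied from the JSON schema above.
STAGES = [
    "factory", "export", "post_export", "pattern_matcher", "sharding",
    "weight_load", "post_load_fusion", "cache_init", "visualize", "compile",
]
ATTN_LAYOUTS = ("bsnd", "bnsd")


def check_match_attention_layout_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = []
    for key in ("stage", "attn_layout"):
        if key not in entry:
            problems.append(f"missing required field: {key}")
    if "stage" in entry and entry["stage"] not in STAGES:
        problems.append(f"unknown stage: {entry['stage']!r}")
    if "attn_layout" in entry and entry["attn_layout"] not in ATTN_LAYOUTS:
        problems.append(f"unknown attn_layout: {entry['attn_layout']!r}")
    return problems


good_entry = {"stage": "pattern_matcher", "attn_layout": "bsnd"}
bad_entry = {"stage": "pattern_matcher", "attn_layout": "sbnd"}
print(check_match_attention_layout_entry(good_entry))
print(check_match_attention_layout_entry(bad_entry))
```

Extra keys are deliberately not rejected here, mirroring the `extra = allow` setting shown above.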

Match RoPE Pattern#

Transform key: match_rope_pattern

Source module: tensorrt_llm._torch.auto_deploy.transform.library.rope

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.rope.MatchRopePattern( config: TransformConfig, )[source]#

Bases: BaseTransform

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match RoPE Layout#

Transform key: match_rope_layout

Source module: tensorrt_llm._torch.auto_deploy.transform.library.rope

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.rope.MatchRopeLayout( config: TransformConfig, )[source]#

Bases: BaseTransform

Match the inputs and outputs of RoPE ops and transform them to the specified layout, as required by the optimized ops. The supported layout is 'bsnd' (batch, seq, head, dim).

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.rope.MatchRopeLayoutConfig[source]

Bases: TransformConfig

Configuration for the match rope layout transform.

JSON schema:

{
   "title": "MatchRopeLayoutConfig",
   "description": "Configuration for the match rope layout transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "expected_layout": {
         "default": "bsnd",
         "description": "The expected layout of the rope operation. Must be one of 'bsnd' or 'bnsd'.",
         "title": "Expected Layout",
         "type": "string"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:

extra: str = allow

Fields:

expected_layout (str)

field expected_layout: str = 'bsnd': The expected layout of the rope operation. Must be one of 'bsnd' or 'bnsd'.

Match RMSNorm Pattern#

Transform key: match_rmsnorm_pattern

Source module: tensorrt_llm._torch.auto_deploy.transform.library.rms_norm

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.rms_norm.MatchRMSNormPattern( config: TransformConfig, )[source]#

Bases: BaseTransform

Matches RMSNorm patterns in the graph and replaces them with torch_rmsnorm op.

This transform runs in the pattern_matcher stage and standardizes RMSNorm patterns to use torch_rmsnorm op, which can later be fused to a specific backend in the post_load_fusion stage.

Parameters: gm – Input graph module to transform.
Returns: Transformed graph module with standardized torch_rmsnorm operations.
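For reference, the computation that an RMSNorm subgraph performs can be sketched in plain Python for a single vector: divide by the root mean square of the elements and apply a learned per-element weight. The default `eps` value and its placement inside the square root follow the common formulation and are assumptions here; individual models may differ in details such as the dtype in which the reduction is done.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Reference RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


# With unit weights and eps = 0, the output has a mean square of 1.
rms_out = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
print(rms_out)
```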

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match L2Norm Pattern#

Transform key: match_l2norm_pattern

Source module: tensorrt_llm._torch.auto_deploy.transform.library.l2_norm

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.l2_norm.MatchL2NormPattern( config: TransformConfig, )[source]#

Bases: BaseTransform

Matches L2Norm patterns in the graph and replaces them with torch_l2norm op.

This transform runs in the pattern_matcher stage and standardizes L2Norm patterns to use torch_l2norm op, which can later be fused to a specific backend in the post_load_fusion stage.

Parameters: gm – Input graph module to transform.
Returns: Transformed graph module with standardized torch_l2norm operations.
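An L2 normalization differs from RMSNorm in that it divides by the Euclidean norm rather than the root mean square and has no learned weight. The sketch below shows one common formulation for a single vector; the `eps` default and its position inside the square root are assumptions for illustration.

```python
import math
from typing import List


def l2_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Reference L2 normalization over one vector: x / sqrt(sum(x^2) + eps)."""
    inv_norm = 1.0 / math.sqrt(sum(v * v for v in x) + eps)
    return [v * inv_norm for v in x]


# With eps = 0, the output is a unit-length vector pointing the same way as x.
l2_out = l2_norm([3.0, 4.0], eps=0.0)
print(l2_out)
```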

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match MoE Routing Pattern#

Transform key: match_moe_routing_pattern

Source module: tensorrt_llm._torch.auto_deploy.transform.library.moe_routing

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.moe_routing.MatchMoeRoutingPattern( config: TransformConfig, )[source]#

Bases: BaseTransform

Match softmax → topk → renormalize and replace with a fused Triton op.

This transform detects the 3-op MoE routing pattern:

routing_weights = softmax(logits, dtype=float32)
routing_weights, indices = topk(routing_weights, k)
routing_weights /= routing_weights.sum(keepdim=True)

and replaces it with:

routing_weights, indices = triton_fused_topk_softmax(logits, k)

The fused kernel exploits the equivalence topk(softmax(x)) / Σ ≡ softmax(topk(x)) and avoids computing softmax over all experts.
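The equivalence can be checked numerically with a small pure-Python sketch. Softmax is monotonic, so top-k selects the same indices before or after it, and the softmax denominator cancels when the selected probabilities are renormalized. The function names `route_unfused` and `route_fused` are hypothetical; they model the two computations on a single row of logits and are not the Triton kernel.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def topk(xs: List[float], k: int) -> Tuple[List[float], List[int]]:
    indices = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)[:k]
    return [xs[i] for i in indices], indices


def route_unfused(logits: List[float], k: int) -> Tuple[List[float], List[int]]:
    """softmax over all experts, then top-k, then renormalize."""
    weights, indices = topk(softmax(logits), k)
    total = sum(weights)
    return [w / total for w in weights], indices


def route_fused(logits: List[float], k: int) -> Tuple[List[float], List[int]]:
    """top-k on raw logits, then softmax over only the k selected values."""
    top_logits, indices = topk(logits, k)
    return softmax(top_logits), indices


logits = [0.1, 2.0, -1.0, 1.5]
w_ref, idx_ref = route_unfused(logits, k=2)
w_fused, idx_fused = route_fused(logits, k=2)
```

The fused variant exponentiates only k values instead of one per expert, which is where the saving comes from.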

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Eliminate Redundant Transposes#

Transform key: eliminate_redundant_transposes

Source module: tensorrt_llm._torch.auto_deploy.transform.library.eliminate_redundant_transposes

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.eliminate_redundant_transposes.EliminateRedundantTransposes( config: TransformConfig, )[source]#

Bases: BaseTransform

Eliminate redundant transpose operations in the graph.

This transformation identifies pairs of consecutive transpose operations with the same dimension arguments and removes both operations, as they cancel out.
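The idea can be sketched on a toy, strictly linear sequence of ops, where each op is a tuple such as `("transpose", 1, 2)`. This is an illustrative simplification: the real transform works on a graph in which nodes can have several users, and the stack-based loop below also collapses nested pairs, which is a choice of this sketch rather than something the description above states.

```python
from typing import List, Tuple

Op = Tuple  # e.g. ("transpose", 1, 2) or ("linear",)


def eliminate_redundant_transposes(ops: List[Op]) -> List[Op]:
    """Drop back-to-back transposes that use the same dimension arguments."""
    kept: List[Op] = []
    for op in ops:
        if kept and op[0] == "transpose" and kept[-1] == op:
            kept.pop()  # the pair cancels out, so remove both
        else:
            kept.append(op)
    return kept


ops = [("linear",), ("transpose", 1, 2), ("transpose", 1, 2), ("softmax",)]
print(eliminate_redundant_transposes(ops))
```

Transposes with different dimension arguments, or with another op between them, are left in place.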

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize Int4 Linear From Config#

Transform key: quantize_int4_linear_from_config

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantization.INT4LinearQuantizationFromConfig( config: TransformConfig, )[source]#

Bases: Quantization

Config-based INT4 (AWQ) for the unified ModelOpt checkpoints.

static target_op()[source]#: Returns the target quantization ops.

static quantize_weight( original_weight: Tensor, ) → Tensor[source]#: Returns the quantized weight from the original unquantized weight.

static scale_names() → List[str][source]#: Returns the list of names of the scales for this quantization.

static default_scales( original_weight_shape: Tuple, ) → Dict[str, Tensor][source]#: Returns a dict of the default scale values for this quantization.

static load_hook( state_dict, prefix, *args, weight_name: str, )[source]#

Unified ckpt passthrough:

weight: keep packed uint8 (N//2, K)
pre_quant_scale buffer: (K,) or ones(K) if missing
weight_scale buffer: (N, K//128) float32 (no reshape, no *7 here)

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize Int4 Gptq Linear From Config#

Transform key: quantize_int4_gptq_linear_from_config

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantization.INT4GPTQLinearQuantizationFromConfig( config: TransformConfig, )[source]#

Bases: Quantization

Config-based INT4 GPTQ quantization for GPTQ-quantized checkpoints.

GPTQ uses:

qweight: [K/8, N] int32 (8 packed int4 values per int32)
qzeros: [G, N/8] int32 (packed zero points)
scales: [G, N] float (per-group scales)
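The shapes above follow from packing eight 4-bit values into each 32-bit word. The sketch below shows one way to pack and unpack such a word and to derive the three tensor shapes. Two things are assumptions for illustration: the nibble order (lowest nibble first) and the relation G = K / group_size, where `group_size` is a parameter introduced here. The helper names are hypothetical.

```python
from typing import Dict, List, Tuple


def pack_int4(values: List[int]) -> int:
    """Pack 8 unsigned 4-bit values into one 32-bit word, lowest nibble first."""
    if len(values) != 8 or any(not 0 <= v <= 15 for v in values):
        raise ValueError("expected 8 values in the range 0..15")
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word  # unsigned bit pattern; an int32 tensor stores the same bits


def unpack_int4(word: int) -> List[int]:
    """Inverse of pack_int4."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def gptq_shapes(k: int, n: int, group_size: int) -> Dict[str, Tuple[int, int]]:
    """Tensor shapes for a K x N linear layer, assuming G = K / group_size."""
    g = k // group_size
    return {"qweight": (k // 8, n), "qzeros": (g, n // 8), "scales": (g, n)}


nibbles = [1, 2, 3, 4, 5, 6, 7, 8]
word = pack_int4(nibbles)
print(hex(word), unpack_int4(word))
print(gptq_shapes(k=4096, n=1024, group_size=128))
```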

static target_op()[source]#: Returns the target quantization ops.

static quantize_weight( original_weight: Tensor, ) → Tensor[source]#: Returns placeholder qweight tensor [K/8, N] int32.

static scale_names() → List[str][source]#: Returns the list of names of the scales for this quantization.

static default_scales( original_weight_shape: Tuple, ) → Dict[str, Tensor][source]#: Returns placeholder tensors for GPTQ scales and qzeros.

static build_custom_args_for_linear( scales: Dict[str, Node], ) → Tuple[object, ...][source]#: Build args for torch_fake_quant_int4_gptq_linear: (input, weight, bias, input_scale, weight_scale, input_zp, weight_zp) -> input_scale=[], weight_scale=[scales], input_zp=[], weight_zp=[qzeros]

static load_hook( state_dict, prefix, *args, weight_name: str, )[source]#

Load hook for GPTQ checkpoints:

qweight: keep as [K/8, N] int32
scales: [G, N] float16
qzeros: [G, N/8] int32

GPTQ checkpoint uses naming convention:

{prefix}qweight
{prefix}scales
{prefix}qzeros

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize FP8 Linear From Config#

Transform key: quantize_fp8_linear_from_config

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantization.FP8LinearQuantizationFromConfig( config: TransformConfig, )[source]#

Bases: Quantization

target_op()[source]#: Returns the target quantization ops.

quantize_weight( w: Tensor, ) → Tensor[source]#: Returns the quantized weight from the original unquantized weight.

scale_names() → List[str][source]#: Returns the list of names of the scales for this quantization.

default_scales( _shape: Tuple, ) → Dict[str, Tensor][source]#: Returns a dict of the default scale values for this quantization.

load_hook( state_dict, prefix, *args, weight_name, )[source]#: Load hook for state_dict quantization pre-processing.

convert_amax_hook( state_dict, prefix, *args, scale_name: str, amax_name: str, )[source]#: Convert amax from modelopt quantized graph to scales.
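As a hedged illustration of what an amax-to-scale conversion typically looks like for FP8, the sketch below divides the recorded absolute maximum by the largest finite value of the float8_e4m3fn format (448). Whether convert_amax_hook uses exactly this convention is an assumption; the function name `amax_to_scale` is hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def amax_to_scale(amax: float, fmt_max: float = FP8_E4M3_MAX) -> float:
    """Map a calibration amax to a scale so that amax / scale == fmt_max.

    Assumed convention for illustration: scale = amax / fmt_max.
    """
    if amax <= 0:
        raise ValueError("amax must be positive")
    return amax / fmt_max


scale = amax_to_scale(224.0)
print(scale)
```

Under this convention, dividing a tensor by its scale maps the observed value range onto the representable FP8 range.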

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize NVFP4 Linear From Config#

Transform key: quantize_nvfp4_linear_from_config

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantization.NVFP4LinearQuantizationFromConfig( config: TransformConfig, )[source]#

Bases: Quantization

target_op()[source]#: Returns the target quantization ops.

quantize_weight( w: Tensor, ) → Tensor[source]#: Returns the quantized weight from the original unquantized weight.

scale_names() → List[str][source]#: Returns the list of names of the scales for this quantization.

default_scales( original_weight_shape: Tuple, ) → Dict[str, Tensor][source]#: Returns a dict of the default scale values for this quantization.

load_hook( state_dict, prefix, *args, weight_name, )[source]#: Load hook for state_dict quantization pre-processing.

convert_amax_hook( state_dict, prefix, *args, scale_name: str, amax_name: str, )[source]#: Convert amax from modelopt quantized graph to scales.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize Finegrained FP8 Linear From Config#

Transform key: quantize_finegrained_fp8_linear_from_config

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantization.FineGrainedFP8LinearQuantization( config: TransformConfig, )[source]#

Bases: Quantization

Quantization transform for FineGrainedFP8 (block-wise FP8) models.

This transform replaces linear ops with the FineGrainedFP8 quantized op. The FineGrained FP8 format uses per-block weight scales (weight_scale_inv) and dynamic input quantization.

Config format (from HF config.json):

"quantization_config": {
    "quant_method": "fp8",
    "weight_block_size": [128, 128],
    "modules_to_not_convert": ["lm_head"]
}
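To make the role of weight_block_size concrete, the sketch below parses a config of this form and computes how many per-block scales a weight would need. The helper name and the example layer dimensions are hypothetical, and it assumes weights are laid out as [out_features, in_features] with partial blocks rounded up.

```python
import json
import math

HF_CONFIG = json.loads("""
{
  "quantization_config": {
    "quant_method": "fp8",
    "weight_block_size": [128, 128],
    "modules_to_not_convert": ["lm_head"]
  }
}
""")


def block_scale_shape(out_features: int, in_features: int, block_size) -> tuple:
    """Number of scale entries along each weight dimension (assumes ceil division)."""
    block_out, block_in = block_size
    return (math.ceil(out_features / block_out), math.ceil(in_features / block_in))


qcfg = HF_CONFIG["quantization_config"]
scale_shape = block_scale_shape(4096, 11008, qcfg["weight_block_size"])
```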

target_op()[source]#: Returns the target quantization ops.

quantize_weight( w: Tensor, ) → Tensor[source]#: Returns the quantized weight from the original unquantized weight.

scale_names() → List[str][source]#: Returns the list of names of the scales for this quantization.

default_scales( original_weight_shape: Tuple, ) → Dict[str, Tensor][source]#: Returns a dict of the default scale values for this quantization.

load_hook( state_dict, prefix, *args, weight_name: str, )[source]#

Load hook to handle FineGrainedFP8 checkpoint format.

FineGrained FP8 checkpoints store:

weight: float8_e4m3fn tensor
weight_scale_inv: per-block scale tensor
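The relationship between the two tensors can be sketched in plain Python. This is an illustration only: it assumes the common convention that a weight element is dequantized by multiplying it with the weight_scale_inv entry of the block it falls in, and it uses nested lists with a tiny 2 x 2 block size in place of real tensors.

```python
def dequantize_blockwise(weight, weight_scale_inv, block_size):
    """Multiply each element by the scale of the block that contains it."""
    block_rows, block_cols = block_size
    return [
        [value * weight_scale_inv[r // block_rows][c // block_cols]
         for c, value in enumerate(row)]
        for r, row in enumerate(weight)
    ]


# A 2 x 4 weight split into two 2 x 2 blocks, each with its own scale.
weight = [[1.0, 2.0, 3.0, 4.0],
          [1.0, 2.0, 3.0, 4.0]]
weight_scale_inv = [[0.5, 2.0]]
dequant = dequantize_blockwise(weight, weight_scale_inv, (2, 2))
```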

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize Finegrained FP8 MoE#

Transform key: quantize_finegrained_fp8_moe

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe.QuantizeFineGrainedFP8MOE( config: TransformConfig, )[source]#

Bases: Quantization

Traverse gm, find every torch.ops.auto_deploy.torch_moe, and replace it with the FineGrainedFP8 quantized version.

This transform handles FineGrained FP8 quantization config format:

"quantization_config": {
    "quant_method": "fp8",
    "weight_block_size": [128, 128],
    "modules_to_not_convert": ["gate", "lm_head"]
}
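One way to read modules_to_not_convert is as a list of module names to leave unquantized. The sketch below uses a hypothetical helper and assumes an entry excludes a module when it equals one dot-separated component of the module's name; the actual matching rule used by the transform may differ.

```python
def is_excluded(module_name: str, modules_to_not_convert) -> bool:
    """Assumed rule: exclude when any entry equals a dot-separated name component."""
    parts = module_name.split(".")
    return any(pattern in parts for pattern in modules_to_not_convert)


excluded = ["gate", "lm_head"]
router_skipped = is_excluded("model.layers.3.mlp.gate", excluded)
expert_skipped = is_excluded("model.layers.3.mlp.experts.0.gate_proj", excluded)
```

Under this rule the router gate is skipped while an expert projection named gate_proj is still quantized, which is why exact component matching is used here instead of substring matching.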

target_op()[source]#: Returns the target quantization ops.

quantize_weight( w: Tensor, ) → Tensor[source]#: Returns the quantized weight from the original unquantized weight.

scale_names() → List[str][source]#: Returns the list of names of the scales for this quantization.

default_scales( original_weight_shape: Tuple, ) → Dict[str, Tensor][source]#: Returns a dict of the default scale values for this quantization.

load_hook( state_dict, prefix, *args, weight_name: str, )[source]#: Load hook to handle HF FineGrainedFP8 checkpoint format.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize FP8 Bmm From Config#

Transform key: quantize_fp8_bmm_from_config

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantization.FP8BMMQuantizationFromConfig( config: TransformConfig, )[source]#

Bases: Quantization

target_op()[source]#: Returns the target quantization ops.

quantize_weight( w: Tensor, ) → Tensor[source]#: Returns the quantized weight from the original unquantized weight.

scale_names() → List[str][source]#: Returns the list of names of the scales for this quantization.

default_scales( _shape: Tuple, ) → Dict[str, Tensor][source]#: Returns a dict of the default scale values for this quantization.

load_hook( state_dict, prefix, *args, weight_name, )[source]#: Pre-hook: Only handle quantization.

post_load_hook( module, incompatible_keys, weight_name, )[source]#: Post-hook: Handle column-major conversion after parameter is loaded.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize FP8 From Graph#

Transform key: quantize_fp8_from_graph

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantization.FP8QuantizationFromGraph( config: TransformConfig, )[source]#

Bases: FP8LinearQuantizationFromConfig

Fuse ModelOpt-quantized FP8 linears into fused ops.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize NVFP4 From Graph#

Transform key: quantize_nvfp4_from_graph

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantization

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantization.NVFP4QuantizationFromGraph( config: TransformConfig, )[source]#

Bases: NVFP4LinearQuantizationFromConfig

Fuse ModelOpt-quantized NVFP4 linears into fused ops.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match SwiGLU Pattern#

Transform key: match_swiglu_pattern

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.MatchSwiGLUPattern( config: TransformConfig, )[source]#

Bases: BaseTransform

Matches SwiGLU MLP patterns and replaces them with the torch_swiglu_mlp op.

This transform runs in the pattern_matcher stage and detects the following pattern:

silu(x @ gate.T) * (x @ up.T) @ down.T

It replaces each match with a single torch_swiglu_mlp op that can be fused later.

Uses ADPatternMatcherPass for declarative pattern matching.
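For reference, the matched expression can be written out as a plain-Python function. This is a sketch of the arithmetic only, using nested lists instead of tensors; it is not the torch_swiglu_mlp op, and the helper names are hypothetical. In Python, * and @ share the same precedence and associate left to right, so the expression groups as (silu(x @ gate.T) * (x @ up.T)) @ down.T.

```python
import math


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matmul_t(x, w):
    """Compute x @ w.T for nested lists, where w has shape [out, in]."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def swiglu_mlp(x, gate, up, down):
    """(silu(x @ gate.T) * (x @ up.T)) @ down.T"""
    g = matmul_t(x, gate)
    u = matmul_t(x, up)
    hidden = [[silu(gv) * uv for gv, uv in zip(g_row, u_row)]
              for g_row, u_row in zip(g, u)]
    return matmul_t(hidden, down)


x = [[1.0, 2.0]]                 # one token, hidden size 2
gate = [[0.0, 0.0], [1.0, 0.0]]  # intermediate size 2
up = [[1.0, 1.0], [0.0, 1.0]]
down = [[1.0, 1.0]]              # output size 1
out = swiglu_mlp(x, gate, up, down)
```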

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match NVFP4 SwiGLU Pattern#

Transform key: match_nvfp4_swiglu_pattern

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.MatchNVFP4SwiGLUPattern( config: TransformConfig, )[source]#

Bases: BaseTransform

Matches NVFP4 quantized SwiGLU MLP patterns and replaces them with torch_nvfp4_swiglu_mlp.

This transform runs in the pattern_matcher stage AFTER quantize_nvfp4_linear_from_config has converted torch_linear_simple ops to torch_fake_quant_nvfp4_linear ops.

It detects the following NVFP4 pattern:

silu(nvfp4_linear(x, gate)) * nvfp4_linear(x, up) -> nvfp4_linear(down)

It replaces each match with a single torch_nvfp4_swiglu_mlp op that can be fused later.

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Match Finegrained FP8 SwiGLU Pattern#

Transform key: match_finegrained_fp8_swiglu_pattern

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_swiglu.MatchFineGrainedFP8SwiGLUPattern( config: TransformConfig, )[source]#

Bases: BaseTransform

Matches FineGrained FP8 quantized SwiGLU MLP patterns.

This transform runs in the pattern_matcher stage AFTER quantize_finegrained_fp8_linear_from_config has converted torch_linear_simple ops to torch_fake_quant_finegrained_fp8_linear ops.

It detects the following FineGrained FP8 pattern:

silu(fp8_linear(x, gate)) * fp8_linear(x, up) -> fp8_linear(down)

It replaces each match with a single torch_finegrained_fp8_swiglu_mlp op that can be fused later.

Note: This transform runs before sharding. The composite SwiGLU op will NOT be sharded by the sharding transform. Enable only when sharding is not needed or when sharding-aware handling is added separately.

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize FP8 MoE#

Transform key: quantize_fp8_moe

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe.QuantizeFP8MOE( config: TransformConfig, )[source]#

Bases: FP8LinearQuantizationFromConfig

Traverse gm, find every torch.ops.auto_deploy.torch_moe, and replace it with the quantized version using the quant_algo from quant_config.

target_op()[source]#: Returns the target quantization ops.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize NVFP4 MoE#

Transform key: quantize_nvfp4_moe

Source module: tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.quantize_moe.QuantizeNVFP4MOE( config: TransformConfig, )[source]#

Bases: NVFP4LinearQuantizationFromConfig

Traverse gm, find every torch.ops.auto_deploy.torch_moe, and replace it with the quantized version using the quant_algo from quant_config.

target_op()[source]#: Returns the target quantization ops.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Quantize MXFP4 MoE#

Transform key: quantize_mxfp4_moe

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4.QuantizeMXFP4MOE( config: TransformConfig, )[source]#

Bases: BaseTransform

Quantize MXFP4 MoE: dispatch to triton or trtllm-gen backend.

Replaces (torch_moe_router -> torch_moe_dense_mlp) with a single fused MoE op. The chosen backend determines the destination op and the parameter layout registered on the experts module:

backend="triton" → auto_deploy::triton_mxfp4_moe with raw HF MXFP4 layout (_blocks / _scales / _bias). Lazy weight swizzling happens inside the Triton kernel on first forward.
backend="trtllm" → auto_deploy::trtllm_quant_mxfp4_*_moe_fused with trtllm-gen prepared layout (fc1_w_trtllm / fc1_w_scale_trtllm / …). Weight preparation (shuffle + interleave) is done on CPU inside a state-dict pre-hook registered by this transform, so the raw HF tensors are converted before being moved to GPU.

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.fused_moe_mxfp4.QuantizeMXFP4MOEConfig[source]

Bases: TransformConfig

Configuration for quantize_mxfp4_moe.

JSON schema:

{
   "title": "QuantizeMXFP4MOEConfig",
   "description": "Configuration for ``quantize_mxfp4_moe``.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "anyOf": [
            {
               "enum": [
                  "triton",
                  "trtllm"
               ],
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "MXFP4 MoE kernel backend selection. When unset (``None``), the default is SM-based: ``trtllm`` on SM>=100 (Blackwell), ``triton`` otherwise. Explicit ``triton`` or ``trtllm`` overrides the default. ``trtllm`` on SM<100 silently falls back to ``triton`` with a warning.",
         "title": "Backend"
      },
      "trtllm_quant_act": {
         "default": "mxfp8",
         "description": "Only used when ``backend='trtllm'``. Activation precision for the trtllm-gen MoE GEMM, passed as ``act_dtype`` to ``trtllm_quant_mxfp4_trtllm_gen_moe_fused``: ``bf16`` dispatches to the bf16 MoE runner (W4A16), ``mxfp8`` pre-quantizes the activation to MXFP8 and dispatches to the MXFP8 MoE runner (W4A8, faster cubin family). Default ``mxfp8`` matches the modeling-side default.",
         "enum": [
            "bf16",
            "mxfp8"
         ],
         "title": "Trtllm Quant Act",
         "type": "string"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}
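The Stages enum in the schema is described as ordered and used to pre-order transforms. The sketch below shows one way such an ordering could be applied; the Enum mirrors the values listed in the schema, while the sort helper and the transform names in the example are hypothetical and not part of the library.

```python
from enum import Enum


class Stages(Enum):
    """Pipeline stages in the order listed in the schema."""
    FACTORY = "factory"
    EXPORT = "export"
    POST_EXPORT = "post_export"
    PATTERN_MATCHER = "pattern_matcher"
    SHARDING = "sharding"
    WEIGHT_LOAD = "weight_load"
    POST_LOAD_FUSION = "post_load_fusion"
    CACHE_INIT = "cache_init"
    VISUALIZE = "visualize"
    COMPILE = "compile"


# Enum members iterate in definition order, which gives each stage its rank.
STAGE_ORDER = {stage.value: rank for rank, stage in enumerate(Stages)}


def sort_by_stage(transforms: dict) -> list:
    """Return transform names sorted by stage; ties keep their original order."""
    return sorted(transforms, key=lambda name: STAGE_ORDER[transforms[name]["stage"]])


ordered = sort_by_stage({
    "example_compile": {"stage": "compile"},
    "example_match_b": {"stage": "pattern_matcher"},
    "example_export": {"stage": "export"},
    "example_match_a": {"stage": "pattern_matcher"},
})
```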

Config:

extra: str = allow

Fields:

backend (Literal['triton', 'trtllm'] | None)
trtllm_quant_act (Literal['bf16', 'mxfp8'])

field backend: Literal['triton', 'trtllm'] | None = None: MXFP4 MoE kernel backend selection. When unset (None), the default is SM-based: trtllm on SM>=100 (Blackwell), triton otherwise. An explicit triton or trtllm overrides the default, except that trtllm on SM<100 falls back to triton and emits a warning.

field trtllm_quant_act: Literal['bf16', 'mxfp8'] = 'mxfp8': Only used when backend='trtllm'. Activation precision for the trtllm-gen MoE GEMM, passed as act_dtype to trtllm_quant_mxfp4_trtllm_gen_moe_fused: bf16 dispatches to the bf16 MoE runner (W4A16), mxfp8 pre-quantizes the activation to MXFP8 and dispatches to the MXFP8 MoE runner (W4A8, faster cubin family). Default mxfp8 matches the modeling-side default.
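The backend selection rules described for the backend field can be summarized as a small function. This is a sketch of the documented behavior, not the transform's code: the function name is hypothetical, and the SM version is passed in as an integer (for example 90 or 100) instead of being queried from the device.

```python
import warnings
from typing import Optional


def resolve_mxfp4_backend(backend: Optional[str], sm: int) -> str:
    """Pick 'triton' or 'trtllm' from the configured value and the SM version."""
    if backend not in (None, "triton", "trtllm"):
        raise ValueError(f"unsupported backend: {backend!r}")
    if backend is None:
        # Unset: SM-based default.
        return "trtllm" if sm >= 100 else "triton"
    if backend == "trtllm" and sm < 100:
        # Explicit trtllm on older hardware falls back with a warning.
        warnings.warn("trtllm backend requested on SM<100; falling back to triton")
        return "triton"
    return backend


default_on_blackwell = resolve_mxfp4_backend(None, sm=100)
default_on_hopper = resolve_mxfp4_backend(None, sm=90)
```

Note that trtllm_quant_act only matters when the resolved backend is trtllm, so a config that sets it on SM<100 hardware would have no effect under these rules.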

Detect Hidden States For Capture#

Transform key: detect_hidden_states_for_capture

Source module: tensorrt_llm._torch.auto_deploy.transform.library.hidden_states

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.hidden_states.DetectHiddenStatesForCapture( config: TransformConfig, )[source]#

Bases: BaseTransform

Detect the hidden states we should capture in the graph.

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.hidden_states.DetectHiddenStatesForCaptureConfig[source]

Bases: TransformConfig

Configuration for the hidden states detection transform.

JSON schema:

{
   "title": "DetectHiddenStatesForCaptureConfig",
   "description": "Configuration for the hidden states detection transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "eagle3_layers_to_capture": {
         "anyOf": [
            {
               "items": {
                  "type": "integer"
               },
               "type": "array",
               "uniqueItems": true
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "title": "Eagle3 Layers To Capture"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:

extra: str = allow

Fields:

eagle3_layers_to_capture (Set[int] | None)

field eagle3_layers_to_capture: Set[int] | None = None

set_default_eagle3_layers_to_capture( num_hidden_layers: int, )[source]: Sets the default layers to capture when hidden-state capture is requested but no layers are provided.