Compilation Stage

Compilation is the final transform stage before execution. Once graph structure, weights, and caches are ready, it applies runtime-oriented and compiler-oriented changes such as multi-stream kernels, final cleanup, and CUDA graph or torch.compile execution.

Fuse Causal Conv Activation

Transform key: fuse_causal_conv_activation

Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_causal_conv

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.fuse_causal_conv.FuseCausalConvActivation(
config: TransformConfig,
)

Bases: BaseTransform

Fuses activation functions into cached CUDA causal_conv1d operations.

This transform detects patterns like:

    conv_out = cuda_cached_causal_conv1d(…)
    out = silu(conv_out)

And replaces them with:

    out = cuda_cached_causal_conv1d(…, activation="silu")

This optimization lets the backend CUDA kernel apply the activation itself, which avoids a separate pass over the convolution output, reduces memory traffic, and improves performance.

Note: This runs AFTER insert_cached_causal_conv, so it operates on the cached CUDA operations, not the uncached torch operations.
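The rewrite described above can be illustrated on a toy graph. This is a minimal sketch, not the transform's actual implementation: the Node class and fuse_conv_activation function are hypothetical, the graph is a plain topologically ordered list rather than a torch.fx graph, and the check that the convolution has exactly one user is an assumption made to keep the rewrite safe.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy graph node (hypothetical): op name, input node names, kwargs."""
    name: str
    op: str
    inputs: list
    kwargs: dict = field(default_factory=dict)


def fuse_conv_activation(nodes):
    """Fold silu(conv_out) into the producing conv node.

    Assumes `nodes` is topologically ordered. The single-user check is an
    assumption of this sketch, not a documented rule of the transform.
    """
    by_name = {n.name: n for n in nodes}

    # Count how many nodes consume each value.
    users = {}
    for n in nodes:
        for inp in n.inputs:
            users[inp] = users.get(inp, 0) + 1

    fused = []
    renamed = {}  # removed activation node -> conv node that replaces it
    for n in nodes:
        # Redirect consumers of a removed activation to the fused conv.
        n.inputs = [renamed.get(i, i) for i in n.inputs]
        if n.op == "silu" and len(n.inputs) == 1:
            producer = by_name.get(n.inputs[0])
            if (
                producer is not None
                and producer.op == "cuda_cached_causal_conv1d"
                and users.get(producer.name, 0) == 1
                and producer.kwargs.get("activation") is None
            ):
                producer.kwargs["activation"] = "silu"
                renamed[n.name] = producer.name
                continue  # drop the standalone activation node
        fused.append(n)
    return fused


graph = [
    Node("x", "placeholder", []),
    Node("conv_out", "cuda_cached_causal_conv1d", ["x"]),
    Node("out", "silu", ["conv_out"]),
    Node("y", "output", ["out"]),
]
result = fuse_conv_activation(graph)
```

After the pass, the silu node is gone, the conv node carries activation="silu", and the output node reads directly from the conv node.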

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Multi Stream MoE

Transform key: multi_stream_moe

Source module: tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_moe

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_moe.MultiStreamMOE(
config: TransformConfig,
)

Bases: BaseTransform

Multi-stream execution of MoE layers that have shared experts and routed experts.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Multi Stream MLA Attn

Transform key: multi_stream_mla_attn

Source module: tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_attn

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_attn.MultiStreamMLAAttn(
config: TransformConfig,
)

Bases: BaseTransform

Multi-stream Q/KV parallelism for MLA attention blocks.

  • Pattern 0: Full KV path overlap for unfused Q/KV GEMMs (begin/end aux).

  • Pattern 1: Overlaps KV projection linear with Q projection chain (fallback).

Pattern 0 is tried first; if it matches (unfused graph), pattern 1 is skipped. If pattern 0 finds nothing (fused graph), pattern 1 runs as fallback.
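The try-then-fall-back ordering can be sketched as a small driver. This is only an illustration of the control flow: apply_with_fallback, the two matcher functions, and the dictionary standing in for a graph are all hypothetical and do not reflect the transform's real matcher signatures.

```python
def apply_with_fallback(graph, patterns):
    """Try patterns in order and stop at the first one that matches.

    `patterns` is a list of callables that take the graph and return the
    number of matches they rewrote (a hypothetical signature for this sketch).
    Returns (index of the pattern that matched, number of matches).
    """
    for index, pattern in enumerate(patterns):
        num_matches = pattern(graph)
        if num_matches > 0:
            return index, num_matches
    return None, 0


def match_unfused(graph):
    # Stand-in for pattern 0: only finds work in an unfused graph.
    return graph.get("unfused_kv_paths", 0)


def match_fused(graph):
    # Stand-in for pattern 1: the fallback for a fused graph.
    return graph.get("fused_kv_projections", 0)


patterns = [match_unfused, match_fused]
unfused_result = apply_with_fallback({"unfused_kv_paths": 2}, patterns)
fused_result = apply_with_fallback({"fused_kv_projections": 3}, patterns)
```

With an unfused graph, pattern 0 matches and pattern 1 never runs; with a fused graph, pattern 0 finds nothing and pattern 1 handles the matches.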

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Multi Stream Gemm

Transform key: multi_stream_gemm

Source module: tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_gemm

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_gemm.MultiStreamGemm(
config: TransformConfig,
)

Bases: BaseTransform

Multi-stream parallelization of fp8 GEMMs sharing the same input.

For each fork point where 2+ fp8 linear ops share the same input tensor, the largest GEMM (by weight shape) is moved to the auxiliary CUDA stream so it executes concurrently with the remaining GEMMs on the main stream.
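The selection rule can be sketched in plain Python. This is a hedged illustration, not the transform's code: pick_aux_stream_gemms and the layer names are hypothetical, each linear op is reduced to a (name, input, weight shape) tuple, and "largest by weight shape" is interpreted here as the largest element count, which is an assumption.

```python
from collections import defaultdict
from math import prod


def pick_aux_stream_gemms(linears):
    """Choose, per shared input, the GEMM to move to the auxiliary stream.

    `linears` is a list of (name, input_name, weight_shape) tuples.
    Returns a dict mapping each fork-point input to the chosen op name.
    """
    groups = defaultdict(list)
    for name, input_name, weight_shape in linears:
        groups[input_name].append((name, weight_shape))

    aux = {}
    for input_name, members in groups.items():
        if len(members) < 2:
            continue  # not a fork point: nothing to overlap
        # Assumption: size is measured as the number of weight elements.
        largest = max(members, key=lambda m: prod(m[1]))
        aux[input_name] = largest[0]
    return aux


# Hypothetical layer names and shapes, for illustration only.
linears = [
    ("q_a_proj", "hidden", (1536, 7168)),
    ("kv_a_proj", "hidden", (576, 7168)),
    ("o_proj", "attn_out", (7168, 16384)),
]
aux_choice = pick_aux_stream_gemms(linears)
```

Here only "hidden" feeds two GEMMs, so it is the only fork point, and the larger of its two consumers is selected for the auxiliary stream.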

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Cleanup Identity Dtype Cast

Transform key: cleanup_identity_dtype_cast

Source module: tensorrt_llm._torch.auto_deploy.transform.library.cleanup_identity_dtype_cast

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.cleanup_identity_dtype_cast.CleanupIdentityDtypeCast(
config: TransformConfig,
)

Bases: BaseTransform

Remove identity dtype casts where input dtype already matches the target dtype.

Handles the three dtype-cast spellings commonly produced by tracing / functionalization:

  • aten.to.dtype(self, dtype, ...): explicit .to(dtype) calls.

  • aten._to_copy.default(self, dtype=..., ...): functionalized form that always materializes a copy.

  • prims.convert_element_type.default(self, dtype): canonical torch.compile / torch.export primitive for dtype conversion.

Elimination only proceeds when the cast is semantically an identity: source dtype equals target dtype and no other observable attribute (copy flag, layout, device, pin_memory, memory_format) diverges from the input.
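The identity condition can be written as a predicate. The sketch below is an assumption-laden illustration rather than the transform's logic: TensorMeta and is_identity_cast are hypothetical, dtypes and devices are plain strings, and it conservatively treats any requested pin_memory or non-preserving memory_format as a divergence.

```python
from dataclasses import dataclass


@dataclass
class TensorMeta:
    """Hypothetical stand-in for the metadata of the cast's input tensor."""
    dtype: str
    device: str = "cuda:0"
    layout: str = "strided"


def is_identity_cast(src, dtype, *, copy=False, device=None, layout=None,
                     pin_memory=None, memory_format=None):
    """Return True only when removing the cast cannot change behavior."""
    if src.dtype != dtype:
        return False  # a real dtype conversion
    if copy:
        return False  # caller explicitly asked for a fresh copy
    if device is not None and device != src.device:
        return False  # the cast also moves the tensor
    if layout is not None and layout != src.layout:
        return False
    if pin_memory:
        return False  # conservative assumption of this sketch
    if memory_format not in (None, "preserve_format"):
        return False  # conservative assumption of this sketch
    return True


x = TensorMeta(dtype="bfloat16")
same_dtype = is_identity_cast(x, "bfloat16")
real_cast = is_identity_cast(x, "float32")
forced_copy = is_identity_cast(x, "bfloat16", copy=True)
moved = is_identity_cast(x, "bfloat16", device="cpu")
```

Only the first call qualifies for removal; the other three differ from the input in dtype, copy flag, or device.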

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Compile Model

Transform key: compile_model

Source module: tensorrt_llm._torch.auto_deploy.transform.library.compile_model

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.compile_model.CompileModel(
config: TransformConfig,
)

Bases: BaseTransform

Compiles the model using the backend selected in the transform's configuration.

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.compile_model.CompileModelConfig

Bases: TransformConfig

Configuration for the compile model transform.

JSON schema:
{
   "title": "CompileModelConfig",
   "description": "Configuration for the compile model transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "cuda_graph_batch_sizes": {
         "anyOf": [
            {
               "items": {
                  "type": "integer"
               },
               "type": "array"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "The batch sizes to use for CUDA graphs.",
         "title": "Cuda Graph Batch Sizes"
      },
      "num_batched_inputs": {
         "anyOf": [
            {
               "minimum": 1,
               "type": "integer"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "The number of batched inputs to use for CUDA graphs. If unset, infer it from runtime inputs by excluding explicit cache/resource inputs.",
         "title": "Num Batched Inputs"
      },
      "backend": {
         "description": "The backend to use for compiling the model.",
         "enum": [
            "torch-simple",
            "torch-compile",
            "torch-cudagraph",
            "torch-opt"
         ],
         "title": "Backend",
         "type": "string"
      },
      "piecewise_enabled": {
         "default": false,
         "description": "Enable piecewise CUDA graph for prefill/mixed batches (dual-mode).",
         "title": "Piecewise Enabled",
         "type": "boolean"
      },
      "piecewise_num_tokens": {
         "anyOf": [
            {
               "items": {
                  "type": "integer"
               },
               "type": "array"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Total token counts to pre-capture piecewise CUDA graphs for. If null and piecewise_enabled=true, auto-generates power-of-2 buckets up to max_num_tokens (e.g. [64, 128, 256, ..., max_num_tokens]).",
         "title": "Piecewise Num Tokens"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage",
      "backend"
   ]
}
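As a rough illustration of the schema above, the following sketch builds a compile_model entry as a Python dict (the form it takes once the YAML is loaded) and checks the constraints the schema states explicitly. The helper check_compile_model_entry and the example values are hypothetical; the real validation is done by the pydantic model, and the validate_piecewise_backend validator is not reproduced here.

```python
# Values copied from the "backend" enum and "required" list in the schema.
ALLOWED_BACKENDS = ("torch-simple", "torch-compile", "torch-cudagraph", "torch-opt")
REQUIRED_FIELDS = ("stage", "backend")


def check_compile_model_entry(entry):
    """Return a list of problems found in a compile_model config entry."""
    errors = []
    for key in REQUIRED_FIELDS:
        if key not in entry:
            errors.append(f"missing required field: {key}")

    backend = entry.get("backend")
    if backend is not None and backend not in ALLOWED_BACKENDS:
        errors.append(f"unsupported backend: {backend}")

    num_batched_inputs = entry.get("num_batched_inputs")
    if num_batched_inputs is not None and num_batched_inputs < 1:
        errors.append("num_batched_inputs must be >= 1")
    return errors


# Example entry; the chosen values are illustrative, not recommendations.
entry = {
    "stage": "compile",
    "backend": "torch-cudagraph",
    "cuda_graph_batch_sizes": [1, 2, 4, 8],
    "piecewise_enabled": False,
}
entry_errors = check_compile_model_entry(entry)
bad_errors = check_compile_model_entry({"backend": "tensorrt", "num_batched_inputs": 0})
```

Because the schema sets additionalProperties to true, unknown keys are not flagged by this check.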

Config:
  • extra: str = allow

Fields:
  • backend (Literal['torch-simple', 'torch-compile', 'torch-cudagraph', 'torch-opt'])

  • cuda_graph_batch_sizes (List[int] | None)

  • num_batched_inputs (int | None)

  • piecewise_enabled (bool)

  • piecewise_num_tokens (List[int] | None)

Validators:
  • validate_piecewise_backend » all fields

field backend: Literal['torch-simple', 'torch-compile', 'torch-cudagraph', 'torch-opt'] [Required]

The backend to use for compiling the model.

field cuda_graph_batch_sizes: List[int] | None = None

The batch sizes to use for CUDA graphs.

field num_batched_inputs: int | None = None

The number of batched inputs to use for CUDA graphs. If unset, infer it from runtime inputs by excluding explicit cache/resource inputs.

Constraints:
  • ge = 1

field piecewise_enabled: bool = False

Enable piecewise CUDA graph for prefill/mixed batches (dual-mode).

field piecewise_num_tokens: List[int] | None = None

Total token counts to pre-capture piecewise CUDA graphs for. If null and piecewise_enabled=true, auto-generates power-of-2 buckets up to max_num_tokens (e.g. [64, 128, 256, …, max_num_tokens]).
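The auto-generated bucket list can be approximated as follows. This is a sketch inferred only from the example in the description: the function name default_piecewise_buckets is hypothetical, and both the starting size of 64 and the choice to end the list with max_num_tokens itself are assumptions that may differ from the library's behavior.

```python
def default_piecewise_buckets(max_num_tokens, smallest=64):
    """Build power-of-2 token buckets that end at max_num_tokens.

    Assumptions of this sketch: buckets start at `smallest` and the final
    entry is max_num_tokens even when it is not a power of 2.
    """
    buckets = []
    size = smallest
    while size < max_num_tokens:
        buckets.append(size)
        size *= 2
    buckets.append(max_num_tokens)
    return buckets


pow2_buckets = default_piecewise_buckets(1024)
odd_buckets = default_piecewise_buckets(1000)
```

For max_num_tokens of 1024 this yields 64, 128, 256, 512, 1024; for 1000 the last bucket is 1000 rather than a power of 2.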

validator validate_piecewise_backend  »  all fields