Compilation Stage
Compilation is the final transform stage before execution. Once graph structure,
weights, and caches are ready, it applies runtime-oriented and compiler-oriented
changes such as multi-stream kernels, final cleanup, and CUDA graph or
torch.compile execution.
Fuse Causal Conv Activation
Transform key: fuse_causal_conv_activation
Source module: tensorrt_llm._torch.auto_deploy.transform.library.fuse_causal_conv
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.fuse_causal_conv.FuseCausalConvActivation(config: TransformConfig)
Bases: BaseTransform
Fuses activation functions into cached CUDA causal_conv1d operations.
- This transform detects patterns like:
conv_out = cuda_cached_causal_conv1d(…)
out = silu(conv_out)
- And replaces them with:
out = cuda_cached_causal_conv1d(…, activation="silu")
This optimization allows the backend CUDA kernels to fuse the activation, reducing memory bandwidth and improving performance.
Note: This runs AFTER insert_cached_causal_conv, so it operates on the cached CUDA operations, not the uncached torch operations.
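The rewrite can be illustrated on a toy node list. This is a minimal sketch, not the transform's actual implementation: the `Node` dataclass and the `fuse_silu` helper are hypothetical stand-ins for real graph nodes, and the rule that the conv output must have exactly one user is an assumption made here to keep the rewrite safe. Only the op names `cuda_cached_causal_conv1d` and `silu` and the `activation="silu"` argument come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a graph node (not a torch.fx.Node)."""
    name: str
    op: str
    inputs: Tuple[str, ...] = ()
    kwargs: Dict[str, str] = field(default_factory=dict)


def fuse_silu(nodes: List[Node]) -> List[Node]:
    """Fold silu(conv_out) into the conv node that produces conv_out."""
    by_name = {n.name: n for n in nodes}

    # Count how many nodes consume each value.
    users: Dict[str, int] = {}
    for n in nodes:
        for i in n.inputs:
            users[i] = users.get(i, 0) + 1

    # Map each fusable conv node to the silu node that consumes it.
    fusable: Dict[str, str] = {}
    for n in nodes:
        if n.op != "silu" or len(n.inputs) != 1:
            continue
        producer = by_name.get(n.inputs[0])
        if (producer is not None
                and producer.op == "cuda_cached_causal_conv1d"
                and users[producer.name] == 1  # assumption: single user only
                and "activation" not in producer.kwargs):
            fusable[producer.name] = n.name

    dropped = set(fusable.values())
    renamed = {silu: conv for conv, silu in fusable.items()}

    out: List[Node] = []
    for n in nodes:
        if n.name in dropped:
            continue  # the standalone silu node disappears
        kwargs = dict(n.kwargs)
        if n.name in fusable:
            kwargs["activation"] = "silu"
        # Consumers of the silu output now read the fused conv output.
        inputs = tuple(renamed.get(i, i) for i in n.inputs)
        out.append(Node(n.name, n.op, inputs, kwargs))
    return out


graph = [
    Node("x", "placeholder"),
    Node("conv_out", "cuda_cached_causal_conv1d", ("x",)),
    Node("out", "silu", ("conv_out",)),
    Node("y", "output", ("out",)),
]
fused_graph = fuse_silu(graph)
```

After the call, the `silu` node is gone, the conv node carries `activation="silu"`, and the output node reads directly from the conv node.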
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Multi Stream MoE
Transform key: multi_stream_moe
Source module: tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_moe
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_moe.MultiStreamMOE(config: TransformConfig)
Bases: BaseTransform
Multi-stream execution of MoE layers that have shared experts and routed experts.
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Multi Stream MLA Attn
Transform key: multi_stream_mla_attn
Source module: tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_attn
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_attn.MultiStreamMLAAttn(config: TransformConfig)
Bases: BaseTransform
Multi-stream Q/KV parallelism for MLA attention blocks.
- Pattern 0: Full KV path overlap for unfused Q/KV GEMMs (begin/end aux).
- Pattern 1: Overlaps KV projection linear with Q projection chain (fallback).
Pattern 0 is tried first; if it matches (unfused graph), pattern 1 is skipped. If pattern 0 finds nothing (fused graph), pattern 1 runs as fallback.
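The try-then-fall-back ordering can be sketched as follows. This is an illustrative outline only: `apply_patterns`, `count_op`, and the op names `unfused_kv_gemm` and `fused_qkv_gemm` are hypothetical placeholders, and each matcher simply counts nodes instead of rewriting a graph.

```python
from typing import Callable, List, Tuple

# A matcher takes a list of op names and returns how many matches it found.
Matcher = Callable[[List[str]], int]


def apply_patterns(graph_ops: List[str],
                   pattern0: Matcher,
                   pattern1: Matcher) -> Tuple[int, int]:
    """Return (pattern_id, num_matches); pattern 1 runs only if pattern 0 finds nothing."""
    matches = pattern0(graph_ops)
    if matches > 0:
        return 0, matches
    return 1, pattern1(graph_ops)


def count_op(op_name: str) -> Matcher:
    """Build a toy matcher that counts occurrences of one op name."""
    return lambda ops: sum(1 for op in ops if op == op_name)


# Hypothetical op names standing in for the unfused and fused graph shapes.
match_unfused = count_op("unfused_kv_gemm")
match_fused = count_op("fused_qkv_gemm")

unfused_graph = ["unfused_kv_gemm", "attention", "unfused_kv_gemm", "attention"]
fused_graph = ["fused_qkv_gemm", "attention"]

result_unfused = apply_patterns(unfused_graph, match_unfused, match_fused)
result_fused = apply_patterns(fused_graph, match_unfused, match_fused)
```

On the unfused toy graph pattern 0 matches and pattern 1 never runs; on the fused toy graph pattern 0 finds nothing, so pattern 1 is applied.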
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Multi Stream Gemm
Transform key: multi_stream_gemm
Source module: tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_gemm
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.multi_stream_gemm.MultiStreamGemm(config: TransformConfig)
Bases: BaseTransform
Multi-stream parallelization of fp8 GEMMs sharing the same input.
For each fork point where 2+ fp8 linear ops share the same input tensor, the largest GEMM (by weight shape) is moved to the auxiliary CUDA stream so it executes concurrently with the remaining GEMMs on the main stream.
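The selection rule can be sketched in plain Python. This is a simplified model, not the transform's code: `Linear` and `pick_aux_stream_gemms` are hypothetical names, the weight shapes are made up, and "largest by weight shape" is assumed here to mean the largest number of weight elements.

```python
from collections import defaultdict
from math import prod
from typing import Dict, List, NamedTuple, Tuple


class Linear(NamedTuple):
    """Hypothetical record for one fp8 linear op in the graph."""
    name: str
    input_name: str
    weight_shape: Tuple[int, int]


def pick_aux_stream_gemms(linears: List[Linear]) -> Dict[str, str]:
    """For each input shared by 2+ linears, name the GEMM to move to the aux stream."""
    by_input: Dict[str, List[Linear]] = defaultdict(list)
    for lin in linears:
        by_input[lin.input_name].append(lin)
    return {
        inp: max(group, key=lambda lin: prod(lin.weight_shape)).name
        for inp, group in by_input.items()
        if len(group) >= 2  # a fork point needs at least two consumers
    }


linears = [
    Linear("q_proj", "hidden", (4096, 4096)),
    Linear("k_proj", "hidden", (1024, 4096)),
    Linear("v_proj", "hidden", (1024, 4096)),
    Linear("o_proj", "attn_out", (4096, 4096)),  # sole user of its input: no fork
]
aux_choice = pick_aux_stream_gemms(linears)
```

Here `q_proj` is chosen for the auxiliary stream at the `hidden` fork point, while `o_proj` is left alone because nothing else shares its input.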
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Cleanup Identity Dtype Cast
Transform key: cleanup_identity_dtype_cast
Source module: tensorrt_llm._torch.auto_deploy.transform.library.cleanup_identity_dtype_cast
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.cleanup_identity_dtype_cast.CleanupIdentityDtypeCast(config: TransformConfig)
Bases: BaseTransform
Remove identity dtype casts where input dtype already matches the target dtype.
Handles the three dtype-cast spellings commonly produced by tracing / functionalization:
- aten.to.dtype(self, dtype, ...): explicit .to(dtype) calls.
- aten._to_copy.default(self, dtype=..., ...): functionalized form that always materializes a copy.
- prims.convert_element_type.default(self, dtype): canonical torch.compile / torch.export primitive for dtype conversion.
Elimination only proceeds when the cast is semantically an identity: source dtype equals target dtype and no other observable attribute (copy flag, layout, device, pin_memory, memory_format) diverges from the input.
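The identity check can be sketched as a predicate over plain dictionaries. This is an assumption-laden illustration, not the transform's logic: `is_identity_cast` is a hypothetical helper, tensor metadata is modeled as a dict of strings, and an explicit `copy=True` is treated here as making the cast observable.

```python
from typing import Any, Dict, Optional


def is_identity_cast(input_meta: Dict[str, Any],
                     target_dtype: str,
                     cast_kwargs: Optional[Dict[str, Any]] = None) -> bool:
    """Return True when the cast changes neither dtype nor any other observable attribute."""
    cast_kwargs = cast_kwargs or {}
    if input_meta["dtype"] != target_dtype:
        return False
    # Assumption: a requested copy is observable, so the cast is kept.
    if cast_kwargs.get("copy", False):
        return False
    # Any explicitly requested attribute must match the input's attribute.
    for attr in ("layout", "device", "pin_memory", "memory_format"):
        requested = cast_kwargs.get(attr)
        if requested is not None and requested != input_meta.get(attr):
            return False
    return True


meta = {"dtype": "bfloat16", "device": "cuda:0", "layout": "strided"}

same_dtype = is_identity_cast(meta, "bfloat16")
other_dtype = is_identity_cast(meta, "float16")
other_device = is_identity_cast(meta, "bfloat16", {"device": "cpu"})
forced_copy = is_identity_cast(meta, "bfloat16", {"copy": True})
```

Only the first call describes a removable cast; the other three each differ in dtype, device, or copy behavior.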
YAML configuration
Uses the common TransformConfig fields documented in Core Transform APIs.
Compile Model
Transform key: compile_model
Source module: tensorrt_llm._torch.auto_deploy.transform.library.compile_model
Configured modes: graph
- class tensorrt_llm._torch.auto_deploy.transform.library.compile_model.CompileModel(config: TransformConfig)
Bases: BaseTransform
A transform to compile the model.
- classmethod get_config_class() → Type[TransformConfig]
Get the configuration class for the transform.
This is used to validate the configuration of the transform.
YAML configuration
The fields below can be set under this transform’s entry in the AutoDeploy config YAML.
- pydantic model tensorrt_llm._torch.auto_deploy.transform.library.compile_model.CompileModelConfig
Bases: TransformConfig
Configuration for the compile model transform.
JSON schema:
{
  "title": "CompileModelConfig",
  "description": "Configuration for the compile model transform.",
  "type": "object",
  "properties": {
    "stage": { "$ref": "#/$defs/Stages", "description": "The stage of the transformation pipeline where this transform should run." },
    "run_per_gm": { "default": true, "description": "Whether to run the transform per graph (sub)module or on whole module.", "title": "Run Per Gm", "type": "boolean" },
    "enabled": { "default": true, "description": "Whether to enable this transform.", "title": "Enabled", "type": "boolean" },
    "skip_on_error": { "default": false, "description": "Whether to skip the transform if an error occurs.", "title": "Skip On Error", "type": "boolean" },
    "run_graph_cleanup": { "default": true, "description": "Whether to run graph cleanup/canonicalization after this transform.", "title": "Run Graph Cleanup", "type": "boolean" },
    "run_shape_prop": { "default": false, "description": "Whether to run shape propagation after this transform.", "title": "Run Shape Prop", "type": "boolean" },
    "requires_clean_graph": { "default": true, "description": "Whether this transform requires the graph to be clean before it is applied.", "title": "Requires Clean Graph", "type": "boolean" },
    "requires_shape_prop": { "default": false, "description": "Whether this transform requires shape propagation before it is applied.", "title": "Requires Shape Prop", "type": "boolean" },
    "debug_visualize_dir": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.", "title": "Debug Visualize Dir" },
    "expect_mem_change": { "default": false, "description": "Whether this transform is expected to cause changes in CUDA memory stats.", "title": "Expect Mem Change", "type": "boolean" },
    "cuda_graph_batch_sizes": { "anyOf": [ { "items": { "type": "integer" }, "type": "array" }, { "type": "null" } ], "default": null, "description": "The batch sizes to use for CUDA graphs.", "title": "Cuda Graph Batch Sizes" },
    "num_batched_inputs": { "anyOf": [ { "minimum": 1, "type": "integer" }, { "type": "null" } ], "default": null, "description": "The number of batched inputs to use for CUDA graphs. If unset, infer it from runtime inputs by excluding explicit cache/resource inputs.", "title": "Num Batched Inputs" },
    "backend": { "description": "The backend to use for compiling the model.", "enum": [ "torch-simple", "torch-compile", "torch-cudagraph", "torch-opt" ], "title": "Backend", "type": "string" },
    "piecewise_enabled": { "default": false, "description": "Enable piecewise CUDA graph for prefill/mixed batches (dual-mode).", "title": "Piecewise Enabled", "type": "boolean" },
    "piecewise_num_tokens": { "anyOf": [ { "items": { "type": "integer" }, "type": "array" }, { "type": "null" } ], "default": null, "description": "Total token counts to pre-capture piecewise CUDA graphs for. If null and piecewise_enabled=true, auto-generates power-of-2 buckets up to max_num_tokens (e.g. [64, 128, 256, ..., max_num_tokens]).", "title": "Piecewise Num Tokens" }
  },
  "$defs": {
    "Stages": { "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.", "enum": [ "factory", "export", "post_export", "pattern_matcher", "sharding", "weight_load", "post_load_fusion", "cache_init", "visualize", "compile" ], "title": "Stages", "type": "string" }
  },
  "additionalProperties": true,
  "required": [ "stage", "backend" ]
}
- Config:
extra: str = allow
- Fields:
backend (Literal['torch-simple', 'torch-compile', 'torch-cudagraph', 'torch-opt'])
cuda_graph_batch_sizes (List[int] | None)
num_batched_inputs (int | None)
piecewise_enabled (bool)
piecewise_num_tokens (List[int] | None)
- Validators:
validate_piecewise_backend » all fields
- field backend: Literal['torch-simple', 'torch-compile', 'torch-cudagraph', 'torch-opt'] [Required]
The backend to use for compiling the model.
- field cuda_graph_batch_sizes: List[int] | None = None
The batch sizes to use for CUDA graphs.
- field num_batched_inputs: int | None = None
The number of batched inputs to use for CUDA graphs. If unset, infer it from runtime inputs by excluding explicit cache/resource inputs.
- Constraints:
ge = 1
- field piecewise_enabled: bool = False
Enable piecewise CUDA graph for prefill/mixed batches (dual-mode).
- field piecewise_num_tokens: List[int] | None = None
Total token counts to pre-capture piecewise CUDA graphs for. If null and piecewise_enabled=true, auto-generates power-of-2 buckets up to max_num_tokens (e.g. [64, 128, 256, …, max_num_tokens]).
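One plausible reading of the auto-generated buckets is sketched below. The helper `default_piecewise_buckets` is hypothetical; the starting value of 64 is taken from the example in the field description, and appending `max_num_tokens` itself as the last bucket when it is not a power of two is an assumption.

```python
from typing import List


def default_piecewise_buckets(max_num_tokens: int, start: int = 64) -> List[int]:
    """Doubling buckets from `start`, always ending at max_num_tokens."""
    buckets: List[int] = []
    n = start
    while n < max_num_tokens:
        buckets.append(n)
        n *= 2
    # Assumption: the largest bucket is max_num_tokens itself.
    buckets.append(max_num_tokens)
    return buckets


buckets_pow2 = default_piecewise_buckets(1024)
buckets_odd = default_piecewise_buckets(1000)
```

With `max_num_tokens=1024` the sketch yields `[64, 128, 256, 512, 1024]`; with `max_num_tokens=1000` the last entry becomes 1000 instead.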
- validator validate_piecewise_backend » all fields
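The constraints stated in the schema can be checked with a few lines of standard-library Python. This is a sketch under stated assumptions: `check_compile_model_entry` is a hypothetical helper that mirrors only the required fields, the backend enum, and the `ge = 1` constraint listed above. It does not reproduce `validate_piecewise_backend`, whose rules are not described here, and the example values are illustrative.

```python
from typing import Any, Dict, List

# Backend names copied from the schema's enum.
BACKENDS = ("torch-simple", "torch-compile", "torch-cudagraph", "torch-opt")


def check_compile_model_entry(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a compile_model config entry."""
    problems: List[str] = []
    for key in ("stage", "backend"):  # the schema's required fields
        if key not in entry:
            problems.append(f"missing required field: {key}")
    backend = entry.get("backend")
    if backend is not None and backend not in BACKENDS:
        problems.append(f"unknown backend: {backend}")
    nbi = entry.get("num_batched_inputs")
    if nbi is not None and (not isinstance(nbi, int) or nbi < 1):
        problems.append("num_batched_inputs must be an integer >= 1")
    for key in ("cuda_graph_batch_sizes", "piecewise_num_tokens"):
        value = entry.get(key)
        if value is not None and not (
            isinstance(value, list) and all(isinstance(v, int) for v in value)
        ):
            problems.append(f"{key} must be a list of integers or None")
    return problems


# Illustrative entry, written as the dict a YAML loader would produce.
entry = {
    "stage": "compile",
    "backend": "torch-cudagraph",
    "cuda_graph_batch_sizes": [1, 2, 4, 8],
    "piecewise_enabled": False,
}
entry_problems = check_compile_model_entry(entry)
bad_problems = check_compile_model_entry({"backend": "onnx", "num_batched_inputs": 0})
```

The first entry passes with no problems; the second is flagged for the missing `stage`, the unknown backend, and the out-of-range `num_batched_inputs`.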