Cache Initialization Stage

Cache initialization rewrites attention and recurrent state operations for cached inference. This stage prepares runtime cache resources such as KV-cache storage, SSM state, residual hidden-state capture, and model-specific cache metadata.

Insert Cached Attention

Transform key: insert_cached_attention

Source module: tensorrt_llm._torch.auto_deploy.transform.library.kvcache

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.kvcache.InsertCachedAttention(config: TransformConfig)

Bases: _InsertCachedOperator

A transform to insert cached attention into the graph module.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.kvcache.InsertCachedAttentionConfig

Bases: TransformConfig

Configuration for the insert cached attention transform.

JSON schema:

{
   "title": "InsertCachedAttentionConfig",
   "description": "Configuration for the insert cached attention transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "The attention backend to use.",
         "title": "Backend"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}
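
To illustrate how the schema above constrains a transform entry, here is a minimal sketch in plain Python using only the standard library. It is not AutoDeploy's validation code, which relies on the pydantic model; the helper name `check_transform_entry` and the sample entry are hypothetical, and the sketch checks types strictly rather than coercing values.

```python
from typing import Any, Dict

# Stage names copied from the "Stages" enum in the schema above.
STAGES = [
    "factory", "export", "post_export", "pattern_matcher", "sharding",
    "weight_load", "post_load_fusion", "cache_init", "visualize", "compile",
]

# Boolean fields and their documented defaults.
BOOL_DEFAULTS = {
    "run_per_gm": True,
    "enabled": True,
    "skip_on_error": False,
    "run_graph_cleanup": True,
    "run_shape_prop": False,
    "requires_clean_graph": True,
    "requires_shape_prop": False,
    "expect_mem_change": False,
}

# Fields that accept either a string or null and default to null.
OPTIONAL_STR_FIELDS = ("debug_visualize_dir", "backend")


def check_transform_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: check an entry against the schema and fill defaults."""
    if "stage" not in entry:
        raise ValueError("'stage' is the only required field")
    if entry["stage"] not in STAGES:
        raise ValueError(f"unknown stage: {entry['stage']!r}")

    # Unknown keys are kept because additionalProperties is true.
    resolved = dict(entry)
    for name, default in BOOL_DEFAULTS.items():
        value = resolved.setdefault(name, default)
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a boolean")
    for name in OPTIONAL_STR_FIELDS:
        value = resolved.setdefault(name, None)
        if value is not None and not isinstance(value, str):
            raise TypeError(f"{name} must be a string or null")
    return resolved


# Hypothetical entry, as it might look once the YAML has been parsed.
example = check_transform_entry({"stage": "cache_init", "backend": "flashinfer"})
```

Only `stage` has to be supplied; every other field falls back to the default listed in the schema.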

Config:

extra: str = allow

Fields:

backend (str | None)

field backend: str | None = None: The attention backend to use.
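
The `Stages` enum in the schema is described as ordered and as being used to pre-order transforms. The sketch below shows one way to read that, assuming transforms are simply sorted by the position of their stage in the enum. The helper `order_by_stage` and the transform keys marked as hypothetical are illustrative only, and the real pipeline may apply additional rules.

```python
from typing import Dict, List

# Stage names in the order given by the "Stages" enum.
STAGES = [
    "factory", "export", "post_export", "pattern_matcher", "sharding",
    "weight_load", "post_load_fusion", "cache_init", "visualize", "compile",
]


def order_by_stage(transforms: Dict[str, Dict[str, str]]) -> List[str]:
    """Hypothetical helper: return transform keys sorted by stage position."""
    rank = {name: index for index, name in enumerate(STAGES)}
    # sorted() is stable, so transforms sharing a stage keep their listed order.
    return sorted(transforms, key=lambda key: rank[transforms[key]["stage"]])


configured = {
    "my_compile_step": {"stage": "compile"},  # hypothetical transform key
    "insert_cached_attention": {"stage": "cache_init"},
    "insert_cached_ssm_attention": {"stage": "cache_init"},
    "my_export_step": {"stage": "export"},  # hypothetical transform key
}
ordered = order_by_stage(configured)
```

With this ordering, the cache initialization transforms run after export and before compile, regardless of where they appear in the configuration.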

Insert Cached MLA Attention

Transform key: insert_cached_mla_attention

Source module: tensorrt_llm._torch.auto_deploy.transform.library.kvcache

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.kvcache.InsertCachedMLAAttention(config: TransformConfig)

Bases: _InsertCachedOperator

A transform to insert cached MLA attention into the graph module.

classmethod resolve_backend_for_node(requested_backend: str | None, source_attn_node: Node) → str

Resolve the MLA backend for a node based on shape and local GPU support.

AutoDeploy’s current FlashInfer MLA integration is the Path 1 BatchMLAPagedAttentionWrapper route. That path is only validated for the DeepSeek-style shape contract on Hopper+ today, so unsupported MLA variants must fall back to the torch backend before cache insertion.
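
The fallback rule described above can be sketched as follows. This is a simplified stand-in, not the body of `resolve_backend_for_node`: the real method inspects a graph `Node`, whereas this sketch takes a caller-supplied boolean for the shape check and the GPU's major compute capability. The function name, its parameters, and the choice of `torch` when no backend is requested are all assumptions.

```python
from typing import Optional

HOPPER_SM_MAJOR = 9  # Hopper GPUs report compute capability 9.x


def resolve_mla_backend_sketch(
    requested_backend: Optional[str],
    has_deepseek_style_shapes: bool,
    sm_major: int,
    default_backend: str = "torch",  # assumption: used when nothing is requested
) -> str:
    """Hypothetical stand-in for the fallback rule described above."""
    backend = requested_backend if requested_backend is not None else default_backend
    if backend != "flashinfer":
        # Only the FlashInfer path has the shape and GPU restrictions.
        return backend
    supported = has_deepseek_style_shapes and sm_major >= HOPPER_SM_MAJOR
    return "flashinfer" if supported else "torch"


on_hopper = resolve_mla_backend_sketch("flashinfer", True, 9)
on_ampere = resolve_mla_backend_sketch("flashinfer", True, 8)
other_shape = resolve_mla_backend_sketch("flashinfer", False, 10)
```

The point of resolving per node is that a single model can mix MLA variants, so the decision has to be made before the cached operator is inserted.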

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.kvcache.InsertCachedAttentionConfig

Bases: TransformConfig

Configuration for the insert cached attention transform.

The JSON schema for this model is identical to the one shown under Insert Cached Attention above.

Config:

extra: str = allow

Fields:

backend (str | None)
debug_visualize_dir (str | None)
enabled (bool)
expect_mem_change (bool)
requires_clean_graph (bool)
requires_shape_prop (bool)
run_graph_cleanup (bool)
run_per_gm (bool)
run_shape_prop (bool)
skip_on_error (bool)
stage (Stages)

field backend: str | None = None: The attention backend to use.

field debug_visualize_dir: str | None = None: Debug visualization directory. None to disable visualization, or a path string to specify the output directory.

field enabled: bool = True: Whether to enable this transform.

field expect_mem_change: bool = False: Whether this transform is expected to cause changes in CUDA memory stats.

field requires_clean_graph: bool = True: Whether this transform requires the graph to be clean before it is applied.

field requires_shape_prop: bool = False: Whether this transform requires shape propagation before it is applied.

field run_graph_cleanup: bool = True: Whether to run graph cleanup/canonicalization after this transform.

field run_per_gm: bool = True: Whether to run the transform per graph (sub)module or on whole module.

field run_shape_prop: bool = False: Whether to run shape propagation after this transform.

field skip_on_error: bool = False: Whether to skip the transform if an error occurs.

field stage: Stages [Required]: The stage of the transformation pipeline where this transform should run.
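
As a rough illustration of how the `enabled` and `skip_on_error` fields listed above might interact, the sketch below wraps a transform call. It is a toy model, not AutoDeploy's transform runner: the graph is a plain list of operator names, and `run_transform`, `insert_cache_ops`, and `failing_transform` are hypothetical names.

```python
from typing import Callable, List, Tuple


def run_transform(
    apply_fn: Callable[[List[str]], List[str]],
    graph: List[str],
    enabled: bool = True,
    skip_on_error: bool = False,
) -> Tuple[List[str], str]:
    """Hypothetical runner honoring the enabled and skip_on_error flags."""
    if not enabled:
        return graph, "disabled"
    try:
        # Work on a copy so a failed transform leaves the input untouched.
        return apply_fn(list(graph)), "applied"
    except Exception:
        if skip_on_error:
            return graph, "skipped"
        raise


def insert_cache_ops(graph: List[str]) -> List[str]:
    """Toy transform: swap the attention op for a cached variant."""
    return [f"cached_{op}" if op == "attention" else op for op in graph]


def failing_transform(graph: List[str]) -> List[str]:
    """Toy transform that always fails."""
    raise RuntimeError("unsupported pattern")


toy_graph = ["embed", "attention", "mlp"]
applied = run_transform(insert_cache_ops, toy_graph)
skipped = run_transform(failing_transform, toy_graph, skip_on_error=True)
disabled = run_transform(insert_cache_ops, toy_graph, enabled=False)
```

With the default `skip_on_error = False`, the failure propagates instead of being swallowed, which is usually the safer choice for cache insertion because an unmodified graph cannot run cached inference.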

Insert Cached SSM Attention

Transform key: insert_cached_ssm_attention

Source module: tensorrt_llm._torch.auto_deploy.transform.library.ssm_cache

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.ssm_cache.SSMCacheTransform(config: TransformConfig)

Bases: _InsertCachedOperator

A transform to handle SSM cache operations.

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.ssm_cache.SSMCacheTransformConfig

Bases: InsertCachedAttentionConfig

Configuration for insert_cached_ssm_attention.

Extends the base attention config with SSM-specific options.

JSON schema:

{
   "title": "SSMCacheTransformConfig",
   "description": "Configuration for insert_cached_ssm_attention.\n\nExtends the base attention config with SSM-specific options.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "The attention backend to use.",
         "title": "Backend"
      },
      "ssm_replay": {
         "default": false,
         "description": "Enable the replay SSM kernel (tl.dot fast-forward) for the MTP extend path. Requires SM >= 80 (Ampere+). Falls back to FlashInfer when disabled or when incompatible features are active (block reuse, tree attention).",
         "title": "Ssm Replay",
         "type": "boolean"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:

extra: str = allow

Fields:

ssm_replay (bool)

field ssm_replay: bool = False: Enable the replay SSM kernel (tl.dot fast-forward) for the MTP extend path. Requires SM >= 80 (Ampere+). Falls back to FlashInfer when disabled or when incompatible features are active (block reuse, tree attention).
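
The conditions in the `ssm_replay` description can be read as a small decision rule. The sketch below encodes that reading; the function name, its parameters, and the returned labels are hypothetical. Treating an SM version below 80 as a fallback rather than an error is an assumption, since the description only states the requirement.

```python
from typing import Tuple

AMPERE_SM = 80  # SM 8.0, the minimum stated for the replay kernel


def select_mtp_extend_kernel(
    ssm_replay: bool,
    compute_capability: Tuple[int, int],
    block_reuse: bool = False,
    tree_attention: bool = False,
) -> str:
    """Hypothetical helper mirroring the documented ssm_replay rule."""
    major, minor = compute_capability
    sm = major * 10 + minor
    if not ssm_replay:
        return "flashinfer"
    if block_reuse or tree_attention:
        # Documented as incompatible with the replay kernel.
        return "flashinfer"
    if sm < AMPERE_SM:
        # Assumption: handled as a fallback here; the real code may raise instead.
        return "flashinfer"
    return "replay"


hopper_choice = select_mtp_extend_kernel(True, (9, 0))
turing_choice = select_mtp_extend_kernel(True, (7, 5))
reuse_choice = select_mtp_extend_kernel(True, (8, 0), block_reuse=True)
```

Because `ssm_replay` defaults to `False`, the FlashInfer path is what runs unless the option is set explicitly.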

Insert Cached Causal Conv

Transform key: insert_cached_causal_conv

Source module: tensorrt_llm._torch.auto_deploy.transform.library.ssm_cache

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.ssm_cache.InitializeCausalConvCache(config: TransformConfig)

Bases: _InsertCachedOperator

A transform to handle causal conv cache operations.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.kvcache.InsertCachedAttentionConfig

Bases: TransformConfig

Configuration for the insert cached attention transform.

The JSON schema for this model is identical to the one shown under Insert Cached Attention above.

Config:

extra: str = allow

Fields:

backend (str | None)
debug_visualize_dir (str | None)
enabled (bool)
expect_mem_change (bool)
requires_clean_graph (bool)
requires_shape_prop (bool)
run_graph_cleanup (bool)
run_per_gm (bool)
run_shape_prop (bool)
skip_on_error (bool)
stage (Stages)

field backend: str | None = None: The attention backend to use.

field debug_visualize_dir: str | None = None: Debug visualization directory. None to disable visualization, or a path string to specify the output directory.

field enabled: bool = True: Whether to enable this transform.

field expect_mem_change: bool = False: Whether this transform is expected to cause changes in CUDA memory stats.

field requires_clean_graph: bool = True: Whether this transform requires the graph to be clean before it is applied.

field requires_shape_prop: bool = False: Whether this transform requires shape propagation before it is applied.

field run_graph_cleanup: bool = True: Whether to run graph cleanup/canonicalization after this transform.

field run_per_gm: bool = True: Whether to run the transform per graph (sub)module or on whole module.

field run_shape_prop: bool = False: Whether to run shape propagation after this transform.

field skip_on_error: bool = False: Whether to skip the transform if an error occurs.

field stage: Stages [Required]: The stage of the transformation pipeline where this transform should run.

Insert Cached Delta Rule

Transform key: insert_cached_delta_rule

Source module: tensorrt_llm._torch.auto_deploy.transform.library.ssm_cache

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.ssm_cache.InsertCachedDeltaRule(config: TransformConfig)

Bases: _InsertCachedOperator

A transform to handle delta rule cache operations.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.kvcache.InsertCachedAttentionConfig

Bases: TransformConfig

Configuration for the insert cached attention transform.

The JSON schema for this model is identical to the one shown under Insert Cached Attention above.

Config:

extra: str = allow

Fields:

backend (str | None)
debug_visualize_dir (str | None)
enabled (bool)
expect_mem_change (bool)
requires_clean_graph (bool)
requires_shape_prop (bool)
run_graph_cleanup (bool)
run_per_gm (bool)
run_shape_prop (bool)
skip_on_error (bool)
stage (Stages)

field backend: str | None = None: The attention backend to use.

field debug_visualize_dir: str | None = None: Debug visualization directory. None to disable visualization, or a path string to specify the output directory.

field enabled: bool = True: Whether to enable this transform.

field expect_mem_change: bool = False: Whether this transform is expected to cause changes in CUDA memory stats.

field requires_clean_graph: bool = True: Whether this transform requires the graph to be clean before it is applied.

field requires_shape_prop: bool = False: Whether this transform requires shape propagation before it is applied.

field run_graph_cleanup: bool = True: Whether to run graph cleanup/canonicalization after this transform.

field run_per_gm: bool = True: Whether to run the transform per graph (sub)module or on whole module.

field run_shape_prop: bool = False: Whether to run shape propagation after this transform.

field skip_on_error: bool = False: Whether to skip the transform if an error occurs.

field stage: Stages [Required]: The stage of the transformation pipeline where this transform should run.

Insert Cached Gated Delta Rule

Transform key: insert_cached_gated_delta_rule

Source module: tensorrt_llm._torch.auto_deploy.transform.library.ssm_cache

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.ssm_cache.InsertCachedGatedDeltaRule(config: TransformConfig)

Bases: _InsertCachedOperator

A transform to handle gated delta rule cache operations.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.kvcache.InsertCachedAttentionConfig

Bases: TransformConfig

Configuration for the insert cached attention transform.

The JSON schema for this model is identical to the one shown under Insert Cached Attention above.

Config:

extra: str = allow

Fields:

backend (str | None)
debug_visualize_dir (str | None)
enabled (bool)
expect_mem_change (bool)
requires_clean_graph (bool)
requires_shape_prop (bool)
run_graph_cleanup (bool)
run_per_gm (bool)
run_shape_prop (bool)
skip_on_error (bool)
stage (Stages)

field backend: str | None = None: The attention backend to use.

field debug_visualize_dir: str | None = None: Debug visualization directory. None to disable visualization, or a path string to specify the output directory.

field enabled: bool = True: Whether to enable this transform.

field expect_mem_change: bool = False: Whether this transform is expected to cause changes in CUDA memory stats.

field requires_clean_graph: bool = True: Whether this transform requires the graph to be clean before it is applied.

field requires_shape_prop: bool = False: Whether this transform requires shape propagation before it is applied.

field run_graph_cleanup: bool = True: Whether to run graph cleanup/canonicalization after this transform.

field run_per_gm: bool = True: Whether to run the transform per graph (sub)module or on whole module.

field run_shape_prop: bool = False: Whether to run shape propagation after this transform.

field skip_on_error: bool = False: Whether to skip the transform if an error occurs.

field stage: Stages [Required]: The stage of the transformation pipeline where this transform should run.

Insert Cached Residual Add

Transform key: insert_cached_residual_add

Source module: tensorrt_llm._torch.auto_deploy.transform.library.hidden_states

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.hidden_states.InsertCachedResidualAdd(config: TransformConfig)

Bases: _InsertCachedOperator

A transform to handle residual add cache operations.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.kvcache.InsertCachedAttentionConfig[source]

Bases: TransformConfig

Configuration for the insert cached attention transform.

Show JSON schema

{
   "title": "InsertCachedAttentionConfig",
   "description": "Configuration for the insert cached attention transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "The attention backend to use.",
         "title": "Backend"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:

extra: str = allow

Fields:

backend (str | None)
debug_visualize_dir (Optional[str])
enabled (bool)
expect_mem_change (bool)
requires_clean_graph (bool)
requires_shape_prop (bool)
run_graph_cleanup (bool)
run_per_gm (bool)
run_shape_prop (bool)
skip_on_error (bool)
stage (Stages)

field backend: str | None = None: The attention backend to use.

field debug_visualize_dir: str | None = None: Debug visualization directory. None to disable visualization, or a path string to specify the output directory.

field enabled: bool = True: Whether to enable this transform.

field expect_mem_change: bool = False: Whether this transform is expected to cause changes in CUDA memory stats.

field requires_clean_graph: bool = True: Whether this transform requires the graph to be clean before it is applied.

field requires_shape_prop: bool = False: Whether this transform requires shape propagation before it is applied.

field run_graph_cleanup: bool = True: Whether to run graph cleanup/canonicalization after this transform.

field run_per_gm: bool = True: Whether to run the transform per graph (sub)module or on whole module.

field run_shape_prop: bool = False: Whether to run shape propagation after this transform.

field skip_on_error: bool = False: Whether to skip the transform if an error occurs.

field stage: Stages [Required]: The stage of the transformation pipeline where this transform should run.

Initialize mRoPE Delta Cache#

Transform key: initialize_mrope_delta_cache

Source module: tensorrt_llm._torch.auto_deploy.transform.library.mrope_delta_cache

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.mrope_delta_cache.InitializeMropeDeltaCache( config: TransformConfig, )[source]#

Bases: BaseTransform

Allocate a per-slot mrope_delta_cache for multimodal mRoPE models.

This transform is intentionally enabled per-model via config rather than auto-detected. TODO: if we want to make this generic-by-default, add a proper capability signal or reliable heuristic instead of source inspection.

classmethod get_config_class() → Type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Initialize Cache#

Transform key: initialize_cache

Source module: tensorrt_llm._torch.auto_deploy.transform.library.kvcache

Configured modes: graph, transformers

class tensorrt_llm._torch.auto_deploy.transform.library.kvcache.InitializeCache( config: TransformConfig, )[source]#

Bases: BaseTransform

Initialize KV caches using KVCacheManager.

Gets kv_cache_config from shared_config.ad_config and creates the KVCacheManager in estimation mode with conservative capacity. The ResizeKVCache transform will later recreate it with optimal capacity after measuring memory usage.

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Resize KV Cache#

Transform key: resize_kv_cache

Source module: tensorrt_llm._torch.auto_deploy.transform.library.kvcache

Configured modes: graph, transformers

class tensorrt_llm._torch.auto_deploy.transform.library.kvcache.ResizeKVCache( config: TransformConfig, )[source]#

Bases: BaseTransform

Resize the KV cache to occupy available GPU memory.

This implements the two-phase approach:
1. Run a forward pass to allocate intermediate memory (activations, workspaces, etc.).
2. Call resize_kv_cache_manager() to recreate KVCacheManager with optimal capacity.
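
The arithmetic behind such a two-phase resize can be sketched as follows. This is a simplified model under stated assumptions, not the logic of resize_kv_cache_manager(): the function plan_resize, its parameters (including free_fraction and bytes_per_block), and the example sizes are all hypothetical.

```python
GiB = 1024**3
MiB = 1024**2


def plan_resize(total_bytes, peak_bytes, initial_cache_bytes,
                bytes_per_block, free_fraction):
    """Return (new_cache_bytes, num_blocks) for a hypothetical two-phase resize."""
    # Phase 1 outcome: peak usage seen during the forward pass. It is assumed
    # to already include weights, activations, and the small initial cache.
    free_bytes = max(total_bytes - peak_bytes, 0)
    # Phase 2: grow the cache into a chosen fraction of the remaining memory.
    new_cache_bytes = initial_cache_bytes + int(free_bytes * free_fraction)
    return new_cache_bytes, new_cache_bytes // bytes_per_block


# Example numbers are made up for illustration.
new_bytes, num_blocks = plan_resize(
    total_bytes=80 * GiB,
    peak_bytes=30 * GiB,
    initial_cache_bytes=1 * GiB,
    bytes_per_block=2 * MiB,
    free_fraction=0.5,
)
print(new_bytes // GiB, num_blocks)
```

Measuring after a real forward pass matters because activation and workspace memory is only known once the model has actually run; sizing the cache beforehand would have to guess at it.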

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Detect HF Attn Layers#

Transform key: detect_hf_attn_layers

Source module: tensorrt_llm._torch.auto_deploy.transform.library.kvcache_transformers

Configured modes: transformers

class tensorrt_llm._torch.auto_deploy.transform.library.kvcache_transformers.DetectHFAttnLayers( config: TransformConfig, )[source]#

Bases: BaseTransform

Detect the number of attention layers in the model and store a node-like reference for them.

This is achieved by running a single forward pass to profile the model.
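
The idea of counting layers by profiling one forward pass can be shown with a toy model. This is only a sketch of the general technique; ToyModel and recording_attention are hypothetical stand-ins and do not reflect how DetectHFAttnLayers is implemented.

```python
# Toy illustration: count attention layers by recording calls in one forward pass.
calls = []


def recording_attention(layer_idx, hidden):
    """Stand-in attention op that records which layer invoked it."""
    calls.append(layer_idx)
    return hidden  # pass the input through unchanged


class ToyModel:
    """Hypothetical model that calls the attention op once per layer."""

    def __init__(self, num_layers, attention):
        self.num_layers = num_layers
        self.attention = attention

    def forward(self, hidden):
        for idx in range(self.num_layers):
            hidden = self.attention(idx, hidden)
        return hidden


model = ToyModel(num_layers=4, attention=recording_attention)
model.forward([0.0])  # a single profiling pass
num_attn_layers = len(calls)
print(num_attn_layers, calls)
```

The recorded indices play the role of the per-layer references: after one pass, both the layer count and an identifier for each layer are known.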

YAML configuration

Uses the common TransformConfig fields documented in Core Transform APIs.

Transformers Replace Cached Attn#

Transform key: transformers_replace_cached_attn

Source module: tensorrt_llm._torch.auto_deploy.transform.library.kvcache_transformers

Configured modes: transformers

class tensorrt_llm._torch.auto_deploy.transform.library.kvcache_transformers.HFReplaceCachedAttn( config: TransformConfig, )[source]#

Bases: _InsertCachedOperator

Replace cached attention for the factory model, update inputs and outputs, and patch the gm forward.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.kvcache.InsertCachedAttentionConfig[source]

Bases: TransformConfig

Configuration for the insert cached attention transform.

Show JSON schema

{
   "title": "InsertCachedAttentionConfig",
   "description": "Configuration for the insert cached attention transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "backend": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "The attention backend to use.",
         "title": "Backend"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:

extra: str = allow

Fields:

backend (str | None)
debug_visualize_dir (Optional[str])
enabled (bool)
expect_mem_change (bool)
requires_clean_graph (bool)
requires_shape_prop (bool)
run_graph_cleanup (bool)
run_per_gm (bool)
run_shape_prop (bool)
skip_on_error (bool)
stage (Stages)

field backend: str | None = None: The attention backend to use.

field debug_visualize_dir: str | None = None: Debug visualization directory. None to disable visualization, or a path string to specify the output directory.

field enabled: bool = True: Whether to enable this transform.

field expect_mem_change: bool = False: Whether this transform is expected to cause changes in CUDA memory stats.

field requires_clean_graph: bool = True: Whether this transform requires the graph to be clean before it is applied.

field requires_shape_prop: bool = False: Whether this transform requires shape propagation before it is applied.

field run_graph_cleanup: bool = True: Whether to run graph cleanup/canonicalization after this transform.

field run_per_gm: bool = True: Whether to run the transform per graph (sub)module or on whole module.

field run_shape_prop: bool = False: Whether to run shape propagation after this transform.

field skip_on_error: bool = False: Whether to skip the transform if an error occurs.

field stage: Stages [Required]: The stage of the transformation pipeline where this transform should run.
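
The schema describes Stages as an ordered enumeration used to classify and pre-order transforms. One way such pre-ordering could work is sketched below; the transform keys (t_export, t_cache_a, and so on) are hypothetical, and the sorting approach is an assumption rather than the library's actual scheduling code.

```python
# Stage names in the order listed in the schema's enum.
STAGE_ORDER = [
    "factory", "export", "post_export", "pattern_matcher", "sharding",
    "weight_load", "post_load_fusion", "cache_init", "visualize", "compile",
]
RANK = {name: position for position, name in enumerate(STAGE_ORDER)}

# Hypothetical (transform key, stage) pairs in arbitrary config order.
transforms = [
    ("t_compile", "compile"),
    ("t_cache_a", "cache_init"),
    ("t_export", "export"),
    ("t_cache_b", "cache_init"),
]

# sorted() is stable, so transforms sharing a stage keep their listed order.
ordered = sorted(transforms, key=lambda item: RANK[item[1]])
ordered_keys = [key for key, _ in ordered]
print(ordered_keys)
```

With this scheme, a transform assigned to cache_init always runs after anything in post_load_fusion and before visualize or compile, regardless of where it appears in the config.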