Sharding Stage

Sharding determines and applies the distributed execution layout. The transforms in this stage identify tensor, expert, and batch-matmul sharding choices, then apply the graph rewrites and communication hints needed for multi-rank execution.

Apply Sharding Hints

Transform key: apply_sharding_hints

Source module: tensorrt_llm._torch.auto_deploy.transform.library.sharding_ir

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.sharding_ir.ApplyShardingHints(config: TransformConfig)

Bases: BaseTransform

Deterministic, node-local sharding transform driven by hint kwargs.

Iterates graph nodes and applies sharding based on explicit hint arguments (tp_mode, tp_scaled_dim, tp_scale_sizes, etc.) together with the runtime DistConfig. No cross-node propagation, no topology inference.

When the FX graph contains no sharding-IR markers (no torch.ops.auto_deploy.all_reduce node), this transform is a no-op and leaves the graph for the legacy detect_sharding pipeline. Otherwise it sets gm.meta["sharding_ir_applied"] = True after applying; Sharding and ShardingTransformExecutor read this flag and skip themselves.
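The hand-off between the hint-driven transform and the legacy pipeline can be sketched without torch. The following is a minimal illustration only: FakeNode, FakeGraphModule, the string MARKER_TARGET, and both helper functions are hypothetical stand-ins, and the real transform works on FX graph modules whose internals may differ.

```python
from dataclasses import dataclass, field

# Hypothetical string standing in for torch.ops.auto_deploy.all_reduce.
MARKER_TARGET = "auto_deploy.all_reduce"


@dataclass
class FakeNode:
    """Stand-in for an FX node; only the call target matters here."""
    target: str


@dataclass
class FakeGraphModule:
    """Stand-in for a graph module with a node list and a meta dict."""
    nodes: list
    meta: dict = field(default_factory=dict)


def apply_hints_if_marked(gm: FakeGraphModule) -> bool:
    """Return True if hint-driven sharding ran, False if it was a no-op."""
    has_marker = any(n.target == MARKER_TARGET for n in gm.nodes)
    if not has_marker:
        # No sharding-IR markers: leave the graph for the legacy pipeline.
        return False
    # ... node-local sharding driven by hint kwargs would happen here ...
    gm.meta["sharding_ir_applied"] = True
    return True


def legacy_should_skip(gm: FakeGraphModule) -> bool:
    """Legacy transforms skip themselves when the flag is set."""
    return bool(gm.meta.get("sharding_ir_applied", False))


legacy_graph = FakeGraphModule(nodes=[FakeNode("linear"), FakeNode("add")])
ir_graph = FakeGraphModule(nodes=[FakeNode("linear"), FakeNode(MARKER_TARGET)])

ran_on_legacy = apply_hints_if_marked(legacy_graph)
ran_on_ir = apply_hints_if_marked(ir_graph)
```

In this sketch the flag is written only when a marker node is present, so a graph without markers reaches the legacy transforms unchanged.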

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.sharding_ir.IRShardingConfig

Bases: TransformConfig

Minimal configuration for the hint-driven IR sharding transform.

Carries only the fields the IR pipeline reads. ShardingTransformConfig in sharding.py is the parallel config used by the heuristic-detection fallback for modeling files not yet ported to IR.

JSON schema:

{
   "title": "IRShardingConfig",
   "description": "Minimal configuration for the hint-driven IR sharding transform.\n\nCarries only the fields the IR pipeline reads. ``ShardingTransformConfig``\nin ``sharding.py`` is the parallel config used by the heuristic-detection\nfallback for modeling files not yet ported to IR.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "allreduce_strategy": {
         "$ref": "#/$defs/AllReduceStrategy",
         "default": 3,
         "description": "AllReduce strategy for distributed operations."
      },
      "simple_shard_only": {
         "default": false,
         "title": "Simple Shard Only",
         "type": "boolean"
      },
      "shard_layers": {
         "anyOf": [
            {
               "items": {
                  "type": "string"
               },
               "type": "array"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "When set, only shard nodes whose layer_type hint is in this list.",
         "title": "Shard Layers"
      },
      "simple_shard_filter": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Comma-separated weight-name keywords (e.g. 'lm_head'). Matching linears are gather-sharded (column split + all_gather) regardless of shard_layers -- used for the lm_head vocab projection, which the hint-driven sharder would otherwise replicate.",
         "title": "Simple Shard Filter"
      },
      "enable_attention_dp": {
         "default": false,
         "title": "Enable Attention Dp",
         "type": "boolean"
      },
      "dist_mapping": {
         "additionalProperties": {
            "type": "integer"
         },
         "title": "Dist Mapping",
         "type": "object"
      },
      "dist_config": {
         "$ref": "#/$defs/DistConfig"
      }
   },
   "$defs": {
      "AllReduceStrategy": {
         "enum": [
            0,
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9
         ],
         "title": "AllReduceStrategy",
         "type": "integer"
      },
      "DistConfig": {
         "additionalProperties": true,
         "description": "Distributed parallelism configuration for AutoDeploy.",
         "properties": {
            "world_size": {
               "default": 1,
               "minimum": 1,
               "title": "World Size",
               "type": "integer"
            },
            "rank": {
               "default": 0,
               "minimum": 0,
               "title": "Rank",
               "type": "integer"
            },
            "tp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Tp Size",
               "type": "integer"
            },
            "pp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Pp Size",
               "type": "integer"
            },
            "moe_tp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Tp Size",
               "type": "integer"
            },
            "moe_ep_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Ep Size",
               "type": "integer"
            },
            "moe_cluster_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Cluster Size",
               "type": "integer"
            },
            "enable_attention_dp": {
               "default": false,
               "title": "Enable Attention Dp",
               "type": "boolean"
            },
            "allreduce_strategy": {
               "default": "NCCL",
               "title": "Allreduce Strategy",
               "type": "string"
            }
         },
         "title": "DistConfig",
         "type": "object"
      },
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:

extra: str = allow

Fields:

allreduce_strategy (tensorrt_llm.functional.AllReduceStrategy)
dist_config (tensorrt_llm._torch.auto_deploy.utils.dist_config.DistConfig)
dist_mapping (dict[str, int])
enable_attention_dp (bool)
shard_layers (List[str] | None)
simple_shard_filter (str | None)
simple_shard_only (bool)

Validators:

_validate_allreduce_strategy » allreduce_strategy

field allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO: AllReduce strategy for distributed operations.

field dist_config: DistConfig [Optional]

field dist_mapping: dict[str, int] [Optional]

field enable_attention_dp: bool = False

field shard_layers: List[str] | None = None: When set, only shard nodes whose layer_type hint is in this list.

field simple_shard_filter: str | None = None: Comma-separated weight-name keywords (e.g. ‘lm_head’). Matching linears are gather-sharded (column split + all_gather) regardless of shard_layers – used for the lm_head vocab projection, which the hint-driven sharder would otherwise replicate.

field simple_shard_only: bool = False
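How shard_layers and simple_shard_filter interact for a single linear node can be illustrated as follows. This is a sketch under stated assumptions, not the transform's code: the function names, the returned labels ('gather', 'hint', 'skip'), the layer_type values "mha" and "mlp", and the example weight names are all hypothetical. Only the precedence rule (a filter match wins regardless of shard_layers) comes from the field descriptions above.

```python
from typing import List, Optional


def parse_keywords(simple_shard_filter: Optional[str]) -> List[str]:
    """Split a comma-separated filter string into non-empty keywords."""
    if not simple_shard_filter:
        return []
    return [kw.strip() for kw in simple_shard_filter.split(",") if kw.strip()]


def decide_action(
    weight_name: str,
    layer_type: Optional[str],
    shard_layers: Optional[List[str]],
    simple_shard_filter: Optional[str],
) -> str:
    """Classify one linear node as 'gather', 'hint', or 'skip' (illustrative labels)."""
    # A filter match wins regardless of shard_layers.
    if any(kw in weight_name for kw in parse_keywords(simple_shard_filter)):
        return "gather"  # column split + all_gather
    # With shard_layers set, only the listed layer types are sharded.
    if shard_layers is not None and layer_type not in shard_layers:
        return "skip"
    return "hint"  # shard according to the node's explicit hint kwargs


lm_head = decide_action("lm_head.weight", None, ["mha"], "lm_head")
mlp_up = decide_action("layers.0.mlp.up_proj.weight", "mlp", ["mha"], "lm_head")
q_proj = decide_action("layers.0.self_attn.q_proj.weight", "mha", ["mha"], "lm_head")
```

Here the lm_head projection is gather-sharded even though it carries no layer_type hint, which matches the stated purpose of simple_shard_filter.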

Detect Sharding

Transform key: detect_sharding

Source module: tensorrt_llm._torch.auto_deploy.transform.library.sharding

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.sharding.Sharding(config: TransformConfig)

Bases: BaseTransform

A transformation to apply sharding to the model following tensor parallelism.

The transformation is based on the following steps:

1. Identify boundary nodes between residual nodes to find enable_sharding regions.
2. Identify the GEMM nodes that can be sharded.
3. Trace through the subgraph between each pair of boundary nodes using DFS/BFS.
4. Account for every node in the trace so that each op remains correct after sharding. The subgraph here is the region from the first linear node to the last linear node of an identified sharding region.
5. Shard the GEMM nodes or skip accordingly.
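Why every node between the first and last linear node has to be accounted for can be seen from a small worked example. The sketch below uses the common column-split then row-split tensor-parallel pattern on plain Python lists; it is an assumption that this mirrors the rewrite the transform applies, and all helper names and matrix values are made up for illustration. The element-wise ReLU between the two GEMMs is safe to run on each shard independently, which is the property step 4 checks for.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise ReLU; acts on each column independently."""
    return [[max(v, 0) for v in row] for row in m]


def split_columns(m, parts):
    """Split a matrix into `parts` equal column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]


def split_rows(m, parts):
    """Split a matrix into `parts` equal row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]


def add(a, b):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1, -2], [3, 4]]                       # activations, replicated on every rank
w_up = [[1, 0, -1, 2], [2, 1, 0, -1]]       # first linear of the region: 2 -> 4
w_down = [[1, 0], [0, 1], [1, 1], [-1, 2]]  # last linear of the region: 4 -> 2

# Unsharded reference result.
reference = matmul(relu(matmul(x, w_up)), w_down)

# Each simulated rank holds one column block of w_up and one row block of w_down.
tp_size = 2
partials = []
for up_shard, down_shard in zip(split_columns(w_up, tp_size), split_rows(w_down, tp_size)):
    hidden = relu(matmul(x, up_shard))
    partials.append(matmul(hidden, down_shard))

# Summing the partial outputs plays the role of the all-reduce.
sharded = partials[0]
for partial in partials[1:]:
    sharded = add(sharded, partial)
```

An op that mixes values across the hidden dimension (for example a normalization over that dimension) would break this equivalence, which is why such a region would need different handling or would be skipped.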

min_local_shape is the minimum size of a local tensor shard. It prevents tensor parallelism from splitting a unit that must stay whole, for example an individual attention head, into smaller shards.
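The constraint that min_local_shape expresses can be illustrated with a simple size check. This is not the library's logic, only one plausible reading of the sentence above: both function names are hypothetical, and the head count and head dimension are example values.

```python
def local_shard_size(dim_size: int, tp_size: int) -> int:
    """Size of each rank's slice when dim_size is split evenly across tp_size ranks."""
    if dim_size % tp_size != 0:
        raise ValueError(f"{dim_size} is not divisible by tp_size={tp_size}")
    return dim_size // tp_size


def respects_min_local_shape(dim_size: int, tp_size: int, min_local_shape: int) -> bool:
    """True if an even split keeps every local shard at least min_local_shape wide."""
    return local_shard_size(dim_size, tp_size) >= min_local_shape


# Example values: 8 key/value heads of dimension 128 give a 1024-wide projection.
num_kv_heads, head_dim = 8, 128
kv_proj_out = num_kv_heads * head_dim

# Using head_dim as min_local_shape keeps whole heads on each rank.
ok_tp4 = respects_min_local_shape(kv_proj_out, tp_size=4, min_local_shape=head_dim)
ok_tp16 = respects_min_local_shape(kv_proj_out, tp_size=16, min_local_shape=head_dim)
```

With 4 ranks each shard holds two full heads; with 16 ranks an even split would cut each head in half, so the check fails.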

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingTransformConfig

Bases: TransformConfig

Configuration for sharding the model.

JSON schema:

{
   "title": "ShardingTransformConfig",
   "description": "Configuration for sharding the model.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "factory_source": {
         "$ref": "#/$defs/ShardingConfigSource",
         "default": "unknown"
      },
      "factory_config": {
         "additionalProperties": true,
         "title": "Factory Config",
         "type": "object"
      },
      "manual_config": {
         "additionalProperties": true,
         "title": "Manual Config",
         "type": "object"
      },
      "simple_shard_only": {
         "default": false,
         "title": "Simple Shard Only",
         "type": "boolean"
      },
      "support_partial_config": {
         "default": true,
         "title": "Support Partial Config",
         "type": "boolean"
      },
      "sharding_source": {
         "items": {
            "$ref": "#/$defs/ShardingSource"
         },
         "title": "Sharding Source",
         "type": "array"
      },
      "sharding_dims": {
         "items": {
            "$ref": "#/$defs/ShardingDim"
         },
         "title": "Sharding Dims",
         "type": "array"
      },
      "shard_all_unprocessed": {
         "default": false,
         "description": "When True, apply simple shard (column split + all_gather) to 'leftover' linear nodes that are not part of any layer subgraph.",
         "title": "Shard All Unprocessed",
         "type": "boolean"
      },
      "simple_shard_filter": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Comma-separated list of substrings to filter which unprocessed linear nodes are simple-sharded. A node is included if its name contains ANY of the listed keywords. Example: 'lm_head,shared_expert'. Only effective when shard_all_unprocessed is True. When None, all unprocessed linear nodes are sharded.",
         "title": "Simple Shard Filter"
      },
      "exclude_shard_node_filter": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Comma-separated list of substrings. Linear nodes whose name contains ANY of the listed substrings are kept TP-replicated across all ranks (no column-shard / row-shard / AllGather / AllReduce added) regardless of which sharding path (heuristic catch-all, MHA, MLA, MoE, MLP) selected them. Other sharding for the same layer (e.g. q_b_proj head-shard, o_proj row-shard) is unaffected. Use to match PT-style replicated patterns -- for example, DeepseekV3 keeps q_a_proj + kv_a_proj_with_mqa replicated (TP-unsharded, no AllGather) while still column-sharding q_b/kv_b by heads. Example: 'q_a_proj,kv_a_proj_with_mqa'. When None (default), all sharding paths run as-is.",
         "title": "Exclude Shard Node Filter"
      },
      "allreduce_strategy": {
         "$ref": "#/$defs/AllReduceStrategy",
         "default": 3,
         "description": "AllReduce strategy for distributed operations. Options: AUTO (automatic selection), NCCL, ONESHOT, TWOSHOT, MIN_LATENCY, LOWPRECISION, UB, MNNVL, NCCL_SYMMETRIC"
      },
      "allgather_strategy": {
         "$ref": "#/$defs/AllGatherStrategy",
         "default": "AUTO",
         "description": "AllGather strategy for distributed operations. Options: AUTO (NCCL AllGather), SYMM_MEM (symmetric memory with MULTIMEM, falls back to NCCL for unsupported cases)."
      },
      "dist_backend": {
         "$ref": "#/$defs/DistBackend",
         "default": "auto"
      },
      "enable_attention_dp": {
         "default": false,
         "description": "When True, skip TP sharding as attention data parallelism is enabled.",
         "title": "Enable Attention Dp",
         "type": "boolean"
      },
      "shard_layers": {
         "anyOf": [
            {
               "items": {
                  "type": "string"
               },
               "type": "array"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "When set, only shard nodes whose layer_type hint is in this list. Nodes with layer_type='unknown' or missing are NOT sharded. When None (default), all enable_sharding nodes are processed regardless of layer_type.",
         "title": "Shard Layers"
      },
      "dist_mapping": {
         "additionalProperties": {
            "type": "integer"
         },
         "title": "Dist Mapping",
         "type": "object"
      },
      "mapping": {
         "default": null,
         "title": "Mapping"
      },
      "dist_config": {
         "$ref": "#/$defs/DistConfig"
      }
   },
   "$defs": {
      "AllGatherStrategy": {
         "description": "Enum for AllGather strategy.\n\nAUTO: Use NCCL AllGather (default).\nSYMM_MEM: Use PyTorch symmetric memory with MULTIMEM hardware instructions.\n          Falls back to NCCL for unsupported cases (variable sizes, dim!=0, large tensors).",
         "enum": [
            "AUTO",
            "SYMM_MEM"
         ],
         "title": "AllGatherStrategy",
         "type": "string"
      },
      "AllReduceStrategy": {
         "enum": [
            0,
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9
         ],
         "title": "AllReduceStrategy",
         "type": "integer"
      },
      "DistBackend": {
         "description": "Enum for distributed backend.",
         "enum": [
            "auto",
            "trtllm",
            "torch"
         ],
         "title": "DistBackend",
         "type": "string"
      },
      "DistConfig": {
         "additionalProperties": true,
         "description": "Distributed parallelism configuration for AutoDeploy.",
         "properties": {
            "world_size": {
               "default": 1,
               "minimum": 1,
               "title": "World Size",
               "type": "integer"
            },
            "rank": {
               "default": 0,
               "minimum": 0,
               "title": "Rank",
               "type": "integer"
            },
            "tp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Tp Size",
               "type": "integer"
            },
            "pp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Pp Size",
               "type": "integer"
            },
            "moe_tp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Tp Size",
               "type": "integer"
            },
            "moe_ep_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Ep Size",
               "type": "integer"
            },
            "moe_cluster_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Cluster Size",
               "type": "integer"
            },
            "enable_attention_dp": {
               "default": false,
               "title": "Enable Attention Dp",
               "type": "boolean"
            },
            "allreduce_strategy": {
               "default": "NCCL",
               "title": "Allreduce Strategy",
               "type": "string"
            }
         },
         "title": "DistConfig",
         "type": "object"
      },
      "ShardingConfigSource": {
         "description": "Enum for factory source.",
         "enum": [
            "huggingface",
            "unknown"
         ],
         "title": "ShardingConfigSource",
         "type": "string"
      },
      "ShardingDim": {
         "description": "Enum for sharding dimension.",
         "enum": [
            "tp",
            "ep",
            "bmm"
         ],
         "title": "ShardingDim",
         "type": "string"
      },
      "ShardingSource": {
         "description": "Enum for sharding source.",
         "enum": [
            "heuristic",
            "factory",
            "manual"
         ],
         "title": "ShardingSource",
         "type": "string"
      },
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:

extra: str = allow
arbitrary_types_allowed: bool = True

Fields:

allgather_strategy (tensorrt_llm._torch.auto_deploy.transform.library.sharding.AllGatherStrategy)
allreduce_strategy (tensorrt_llm.functional.AllReduceStrategy)
dist_backend (tensorrt_llm._torch.auto_deploy.transform.library.sharding.DistBackend)
dist_config (tensorrt_llm._torch.auto_deploy.utils.dist_config.DistConfig)
dist_mapping (dict[str, int])
enable_attention_dp (bool)
exclude_shard_node_filter (str | None)
factory_config (Dict[str, Any])
factory_source (tensorrt_llm._torch.auto_deploy.models.factory.ShardingConfigSource)
manual_config (Dict[str, Any])
mapping (Any)
shard_all_unprocessed (bool)
shard_layers (List[str] | None)
sharding_dims (List[tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingDim])
sharding_source (List[tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingSource])
simple_shard_filter (str | None)
simple_shard_only (bool)
support_partial_config (bool)

Validators:

_validate_allgather_strategy » allgather_strategy
_validate_allreduce_strategy » allreduce_strategy

field allgather_strategy: AllGatherStrategy = AllGatherStrategy.AUTO: AllGather strategy for distributed operations. Options: AUTO (NCCL AllGather), SYMM_MEM (symmetric memory with MULTIMEM, falls back to NCCL for unsupported cases).

field allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO: AllReduce strategy for distributed operations. Options: AUTO (automatic selection), NCCL, ONESHOT, TWOSHOT, MIN_LATENCY, LOWPRECISION, UB, MNNVL, NCCL_SYMMETRIC

field dist_backend: DistBackend = DistBackend.AUTO

field dist_config: DistConfig [Optional]

field dist_mapping: dict[str, int] [Optional]

field enable_attention_dp: bool = False: When True, skip TP sharding as attention data parallelism is enabled.

field exclude_shard_node_filter: str | None = None: Comma-separated list of substrings. Linear nodes whose name contains ANY of the listed substrings are kept TP-replicated across all ranks (no column-shard / row-shard / AllGather / AllReduce added) regardless of which sharding path (heuristic catch-all, MHA, MLA, MoE, MLP) selected them. Other sharding for the same layer (e.g. q_b_proj head-shard, o_proj row-shard) is unaffected. Use to match PT-style replicated patterns – for example, DeepseekV3 keeps q_a_proj + kv_a_proj_with_mqa replicated (TP-unsharded, no AllGather) while still column-sharding q_b/kv_b by heads. Example: ‘q_a_proj,kv_a_proj_with_mqa’. When None (default), all sharding paths run as-is.
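The substring matching described for exclude_shard_node_filter can be sketched as below. The helper name and the node names are hypothetical; only the rule (a node is excluded if its name contains any listed substring, and None disables the filter) is taken from the field description.

```python
from typing import Optional


def is_excluded(node_name: str, exclude_shard_node_filter: Optional[str]) -> bool:
    """True if the node should stay replicated under the exclude filter."""
    if exclude_shard_node_filter is None:
        return False  # default: all sharding paths run as-is
    keywords = [k.strip() for k in exclude_shard_node_filter.split(",") if k.strip()]
    return any(k in node_name for k in keywords)


# Hypothetical node names for one MLA-style attention layer.
node_names = [
    "layers_0_self_attn_q_a_proj",
    "layers_0_self_attn_q_b_proj",
    "layers_0_self_attn_kv_a_proj_with_mqa",
    "layers_0_self_attn_o_proj",
]

exclude = "q_a_proj,kv_a_proj_with_mqa"
replicated = [n for n in node_names if is_excluded(n, exclude)]
still_sharded = [n for n in node_names if not is_excluded(n, exclude)]
```

The q_b_proj and o_proj nodes are unaffected, in line with the note that other sharding for the same layer continues to apply.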

field factory_config: Dict[str, Any] [Optional]

field factory_source: ShardingConfigSource = ShardingConfigSource.UNKNOWN

field manual_config: Dict[str, Any] [Optional]

field mapping: Any = None

field shard_all_unprocessed: bool = False: When True, apply simple shard (column split + all_gather) to ‘leftover’ linear nodes that are not part of any layer subgraph.

field shard_layers: List[str] | None = None: When set, only shard nodes whose layer_type hint is in this list. Nodes with layer_type='unknown' or missing are NOT sharded. When None (default), all enable_sharding nodes are processed regardless of layer_type.

field sharding_dims: List[ShardingDim] [Optional]

field sharding_source: List[ShardingSource] [Optional]

field simple_shard_filter: str | None = None: Comma-separated list of substrings to filter which unprocessed linear nodes are simple-sharded. A node is included if its name contains ANY of the listed keywords. Example: ‘lm_head,shared_expert’. Only effective when shard_all_unprocessed is True. When None, all unprocessed linear nodes are sharded.
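The interaction between shard_all_unprocessed and simple_shard_filter can be summarized in a few lines. This is an illustrative sketch: the function name and node names are hypothetical, and the three cases follow the field descriptions (filter ignored when shard_all_unprocessed is False, all leftover nodes selected when the filter is None, otherwise any-substring matching).

```python
from typing import List, Optional


def select_simple_shard_nodes(
    unprocessed: List[str],
    shard_all_unprocessed: bool,
    simple_shard_filter: Optional[str] = None,
) -> List[str]:
    """Pick the leftover linear nodes that would receive a simple shard."""
    if not shard_all_unprocessed:
        return []  # the filter has no effect in this case
    if simple_shard_filter is None:
        return list(unprocessed)  # no filter: every leftover node is sharded
    keywords = [k.strip() for k in simple_shard_filter.split(",") if k.strip()]
    return [name for name in unprocessed if any(k in name for k in keywords)]


# Hypothetical names of linear nodes left over after layer-subgraph sharding.
leftover = ["lm_head", "layers_3_shared_expert_up_proj", "vision_tower_proj"]

disabled = select_simple_shard_nodes(leftover, False, "lm_head")
unfiltered = select_simple_shard_nodes(leftover, True)
filtered = select_simple_shard_nodes(leftover, True, "lm_head,shared_expert")
```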

field simple_shard_only: bool = False

field support_partial_config: bool = True
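A configuration entry for this transform can be sanity-checked against the enum values listed in the JSON schema above. The sketch below is a stdlib-only mirror of a few schema constraints, not the pydantic validation the library performs; check_entry and the example entry are hypothetical, while the allowed values and the required stage field are copied from the schema.

```python
# Allowed values copied from the JSON schema's $defs section.
STAGES = [
    "factory", "export", "post_export", "pattern_matcher", "sharding",
    "weight_load", "post_load_fusion", "cache_init", "visualize", "compile",
]
SHARDING_SOURCES = {"heuristic", "factory", "manual"}
SHARDING_DIMS = {"tp", "ep", "bmm"}


def check_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if "stage" not in entry:
        problems.append("missing required field 'stage'")
    elif entry["stage"] not in STAGES:
        problems.append(f"unknown stage: {entry['stage']!r}")
    for source in entry.get("sharding_source", []):
        if source not in SHARDING_SOURCES:
            problems.append(f"unknown sharding_source: {source!r}")
    for dim in entry.get("sharding_dims", []):
        if dim not in SHARDING_DIMS:
            problems.append(f"unknown sharding_dims value: {dim!r}")
    return problems


# Example entry; the field names come from the schema, the values are illustrative.
detect_sharding_entry = {
    "stage": "sharding",
    "sharding_source": ["manual", "factory", "heuristic"],
    "sharding_dims": ["tp", "ep", "bmm"],
    "shard_all_unprocessed": True,
    "simple_shard_filter": "lm_head",
}

ok_problems = check_entry(detect_sharding_entry)
bad_problems = check_entry({"sharding_dims": ["pp"]})
```

Because the schema sets additionalProperties to true, a check of this kind should ignore unknown keys rather than reject them, which is what the sketch does.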

validate_config(sources: ShardingSource | List[ShardingSource] = None) → bool

Sharding Transform Executor

Transform key: sharding_transform_executor

Source module: tensorrt_llm._torch.auto_deploy.transform.library.sharding

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingTransformExecutor(config: TransformConfig)

Bases: BaseTransform

Apply transformations to the graph module.

Parameters:

gm – Graph module to apply transformations to
sharding_config – Transformation configuration containing list of transformations to apply

classmethod get_config_class() → Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingTransformConfig

Bases: TransformConfig

Configuration for sharding the model.

JSON schema:

{
   "title": "ShardingTransformConfig",
   "description": "Configuration for sharding the model.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "factory_source": {
         "$ref": "#/$defs/ShardingConfigSource",
         "default": "unknown"
      },
      "factory_config": {
         "additionalProperties": true,
         "title": "Factory Config",
         "type": "object"
      },
      "manual_config": {
         "additionalProperties": true,
         "title": "Manual Config",
         "type": "object"
      },
      "simple_shard_only": {
         "default": false,
         "title": "Simple Shard Only",
         "type": "boolean"
      },
      "support_partial_config": {
         "default": true,
         "title": "Support Partial Config",
         "type": "boolean"
      },
      "sharding_source": {
         "items": {
            "$ref": "#/$defs/ShardingSource"
         },
         "title": "Sharding Source",
         "type": "array"
      },
      "sharding_dims": {
         "items": {
            "$ref": "#/$defs/ShardingDim"
         },
         "title": "Sharding Dims",
         "type": "array"
      },
      "shard_all_unprocessed": {
         "default": false,
         "description": "When True, apply simple shard (column split + all_gather) to 'leftover' linear nodes that are not part of any layer subgraph.",
         "title": "Shard All Unprocessed",
         "type": "boolean"
      },
      "simple_shard_filter": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Comma-separated list of substrings to filter which unprocessed linear nodes are simple-sharded. A node is included if its name contains ANY of the listed keywords. Example: 'lm_head,shared_expert'. Only effective when shard_all_unprocessed is True. When None, all unprocessed linear nodes are sharded.",
         "title": "Simple Shard Filter"
      },
      "exclude_shard_node_filter": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Comma-separated list of substrings. Linear nodes whose name contains ANY of the listed substrings are kept TP-replicated across all ranks (no column-shard / row-shard / AllGather / AllReduce added) regardless of which sharding path (heuristic catch-all, MHA, MLA, MoE, MLP) selected them. Other sharding for the same layer (e.g. q_b_proj head-shard, o_proj row-shard) is unaffected. Use to match PT-style replicated patterns -- for example, DeepseekV3 keeps q_a_proj + kv_a_proj_with_mqa replicated (TP-unsharded, no AllGather) while still column-sharding q_b/kv_b by heads. Example: 'q_a_proj,kv_a_proj_with_mqa'. When None (default), all sharding paths run as-is.",
         "title": "Exclude Shard Node Filter"
      },
      "allreduce_strategy": {
         "$ref": "#/$defs/AllReduceStrategy",
         "default": 3,
         "description": "AllReduce strategy for distributed operations. Options: AUTO (automatic selection), NCCL, ONESHOT, TWOSHOT, MIN_LATENCY, LOWPRECISION, UB, MNNVL, NCCL_SYMMETRIC"
      },
      "allgather_strategy": {
         "$ref": "#/$defs/AllGatherStrategy",
         "default": "AUTO",
         "description": "AllGather strategy for distributed operations. Options: AUTO (NCCL AllGather), SYMM_MEM (symmetric memory with MULTIMEM, falls back to NCCL for unsupported cases)."
      },
      "dist_backend": {
         "$ref": "#/$defs/DistBackend",
         "default": "auto"
      },
      "enable_attention_dp": {
         "default": false,
         "description": "When True, skip TP sharding as attention data parallelism is enabled.",
         "title": "Enable Attention Dp",
         "type": "boolean"
      },
      "shard_layers": {
         "anyOf": [
            {
               "items": {
                  "type": "string"
               },
               "type": "array"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "When set, only shard nodes whose layer_type hint is in this list. Nodes with layer_type='unknown' or missing are NOT sharded. When None (default), all enable_sharding nodes are processed regardless of layer_type.",
         "title": "Shard Layers"
      },
      "dist_mapping": {
         "additionalProperties": {
            "type": "integer"
         },
         "title": "Dist Mapping",
         "type": "object"
      },
      "mapping": {
         "default": null,
         "title": "Mapping"
      },
      "dist_config": {
         "$ref": "#/$defs/DistConfig"
      }
   },
   "$defs": {
      "AllGatherStrategy": {
         "description": "Enum for AllGather strategy.\n\nAUTO: Use NCCL AllGather (default).\nSYMM_MEM: Use PyTorch symmetric memory with MULTIMEM hardware instructions.\n          Falls back to NCCL for unsupported cases (variable sizes, dim!=0, large tensors).",
         "enum": [
            "AUTO",
            "SYMM_MEM"
         ],
         "title": "AllGatherStrategy",
         "type": "string"
      },
      "AllReduceStrategy": {
         "enum": [
            0,
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9
         ],
         "title": "AllReduceStrategy",
         "type": "integer"
      },
      "DistBackend": {
         "description": "Enum for distributed backend.",
         "enum": [
            "auto",
            "trtllm",
            "torch"
         ],
         "title": "DistBackend",
         "type": "string"
      },
      "DistConfig": {
         "additionalProperties": true,
         "description": "Distributed parallelism configuration for AutoDeploy.",
         "properties": {
            "world_size": {
               "default": 1,
               "minimum": 1,
               "title": "World Size",
               "type": "integer"
            },
            "rank": {
               "default": 0,
               "minimum": 0,
               "title": "Rank",
               "type": "integer"
            },
            "tp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Tp Size",
               "type": "integer"
            },
            "pp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Pp Size",
               "type": "integer"
            },
            "moe_tp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Tp Size",
               "type": "integer"
            },
            "moe_ep_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Ep Size",
               "type": "integer"
            },
            "moe_cluster_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Cluster Size",
               "type": "integer"
            },
            "enable_attention_dp": {
               "default": false,
               "title": "Enable Attention Dp",
               "type": "boolean"
            },
            "allreduce_strategy": {
               "default": "NCCL",
               "title": "Allreduce Strategy",
               "type": "string"
            }
         },
         "title": "DistConfig",
         "type": "object"
      },
      "ShardingConfigSource": {
         "description": "Enum for factory source.",
         "enum": [
            "huggingface",
            "unknown"
         ],
         "title": "ShardingConfigSource",
         "type": "string"
      },
      "ShardingDim": {
         "description": "Enum for sharding dimension.",
         "enum": [
            "tp",
            "ep",
            "bmm"
         ],
         "title": "ShardingDim",
         "type": "string"
      },
      "ShardingSource": {
         "description": "Enum for sharding source.",
         "enum": [
            "heuristic",
            "factory",
            "manual"
         ],
         "title": "ShardingSource",
         "type": "string"
      },
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:

extra: str = allow
arbitrary_types_allowed: bool = True

Fields:

allgather_strategy (tensorrt_llm._torch.auto_deploy.transform.library.sharding.AllGatherStrategy)
allreduce_strategy (tensorrt_llm.functional.AllReduceStrategy)
debug_visualize_dir (Optional[str])
dist_backend (tensorrt_llm._torch.auto_deploy.transform.library.sharding.DistBackend)
dist_config (tensorrt_llm._torch.auto_deploy.utils.dist_config.DistConfig)
dist_mapping (dict[str, int])
enable_attention_dp (bool)
enabled (bool)
exclude_shard_node_filter (str | None)
expect_mem_change (bool)
factory_config (Dict[str, Any])
factory_source (tensorrt_llm._torch.auto_deploy.models.factory.ShardingConfigSource)
manual_config (Dict[str, Any])
mapping (Any)
requires_clean_graph (bool)
requires_shape_prop (bool)
run_graph_cleanup (bool)
run_per_gm (bool)
run_shape_prop (bool)
shard_all_unprocessed (bool)
shard_layers (List[str] | None)
sharding_dims (List[tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingDim])
sharding_source (List[tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingSource])
simple_shard_filter (str | None)
simple_shard_only (bool)
skip_on_error (bool)
stage (Stages)
support_partial_config (bool)

Validators:

_validate_allgather_strategy » allgather_strategy
_validate_allreduce_strategy » allreduce_strategy

field allgather_strategy: AllGatherStrategy = AllGatherStrategy.AUTO: AllGather strategy for distributed operations. Options: AUTO (NCCL AllGather), SYMM_MEM (symmetric memory with MULTIMEM, falls back to NCCL for unsupported cases).

field allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO: AllReduce strategy for distributed operations. Options: AUTO (automatic selection), NCCL, ONESHOT, TWOSHOT, MIN_LATENCY, LOWPRECISION, UB, MNNVL, NCCL_SYMMETRIC

field debug_visualize_dir: str | None = None: Debug visualization directory. None to disable visualization, or a path string to specify the output directory.

field dist_backend: DistBackend = DistBackend.AUTO

field dist_config: DistConfig [Optional]

field dist_mapping: dict[str, int] [Optional]

field enable_attention_dp: bool = False: When True, skip TP sharding as attention data parallelism is enabled.

field enabled: bool = True: Whether to enable this transform.

field exclude_shard_node_filter: str | None = None: Comma-separated list of substrings. Linear nodes whose name contains ANY of the listed substrings are kept TP-replicated across all ranks (no column-shard / row-shard / AllGather / AllReduce added) regardless of which sharding path (heuristic catch-all, MHA, MLA, MoE, MLP) selected them. Other sharding for the same layer (e.g. q_b_proj head-shard, o_proj row-shard) is unaffected. Use to match PT-style replicated patterns; for example, DeepseekV3 keeps q_a_proj + kv_a_proj_with_mqa replicated (TP-unsharded, no AllGather) while still column-sharding q_b/kv_b by heads. Example: 'q_a_proj,kv_a_proj_with_mqa'. When None (default), all sharding paths run as-is.
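The ANY-substring matching described for exclude_shard_node_filter can be sketched as below. This is only an illustration of the documented semantics, not the library's implementation: the helper name is_excluded, the whitespace stripping, and the node names are all assumptions made for the example.

```python
from typing import Optional


def is_excluded(node_name: str, exclude_filter: Optional[str]) -> bool:
    """Return True if node_name contains ANY substring listed in exclude_filter.

    Hypothetical helper that mirrors the documented behavior only.
    """
    if exclude_filter is None:
        return False  # default: every sharding path runs as-is
    # Split on commas; stripping whitespace and dropping empty entries is an
    # assumption of this sketch, not documented behavior.
    substrings = [s.strip() for s in exclude_filter.split(",") if s.strip()]
    return any(s in node_name for s in substrings)


exclude_filter = "q_a_proj,kv_a_proj_with_mqa"
# Illustrative node names; real FX node names depend on the exported graph.
names = [
    "layers_0_self_attn_q_a_proj",
    "layers_0_self_attn_q_b_proj",
    "layers_0_self_attn_o_proj",
]
replicated = [n for n in names if is_excluded(n, exclude_filter)]
```

In this sketch only the q_a_proj node is held back as replicated; the q_b_proj and o_proj nodes remain eligible for whatever sharding path selected them.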

field expect_mem_change: bool = False: Whether this transform is expected to cause changes in CUDA memory stats.

field factory_config: Dict[str, Any] [Optional]

field factory_source: ShardingConfigSource = ShardingConfigSource.UNKNOWN

field manual_config: Dict[str, Any] [Optional]

field mapping: Any = None

field requires_clean_graph: bool = True: Whether this transform requires the graph to be clean before it is applied.

field requires_shape_prop: bool = False: Whether this transform requires shape propagation before it is applied.

field run_graph_cleanup: bool = True: Whether to run graph cleanup/canonicalization after this transform.

field run_per_gm: bool = True: Whether to run the transform per graph (sub)module or on whole module.

field run_shape_prop: bool = False: Whether to run shape propagation after this transform.

field shard_all_unprocessed: bool = False: When True, apply simple shard (column split + all_gather) to 'leftover' linear nodes that are not part of any layer subgraph.

field shard_layers: List[str] | None = None: When set, only shard nodes whose layer_type hint is in this list. Nodes with layer_type='unknown' or missing are NOT sharded. When None (default), all enable_sharding nodes are processed regardless of layer_type.
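The shard_layers rule reads as a small decision function. The sketch below assumes a hypothetical helper passes_layer_filter and made-up layer_type values ("mha", "mlp"); the actual hint values and the place where the check happens are not specified here.

```python
from typing import List, Optional


def passes_layer_filter(
    layer_type: Optional[str], shard_layers: Optional[List[str]]
) -> bool:
    """Decide whether a node's layer_type hint lets it be sharded.

    Hypothetical helper; it encodes only the rules stated in the field
    description.
    """
    if shard_layers is None:
        return True  # default: layer_type is ignored
    if layer_type is None or layer_type == "unknown":
        return False  # missing or 'unknown' hints are never sharded
    return layer_type in shard_layers


# Illustrative layer_type values (assumptions, not a documented enum).
only_attention = ["mha"]
decisions = {
    lt: passes_layer_filter(lt, only_attention)
    for lt in ["mha", "mlp", "unknown"]
}
```

With shard_layers set to ["mha"], the sketch accepts only the "mha" node; passing None instead accepts every node, including those without a hint.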

field sharding_dims: List[ShardingDim] [Optional]

field sharding_source: List[ShardingSource] [Optional]

field simple_shard_filter: str | None = None: Comma-separated list of substrings to filter which unprocessed linear nodes are simple-sharded. A node is included if its name contains ANY of the listed keywords. Example: 'lm_head,shared_expert'. Only effective when shard_all_unprocessed is True. When None, all unprocessed linear nodes are sharded.
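How simple_shard_filter interacts with shard_all_unprocessed can be illustrated with a short selection function. The function name select_simple_shard_nodes and the node names are hypothetical; the sketch follows the field descriptions and is not taken from the transform's source.

```python
from typing import List, Optional


def select_simple_shard_nodes(
    unprocessed: List[str],
    shard_all_unprocessed: bool,
    simple_shard_filter: Optional[str] = None,
) -> List[str]:
    """Pick which leftover linear nodes would be simple-sharded (sketch)."""
    if not shard_all_unprocessed:
        return []  # the filter has no effect unless the flag is set
    if simple_shard_filter is None:
        return list(unprocessed)  # no filter: shard every leftover node
    keywords = [k for k in simple_shard_filter.split(",") if k]
    # Keep a node if its name contains ANY keyword.
    return [n for n in unprocessed if any(k in n for k in keywords)]


# Illustrative node names (assumptions).
leftover = ["lm_head", "mlp_shared_expert_up_proj", "vision_proj"]
picked = select_simple_shard_nodes(leftover, True, "lm_head,shared_expert")
```

Setting the filter while leaving shard_all_unprocessed at its default of False selects nothing in this sketch, which matches the "only effective when" wording in the field description.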

field simple_shard_only: bool = False

field skip_on_error: bool = False: Whether to skip the transform if an error occurs.

field stage: Stages [Required]: The stage of the transformation pipeline where this transform should run.

field support_partial_config: bool = True

validate_config( sources: ShardingSource | List[ShardingSource] = None, ) → bool[source]

Pipeline Cache#

Transform key: pipeline_cache

Source module: tensorrt_llm._torch.auto_deploy.transform.pipeline_cache.pipeline_cache

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.pipeline_cache.pipeline_cache.PipelineCache( config: TransformConfig, )[source]#

Bases: BaseTransform

Transform that snapshots/restores the model at its configured pipeline position.

classmethod get_config_class() → type[TransformConfig][source]#

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

maybe_restore( _cm: CachedSequenceInterface, factory: ModelFactory, shared_config: SharedConfig, transform_index: int, ) → Module | None[source]#: Return a cached module for this transform point, or None on a miss.
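The "cached module or None on a miss" contract suggests a caller pattern along these lines. This is a toy sketch: ToyCache, its snapshot method, and run_with_cache are hypothetical stand-ins, the real method takes (_cm, factory, shared_config, transform_index) and returns a Module, and how the pipeline actually consumes the result is not shown in this reference.

```python
from typing import Callable, Dict, Optional


class ToyCache:
    """Dict-backed stand-in for a pipeline cache (hypothetical)."""

    def __init__(self) -> None:
        self._store: Dict[int, str] = {}

    def maybe_restore(self, transform_index: int) -> Optional[str]:
        # Return the cached object for this transform point, or None on a miss.
        return self._store.get(transform_index)

    def snapshot(self, transform_index: int, module: str) -> None:
        self._store[transform_index] = module


def run_with_cache(
    cache: ToyCache, transform_index: int, build: Callable[[], str]
) -> str:
    restored = cache.maybe_restore(transform_index)
    if restored is not None:
        return restored  # hit: skip the expensive rebuild
    module = build()  # miss: do the work, then store it for next time
    cache.snapshot(transform_index, module)
    return module


cache = ToyCache()
calls = []


def build() -> str:
    calls.append(1)
    return "transformed-module"


first = run_with_cache(cache, 3, build)
second = run_with_cache(cache, 3, build)
```

The second call returns the stored object without invoking build again, which is the behavior a None-on-miss return value makes easy to express.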

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.pipeline_cache.pipeline_cache.PipelineCacheConfig[source]

Bases: TransformConfig

Configuration for the torch-save pipeline cache transform.

Show JSON schema

{
   "title": "PipelineCacheConfig",
   "description": "Configuration for the torch-save pipeline cache transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "enabled": {
         "default": false,
         "description": "Whether to enable the torch-save pipeline cache transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "root": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Cache root. Defaults to ~/.cache/tensorrt_llm/auto_deploy/pipeline_cache when the transform is enabled.",
         "title": "Root"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": false,
   "required": [
      "stage"
   ]
}

Config:

extra: str = forbid

Fields:

enabled (bool)
root (str | None)

Validators:

validate_enabled_cache » all fields

field enabled: bool = False: Whether to enable the torch-save pipeline cache transform.

field root: str | None = None: Cache root. Defaults to ~/.cache/tensorrt_llm/auto_deploy/pipeline_cache when the transform is enabled.
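The default for root can be pictured as the following resolution step. The helper resolve_cache_root is hypothetical, and whether the validate_enabled_cache validator performs this exact logic is an assumption; only the default path string comes from the field description.

```python
from pathlib import Path
from typing import Optional

# Default location quoted in the field description.
DEFAULT_ROOT = "~/.cache/tensorrt_llm/auto_deploy/pipeline_cache"


def resolve_cache_root(enabled: bool, root: Optional[str]) -> Optional[Path]:
    """Return the directory the cache would use, or None when disabled (sketch)."""
    if not enabled:
        return None  # assumption: no root is resolved for a disabled cache
    chosen = root if root is not None else DEFAULT_ROOT
    return Path(chosen).expanduser()  # expand the leading "~"


default_root = resolve_cache_root(True, None)
custom_root = resolve_cache_root(True, "/tmp/ad_cache")
disabled_root = resolve_cache_root(False, None)
```

An explicit root takes precedence over the default in this sketch, and the "~" prefix is expanded to the current user's home directory.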

validator validate_enabled_cache » all fields[source]

debug_visualize_dir: ClassVar[str | None] = None

expect_mem_change: ClassVar[bool] = False

requires_clean_graph: ClassVar[bool] = False

requires_shape_prop: ClassVar[bool] = False

run_graph_cleanup: ClassVar[bool] = False

run_per_gm: ClassVar[bool] = False

run_shape_prop: ClassVar[bool] = False

skip_on_error: ClassVar[bool] = True