Sharding Stage

The sharding stage determines and applies the distributed execution layout. Its transforms identify tensor, expert, and batch-matmul sharding choices, then apply the graph rewrites and communication hints needed for multi-rank execution.

Detect Sharding

Transform key: detect_sharding

Source module: tensorrt_llm._torch.auto_deploy.transform.library.sharding

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.sharding.Sharding(config: TransformConfig)

Bases: BaseTransform

A transformation to apply sharding to the model following tensor parallelism.

The transformation is based on the following steps:

  1. Identify boundary nodes between residual nodes to delimit enable_sharding regions.

  2. Identify the GEMM nodes that can be sharded.

  3. Trace through the subgraph between each pair of boundary nodes using DFS/BFS.

  4. Account for every node in the trace so that each op remains correct after sharding. All nodes in the subgraph must be accounted for; the subgraph is the region from the first linear node to the last linear node of an identified sharding region.

  5. Shard the GEMM nodes or skip accordingly.

min_local_shape is the minimum size of a local tensor shard. It prevents tensor parallelism (TP) from splitting a tensor too finely, e.g., splitting individual heads into smaller shards.
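Why GEMM nodes are handled in pairs between boundary nodes can be illustrated with a small, framework-free sketch. It is not the AutoDeploy implementation: the helper names (`matmul`, `split_columns`, `split_rows`, `tp_forward`) are hypothetical, plain Python lists stand in for tensors, and the column-then-row split is assumed as the common tensor-parallel pattern. The sketch checks that sharding the first weight by columns and the second by rows, then summing the per-rank partial outputs (the role an all-reduce plays), reproduces the unsharded result.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` equal column blocks (one per rank)."""
    step = len(w[0]) // parts
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(parts)]


def split_rows(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` equal row blocks (one per rank)."""
    step = len(w) // parts
    return [w[r * step:(r + 1) * step] for r in range(parts)]


def tp_forward(x: Matrix, w1: Matrix, w2: Matrix, world_size: int) -> Matrix:
    """Simulate x @ w1 @ w2 with w1 column-sharded and w2 row-sharded."""
    partials = []
    for w1_r, w2_r in zip(split_columns(w1, world_size), split_rows(w2, world_size)):
        hidden_r = matmul(x, w1_r)  # local slice of the hidden activation
        partials.append(matmul(hidden_r, w2_r))  # partial sum of the final output
    # The element-wise sum across ranks stands in for the all-reduce.
    out = partials[0]
    for p in partials[1:]:
        out = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(out, p)]
    return out


x = [[1, 2], [3, 4]]
w1 = [[1, 0, 2, 1], [0, 1, 1, 3]]
w2 = [[1, 2], [0, 1], [2, 0], [1, 1]]
reference = matmul(matmul(x, w1), w2)
sharded = tp_forward(x, w1, w2, world_size=2)
print(sharded == reference)
```

Only one communication step is needed per pair, which is why the nodes between the first and last linear node must all be safe to run on a local shard.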

classmethod get_config_class() -> Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingTransformConfig

Bases: TransformConfig

Configuration for sharding the model.

JSON schema
{
   "title": "ShardingTransformConfig",
   "description": "Configuration for sharding the model.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "factory_source": {
         "$ref": "#/$defs/ShardingConfigSource",
         "default": "unknown"
      },
      "factory_config": {
         "additionalProperties": true,
         "title": "Factory Config",
         "type": "object"
      },
      "manual_config": {
         "additionalProperties": true,
         "title": "Manual Config",
         "type": "object"
      },
      "simple_shard_only": {
         "default": false,
         "title": "Simple Shard Only",
         "type": "boolean"
      },
      "support_partial_config": {
         "default": true,
         "title": "Support Partial Config",
         "type": "boolean"
      },
      "sharding_source": {
         "items": {
            "$ref": "#/$defs/ShardingSource"
         },
         "title": "Sharding Source",
         "type": "array"
      },
      "sharding_dims": {
         "items": {
            "$ref": "#/$defs/ShardingDim"
         },
         "title": "Sharding Dims",
         "type": "array"
      },
      "shard_all_unprocessed": {
         "default": false,
         "description": "When True, apply simple shard (column split + all_gather) to 'leftover' linear nodes that are not part of any layer subgraph.",
         "title": "Shard All Unprocessed",
         "type": "boolean"
      },
      "simple_shard_filter": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Comma-separated list of substrings to filter which unprocessed linear nodes are simple-sharded. A node is included if its name contains ANY of the listed keywords. Example: 'lm_head,shared_expert'. Only effective when shard_all_unprocessed is True. When None, all unprocessed linear nodes are sharded.",
         "title": "Simple Shard Filter"
      },
      "allreduce_strategy": {
         "$ref": "#/$defs/AllReduceStrategy",
         "default": 3,
         "description": "AllReduce strategy for distributed operations. Options: AUTO (automatic selection), NCCL, ONESHOT, TWOSHOT, MIN_LATENCY, LOWPRECISION, UB, MNNVL, NCCL_SYMMETRIC"
      },
      "allgather_strategy": {
         "$ref": "#/$defs/AllGatherStrategy",
         "default": "AUTO",
         "description": "AllGather strategy for distributed operations. Options: AUTO (NCCL AllGather), SYMM_MEM (symmetric memory with MULTIMEM, falls back to NCCL for unsupported cases)."
      },
      "dist_backend": {
         "$ref": "#/$defs/DistBackend",
         "default": "auto"
      },
      "enable_attention_dp": {
         "default": false,
         "description": "When True, skip TP sharding as attention data parallelism is enabled.",
         "title": "Enable Attention Dp",
         "type": "boolean"
      },
      "shard_layers": {
         "anyOf": [
            {
               "items": {
                  "type": "string"
               },
               "type": "array"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "When set, only shard nodes whose layer_type hint is in this list. Nodes with layer_type='unknown' or missing are NOT sharded. When None (default), all enable_sharding nodes are processed regardless of layer_type.",
         "title": "Shard Layers"
      },
      "dist_mapping": {
         "additionalProperties": {
            "type": "integer"
         },
         "title": "Dist Mapping",
         "type": "object"
      },
      "mapping": {
         "default": null,
         "title": "Mapping"
      },
      "dist_config": {
         "$ref": "#/$defs/DistConfig"
      }
   },
   "$defs": {
      "AllGatherStrategy": {
         "description": "Enum for AllGather strategy.\n\nAUTO: Use NCCL AllGather (default).\nSYMM_MEM: Use PyTorch symmetric memory with MULTIMEM hardware instructions.\n          Falls back to NCCL for unsupported cases (variable sizes, dim!=0, large tensors).",
         "enum": [
            "AUTO",
            "SYMM_MEM"
         ],
         "title": "AllGatherStrategy",
         "type": "string"
      },
      "AllReduceStrategy": {
         "enum": [
            0,
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9
         ],
         "title": "AllReduceStrategy",
         "type": "integer"
      },
      "DistBackend": {
         "description": "Enum for distributed backend.",
         "enum": [
            "auto",
            "trtllm",
            "torch"
         ],
         "title": "DistBackend",
         "type": "string"
      },
      "DistConfig": {
         "additionalProperties": true,
         "description": "Distributed parallelism configuration for AutoDeploy.",
         "properties": {
            "world_size": {
               "default": 1,
               "minimum": 1,
               "title": "World Size",
               "type": "integer"
            },
            "rank": {
               "default": 0,
               "minimum": 0,
               "title": "Rank",
               "type": "integer"
            },
            "tp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Tp Size",
               "type": "integer"
            },
            "pp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Pp Size",
               "type": "integer"
            },
            "moe_tp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Tp Size",
               "type": "integer"
            },
            "moe_ep_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Ep Size",
               "type": "integer"
            },
            "moe_cluster_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Cluster Size",
               "type": "integer"
            },
            "enable_attention_dp": {
               "default": false,
               "title": "Enable Attention Dp",
               "type": "boolean"
            },
            "allreduce_strategy": {
               "default": "NCCL",
               "title": "Allreduce Strategy",
               "type": "string"
            }
         },
         "title": "DistConfig",
         "type": "object"
      },
      "ShardingConfigSource": {
         "description": "Enum for factory source.",
         "enum": [
            "huggingface",
            "unknown"
         ],
         "title": "ShardingConfigSource",
         "type": "string"
      },
      "ShardingDim": {
         "description": "Enum for sharding dimension.",
         "enum": [
            "tp",
            "ep",
            "bmm"
         ],
         "title": "ShardingDim",
         "type": "string"
      },
      "ShardingSource": {
         "description": "Enum for sharding source.",
         "enum": [
            "heuristic",
            "factory",
            "manual"
         ],
         "title": "ShardingSource",
         "type": "string"
      },
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = allow

  • arbitrary_types_allowed: bool = True

Fields:
  • allgather_strategy (tensorrt_llm._torch.auto_deploy.transform.library.sharding.AllGatherStrategy)

  • allreduce_strategy (tensorrt_llm.functional.AllReduceStrategy)

  • dist_backend (tensorrt_llm._torch.auto_deploy.transform.library.sharding.DistBackend)

  • dist_config (tensorrt_llm._torch.auto_deploy.utils.dist_config.DistConfig)

  • dist_mapping (dict[str, int])

  • enable_attention_dp (bool)

  • factory_config (Dict[str, Any])

  • factory_source (tensorrt_llm._torch.auto_deploy.models.factory.ShardingConfigSource)

  • manual_config (Dict[str, Any])

  • mapping (Any)

  • shard_all_unprocessed (bool)

  • shard_layers (List[str] | None)

  • sharding_dims (List[tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingDim])

  • sharding_source (List[tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingSource])

  • simple_shard_filter (str | None)

  • simple_shard_only (bool)

  • support_partial_config (bool)

Validators:
  • _validate_allgather_strategy » allgather_strategy

  • _validate_allreduce_strategy » allreduce_strategy

field allgather_strategy: AllGatherStrategy = AllGatherStrategy.AUTO

AllGather strategy for distributed operations. Options: AUTO (NCCL AllGather), SYMM_MEM (symmetric memory with MULTIMEM, falls back to NCCL for unsupported cases).

field allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO

AllReduce strategy for distributed operations. Options: AUTO (automatic selection), NCCL, ONESHOT, TWOSHOT, MIN_LATENCY, LOWPRECISION, UB, MNNVL, NCCL_SYMMETRIC

field dist_backend: DistBackend = DistBackend.AUTO
field dist_config: DistConfig [Optional]
field dist_mapping: dict[str, int] [Optional]
field enable_attention_dp: bool = False

When True, skip TP sharding as attention data parallelism is enabled.

field factory_config: Dict[str, Any] [Optional]
field factory_source: ShardingConfigSource = ShardingConfigSource.UNKNOWN
field manual_config: Dict[str, Any] [Optional]
field mapping: Any = None
field shard_all_unprocessed: bool = False

When True, apply simple shard (column split + all_gather) to ‘leftover’ linear nodes that are not part of any layer subgraph.
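The "column split + all_gather" pattern can be sketched without any framework. This is an illustration only: `simple_shard_forward` and `matmul` are hypothetical helpers, plain lists stand in for tensors, and list concatenation stands in for the all_gather collective. Each simulated rank holds a slice of the weight's output features and produces the matching slice of the output.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def simple_shard_forward(x: Matrix, w: Matrix, world_size: int) -> Matrix:
    """Column-split w, compute per-rank output slices, then concatenate them."""
    step = len(w[0]) // world_size
    slices = []
    for rank in range(world_size):
        w_local = [row[rank * step:(rank + 1) * step] for row in w]
        slices.append(matmul(x, w_local))  # this rank's output columns
    # Concatenating along the feature dimension stands in for all_gather.
    return [sum((s[i] for s in slices), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(simple_shard_forward(x, w, world_size=2) == matmul(x, w))
```

Because the gathered output is complete on every rank, this pattern works for an isolated linear node that has no matching second GEMM to pair with.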

field shard_layers: List[str] | None = None

When set, only shard nodes whose layer_type hint is in this list. Nodes with layer_type='unknown' or missing are NOT sharded. When None (default), all enable_sharding nodes are processed regardless of layer_type.

field sharding_dims: List[ShardingDim] [Optional]
field sharding_source: List[ShardingSource] [Optional]
field simple_shard_filter: str | None = None

Comma-separated list of substrings to filter which unprocessed linear nodes are simple-sharded. A node is included if its name contains ANY of the listed keywords. Example: ‘lm_head,shared_expert’. Only effective when shard_all_unprocessed is True. When None, all unprocessed linear nodes are sharded.
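The matching rule described here can be sketched as follows. The function name `select_simple_shard_nodes` and the node names are hypothetical, and stripping whitespace around each keyword is an assumption rather than documented behavior.

```python
from typing import List, Optional


def select_simple_shard_nodes(
    node_names: List[str], simple_shard_filter: Optional[str]
) -> List[str]:
    """Return the unprocessed linear nodes that the filter would keep."""
    if simple_shard_filter is None:
        return list(node_names)  # no filter: every unprocessed node is kept
    # Assumption: surrounding whitespace and empty entries are ignored.
    keywords = [k.strip() for k in simple_shard_filter.split(",") if k.strip()]
    # A node is kept if its name contains ANY keyword as a substring.
    return [name for name in node_names if any(k in name for k in keywords)]


names = ["model_lm_head", "layers_3_shared_expert_up_proj", "vision_proj"]
print(select_simple_shard_nodes(names, "lm_head,shared_expert"))
print(select_simple_shard_nodes(names, None))
```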

field simple_shard_only: bool = False
field support_partial_config: bool = True
validate_config(sources: ShardingSource | List[ShardingSource] = None) -> bool
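As a rough illustration of how these fields fit together, the sketch below builds a transform entry as a Python dictionary (the shape a parsed YAML entry would take) and checks it against the enum values listed in the JSON schema. The top-level `transforms` key, the chosen field values, and the `check_entry` helper are assumptions made for the example; they are not taken from the AutoDeploy sources.

```python
import json

# Allowed values copied from the JSON schema ($defs) of ShardingTransformConfig.
SHARDING_SOURCES = {"heuristic", "factory", "manual"}
SHARDING_DIMS = {"tp", "ep", "bmm"}

# Hypothetical entry for the detect_sharding transform. The "transforms"
# wrapper key is an assumption about the surrounding YAML layout.
config = {
    "transforms": {
        "detect_sharding": {
            "stage": "sharding",
            "sharding_source": ["heuristic"],
            "sharding_dims": ["tp", "ep", "bmm"],
            "simple_shard_only": False,
            "shard_all_unprocessed": True,
            "simple_shard_filter": "lm_head,shared_expert",
        }
    }
}


def check_entry(entry: dict) -> list:
    """Return a list of problems found in one transform entry (empty if none)."""
    problems = []
    if "stage" not in entry:
        problems.append("missing required field: stage")
    for src in entry.get("sharding_source", []):
        if src not in SHARDING_SOURCES:
            problems.append(f"unknown sharding_source: {src}")
    for dim in entry.get("sharding_dims", []):
        if dim not in SHARDING_DIMS:
            problems.append(f"unknown sharding_dims value: {dim}")
    if entry.get("simple_shard_filter") and not entry.get("shard_all_unprocessed", False):
        problems.append("simple_shard_filter has no effect unless shard_all_unprocessed is true")
    return problems


entry = config["transforms"]["detect_sharding"]
print(json.dumps(entry, indent=2))
print(check_entry(entry))
```

The real validation is performed by the pydantic model itself; this helper only mirrors the constraints that are visible in the schema and field descriptions.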

Sharding Transform Executor

Transform key: sharding_transform_executor

Source module: tensorrt_llm._torch.auto_deploy.transform.library.sharding

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingTransformExecutor(config: TransformConfig)

Bases: BaseTransform

Apply transformations to the graph module.

Parameters:
  • gm – Graph module to apply transformations to

  • sharding_config – Transformation configuration containing list of transformations to apply

classmethod get_config_class() -> Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingTransformConfig

Bases: TransformConfig

Configuration for sharding the model.

The JSON schema for this model is identical to the one shown for ShardingTransformConfig under Detect Sharding.

Config:
  • extra: str = allow

  • arbitrary_types_allowed: bool = True

Fields:
  • allgather_strategy (tensorrt_llm._torch.auto_deploy.transform.library.sharding.AllGatherStrategy)

  • allreduce_strategy (tensorrt_llm.functional.AllReduceStrategy)

  • debug_visualize_dir (Optional[str])

  • dist_backend (tensorrt_llm._torch.auto_deploy.transform.library.sharding.DistBackend)

  • dist_config (tensorrt_llm._torch.auto_deploy.utils.dist_config.DistConfig)

  • dist_mapping (dict[str, int])

  • enable_attention_dp (bool)

  • enabled (bool)

  • expect_mem_change (bool)

  • factory_config (Dict[str, Any])

  • factory_source (tensorrt_llm._torch.auto_deploy.models.factory.ShardingConfigSource)

  • manual_config (Dict[str, Any])

  • mapping (Any)

  • requires_clean_graph (bool)

  • requires_shape_prop (bool)

  • run_graph_cleanup (bool)

  • run_per_gm (bool)

  • run_shape_prop (bool)

  • shard_all_unprocessed (bool)

  • shard_layers (List[str] | None)

  • sharding_dims (List[tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingDim])

  • sharding_source (List[tensorrt_llm._torch.auto_deploy.transform.library.sharding.ShardingSource])

  • simple_shard_filter (str | None)

  • simple_shard_only (bool)

  • skip_on_error (bool)

  • stage (Stages)

  • support_partial_config (bool)

Validators:
  • _validate_allgather_strategy » allgather_strategy

  • _validate_allreduce_strategy » allreduce_strategy

field allgather_strategy: AllGatherStrategy = AllGatherStrategy.AUTO

AllGather strategy for distributed operations. Options: AUTO (NCCL AllGather), SYMM_MEM (symmetric memory with MULTIMEM, falls back to NCCL for unsupported cases).

field allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO

AllReduce strategy for distributed operations. Options: AUTO (automatic selection), NCCL, ONESHOT, TWOSHOT, MIN_LATENCY, LOWPRECISION, UB, MNNVL, NCCL_SYMMETRIC

field debug_visualize_dir: str | None = None

Debug visualization directory. None to disable visualization, or a path string to specify the output directory.

field dist_backend: DistBackend = DistBackend.AUTO
field dist_config: DistConfig [Optional]
field dist_mapping: dict[str, int] [Optional]
field enable_attention_dp: bool = False

When True, skip TP sharding as attention data parallelism is enabled.

field enabled: bool = True

Whether to enable this transform.

field expect_mem_change: bool = False

Whether this transform is expected to cause changes in CUDA memory stats.

field factory_config: Dict[str, Any] [Optional]
field factory_source: ShardingConfigSource = ShardingConfigSource.UNKNOWN
field manual_config: Dict[str, Any] [Optional]
field mapping: Any = None
field requires_clean_graph: bool = True

Whether this transform requires the graph to be clean before it is applied.

field requires_shape_prop: bool = False

Whether this transform requires shape propagation before it is applied.

field run_graph_cleanup: bool = True

Whether to run graph cleanup/canonicalization after this transform.

field run_per_gm: bool = True

Whether to run the transform per graph (sub)module or on whole module.

field run_shape_prop: bool = False

Whether to run shape propagation after this transform.

field shard_all_unprocessed: bool = False

When True, apply simple shard (column split + all_gather) to ‘leftover’ linear nodes that are not part of any layer subgraph.

field shard_layers: List[str] | None = None

When set, only shard nodes whose layer_type hint is in this list. Nodes with layer_type='unknown' or missing are NOT sharded. When None (default), all enable_sharding nodes are processed regardless of layer_type.
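The three cases in this description (filter set and matching, filter set with an unknown or missing hint, and no filter) can be sketched as a single predicate. The function name `should_shard`, the dictionary representation of node metadata, and the example layer types are hypothetical.

```python
from typing import Dict, List, Optional


def should_shard(node_meta: Dict[str, str], shard_layers: Optional[List[str]]) -> bool:
    """Decide whether a candidate node passes the shard_layers filter."""
    if shard_layers is None:
        return True  # default: layer_type is ignored
    layer_type = node_meta.get("layer_type", "unknown")
    # Missing or 'unknown' hints are never sharded when a filter is set.
    return layer_type != "unknown" and layer_type in shard_layers


print(should_shard({"layer_type": "mlp"}, ["mlp"]))
print(should_shard({}, ["mlp"]))
print(should_shard({}, None))
```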

field sharding_dims: List[ShardingDim] [Optional]
field sharding_source: List[ShardingSource] [Optional]
field simple_shard_filter: str | None = None

Comma-separated list of substrings to filter which unprocessed linear nodes are simple-sharded. A node is included if its name contains ANY of the listed keywords. Example: ‘lm_head,shared_expert’. Only effective when shard_all_unprocessed is True. When None, all unprocessed linear nodes are sharded.

field simple_shard_only: bool = False
field skip_on_error: bool = False

Whether to skip the transform if an error occurs.

field stage: Stages [Required]

The stage of the transformation pipeline where this transform should run.
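Because the Stages enum is ordered, the stage value effectively determines where a transform sits in the pipeline. A minimal sketch of that pre-ordering is shown below. The stage names are copied from the schema; the `order_transforms` helper, the two `hypothetical_*` transform names, and the assumption that both sharding transforms use the `sharding` stage are illustrative only, and how the real pipeline orders transforms within one stage is not covered here.

```python
from typing import Dict, List

# Stage names in pipeline order, as listed in the Stages enum.
STAGES = [
    "factory",
    "export",
    "post_export",
    "pattern_matcher",
    "sharding",
    "weight_load",
    "post_load_fusion",
    "cache_init",
    "visualize",
    "compile",
]


def order_transforms(transforms: Dict[str, Dict[str, str]]) -> List[str]:
    """Sort transform keys by stage; ties keep their original config order."""
    return sorted(transforms, key=lambda name: STAGES.index(transforms[name]["stage"]))


transforms = {
    "hypothetical_compile_step": {"stage": "compile"},
    "detect_sharding": {"stage": "sharding"},
    "sharding_transform_executor": {"stage": "sharding"},
    "hypothetical_export_step": {"stage": "export"},
}
print(order_transforms(transforms))
```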

field support_partial_config: bool = True
validate_config(sources: ShardingSource | List[ShardingSource] = None) -> bool

Apply Sharding Hints

Transform key: apply_sharding_hints

Source module: tensorrt_llm._torch.auto_deploy.transform.library.sharding_ir

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.library.sharding_ir.ApplyShardingHints(config: TransformConfig)

Bases: BaseTransform

Deterministic, node-local sharding transform driven by hint kwargs.

Iterates graph nodes and applies sharding based on explicit hint arguments (tp_mode, tp_scaled_dim, tp_scale_sizes, etc.) together with the runtime DistConfig. No cross-node propagation, no topology inference.
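The node-local property can be illustrated with a toy planner that looks only at each node's own keyword arguments. Everything here is a simplification for illustration: nodes are plain dictionaries rather than graph nodes, `plan_from_hints` is a hypothetical helper, and the hint value `"colwise"` is a placeholder, since the accepted values of `tp_mode` are not documented in this section.

```python
from typing import Any, Dict, List


def plan_from_hints(nodes: List[Dict[str, Any]], tp_size: int) -> List[Dict[str, Any]]:
    """Build one sharding decision per hinted node, using only that node's kwargs."""
    plan = []
    for node in nodes:
        # Collect hint kwargs such as tp_mode, tp_scaled_dim, tp_scale_sizes.
        hints = {k: v for k, v in node.get("kwargs", {}).items() if k.startswith("tp_")}
        if not hints or tp_size == 1:
            continue  # no hint or a single rank: leave the node untouched
        # No neighbouring node is consulted: the decision is purely local.
        plan.append({"node": node["name"], "tp_size": tp_size, **hints})
    return plan


nodes = [
    {"name": "q_proj", "kwargs": {"tp_mode": "colwise"}},  # placeholder hint value
    {"name": "norm", "kwargs": {}},
]
print(plan_from_hints(nodes, tp_size=2))
```

Since each decision depends only on the node's hints and the distributed configuration, running the planner twice on the same graph yields the same plan, which is what makes the transform deterministic.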

classmethod get_config_class() -> Type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.library.sharding_ir.IRShardingConfig

Bases: TransformConfig

Minimal configuration for the hint-driven IR sharding transform.

This replaces the legacy ShardingTransformConfig for ApplyShardingHints, carrying only the fields that the IR path actually reads. When the legacy sharding path is removed, this is the only sharding config class.

Show JSON schema
{
   "title": "IRShardingConfig",
   "description": "Minimal configuration for the hint-driven IR sharding transform.\n\nThis replaces the legacy ``ShardingTransformConfig`` for\n``ApplyShardingHints``, carrying only the fields that the IR path actually\nreads.  When the legacy sharding path is removed, this is the only sharding\nconfig class.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "run_per_gm": {
         "default": true,
         "description": "Whether to run the transform per graph (sub)module or on whole module.",
         "title": "Run Per Gm",
         "type": "boolean"
      },
      "enabled": {
         "default": true,
         "description": "Whether to enable this transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "skip_on_error": {
         "default": false,
         "description": "Whether to skip the transform if an error occurs.",
         "title": "Skip On Error",
         "type": "boolean"
      },
      "run_graph_cleanup": {
         "default": true,
         "description": "Whether to run graph cleanup/canonicalization after this transform.",
         "title": "Run Graph Cleanup",
         "type": "boolean"
      },
      "run_shape_prop": {
         "default": false,
         "description": "Whether to run shape propagation after this transform.",
         "title": "Run Shape Prop",
         "type": "boolean"
      },
      "requires_clean_graph": {
         "default": true,
         "description": "Whether this transform requires the graph to be clean before it is applied.",
         "title": "Requires Clean Graph",
         "type": "boolean"
      },
      "requires_shape_prop": {
         "default": false,
         "description": "Whether this transform requires shape propagation before it is applied.",
         "title": "Requires Shape Prop",
         "type": "boolean"
      },
      "debug_visualize_dir": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Debug visualization directory. None to disable visualization, or a path string to specify the output directory.",
         "title": "Debug Visualize Dir"
      },
      "expect_mem_change": {
         "default": false,
         "description": "Whether this transform is expected to cause changes in CUDA memory stats.",
         "title": "Expect Mem Change",
         "type": "boolean"
      },
      "allreduce_strategy": {
         "$ref": "#/$defs/AllReduceStrategy",
         "default": 3,
         "description": "AllReduce strategy for distributed operations."
      },
      "simple_shard_only": {
         "default": false,
         "title": "Simple Shard Only",
         "type": "boolean"
      },
      "shard_layers": {
         "anyOf": [
            {
               "items": {
                  "type": "string"
               },
               "type": "array"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "When set, only shard nodes whose layer_type hint is in this list.",
         "title": "Shard Layers"
      },
      "simple_shard_filter": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Comma-separated weight-name keywords (e.g. 'lm_head'). Matching linears are gather-sharded (column split + all_gather) regardless of shard_layers -- used for the lm_head vocab projection, which the hint-driven sharder would otherwise replicate.",
         "title": "Simple Shard Filter"
      },
      "enable_attention_dp": {
         "default": false,
         "title": "Enable Attention Dp",
         "type": "boolean"
      },
      "dist_mapping": {
         "additionalProperties": {
            "type": "integer"
         },
         "title": "Dist Mapping",
         "type": "object"
      },
      "dist_config": {
         "$ref": "#/$defs/DistConfig"
      }
   },
   "$defs": {
      "AllReduceStrategy": {
         "enum": [
            0,
            1,
            2,
            3,
            4,
            5,
            6,
            7,
            8,
            9
         ],
         "title": "AllReduceStrategy",
         "type": "integer"
      },
      "DistConfig": {
         "additionalProperties": true,
         "description": "Distributed parallelism configuration for AutoDeploy.",
         "properties": {
            "world_size": {
               "default": 1,
               "minimum": 1,
               "title": "World Size",
               "type": "integer"
            },
            "rank": {
               "default": 0,
               "minimum": 0,
               "title": "Rank",
               "type": "integer"
            },
            "tp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Tp Size",
               "type": "integer"
            },
            "pp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Pp Size",
               "type": "integer"
            },
            "moe_tp_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Tp Size",
               "type": "integer"
            },
            "moe_ep_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Ep Size",
               "type": "integer"
            },
            "moe_cluster_size": {
               "default": 1,
               "minimum": 1,
               "title": "Moe Cluster Size",
               "type": "integer"
            },
            "enable_attention_dp": {
               "default": false,
               "title": "Enable Attention Dp",
               "type": "boolean"
            },
            "allreduce_strategy": {
               "default": "NCCL",
               "title": "Allreduce Strategy",
               "type": "string"
            }
         },
         "title": "DistConfig",
         "type": "object"
      },
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": true,
   "required": [
      "stage"
   ]
}
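The schema marks `stage` as the only required field and describes the `Stages` enum as ordered. A minimal sketch of what that implies for a set of transform entries is shown below. The helper names, the `hypothetical_*` transform keys, and the use of a stable sort are assumptions for illustration; the actual pipeline ordering logic is not shown in this reference.

```python
from typing import Dict, List

# Stage names copied from the "Stages" enum in the schema, in listed order.
STAGES = [
    "factory", "export", "post_export", "pattern_matcher", "sharding",
    "weight_load", "post_load_fusion", "cache_init", "visualize", "compile",
]


def validate_entry(entry: dict) -> dict:
    """Check the one requirement the schema imposes: a valid 'stage'."""
    if "stage" not in entry:
        raise ValueError("'stage' is required")
    if entry["stage"] not in STAGES:
        raise ValueError(f"unknown stage: {entry['stage']!r}")
    return entry


def order_by_stage(entries: Dict[str, dict]) -> List[str]:
    """Pre-order transform keys by the position of their stage in STAGES."""
    for entry in entries.values():
        validate_entry(entry)
    # sorted() is stable, so entries in the same stage keep insertion order.
    return sorted(entries, key=lambda name: STAGES.index(entries[name]["stage"]))


entries = {
    "hypothetical_compile_step": {"stage": "compile"},
    "apply_sharding_hints": {"stage": "sharding", "simple_shard_filter": "lm_head"},
    "hypothetical_export_step": {"stage": "export"},
    "detect_sharding": {"stage": "sharding"},
}
ordered = order_by_stage(entries)
print(ordered)
```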

Config:
  • extra: str = allow

Fields:
  • allreduce_strategy (tensorrt_llm.functional.AllReduceStrategy)

  • dist_config (tensorrt_llm._torch.auto_deploy.utils.dist_config.DistConfig)

  • dist_mapping (dict[str, int])

  • enable_attention_dp (bool)

  • shard_layers (List[str] | None)

  • simple_shard_filter (str | None)

  • simple_shard_only (bool)

Validators:
  • _validate_allreduce_strategy » allreduce_strategy

field allreduce_strategy: AllReduceStrategy = AllReduceStrategy.AUTO

AllReduce strategy for distributed operations.

field dist_config: DistConfig [Optional]
field dist_mapping: dict[str, int] [Optional]
field enable_attention_dp: bool = False
field shard_layers: List[str] | None = None

When set, only shard nodes whose layer_type hint is in this list.
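A minimal sketch of the filter this field describes, assuming each node carries an optional `layer_type` hint. The helper name `passes_shard_layers` and the layer-type strings (`"mha"`, `"mlp"`, `"moe"`) are hypothetical, and the treatment of nodes without a hint is an assumption.

```python
from typing import List, Optional


def passes_shard_layers(
    layer_type: Optional[str], shard_layers: Optional[List[str]]
) -> bool:
    """Return True if a node with this layer_type hint may be sharded."""
    if shard_layers is None:
        return True  # field unset: no layer-type restriction
    # Assumption: a node with no layer_type hint is excluded once a list is set.
    return layer_type in shard_layers


# Hypothetical layer-type values, for illustration only.
print(passes_shard_layers("mlp", None))
print(passes_shard_layers("mlp", ["mha", "mlp"]))
print(passes_shard_layers("moe", ["mha", "mlp"]))
```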

field simple_shard_filter: str | None = None

Comma-separated weight-name keywords (e.g. ‘lm_head’). Matching linears are gather-sharded (column split + all_gather) regardless of shard_layers. This is used for the lm_head vocab projection, which the hint-driven sharder would otherwise replicate.
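To make "column split + all_gather" concrete, the sketch below simulates it on a tiny linear layer in plain Python. It assumes that the split is along the output-feature dimension (as in column-parallel linear layers), that the weight is stored as `[out_features][in_features]`, and that the output size divides evenly by the number of ranks. All function names are illustrative, and `all_gather` here is a stand-in for the real collective.

```python
from typing import List

Matrix = List[List[float]]


def linear(x: List[float], weight: Matrix) -> List[float]:
    """Compute y = W @ x with weight stored as [out_features][in_features]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def column_split(weight: Matrix, rank: int, tp_size: int) -> Matrix:
    """Return this rank's contiguous slice of output rows."""
    out_features = len(weight)
    assert out_features % tp_size == 0, "sketch assumes even divisibility"
    shard = out_features // tp_size
    return weight[rank * shard:(rank + 1) * shard]


def all_gather(parts: List[List[float]]) -> List[float]:
    """Stand-in for the collective: concatenate per-rank outputs in rank order."""
    return [value for part in parts for value in part]


# Toy "lm_head": vocabulary of 4 tokens, hidden size 3.
weight = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [3.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
]
x = [1.0, 2.0, 3.0]
tp_size = 2

# Each rank holds half of the vocabulary rows and computes its slice of logits.
per_rank = [linear(x, column_split(weight, r, tp_size)) for r in range(tp_size)]
gathered = all_gather(per_rank)

print(gathered)            # matches the unsharded result
print(linear(x, weight))
```

Each rank stores only `out_features / tp_size` rows of the weight, which is the memory saving over replicating the full vocabulary projection on every rank.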

field simple_shard_only: bool = False

Pipeline Cache

Transform key: pipeline_cache

Source module: tensorrt_llm._torch.auto_deploy.transform.pipeline_cache.pipeline_cache

Configured modes: graph

class tensorrt_llm._torch.auto_deploy.transform.pipeline_cache.pipeline_cache.PipelineCache(
config: TransformConfig,
)

Bases: BaseTransform

Transform that snapshots/restores the model at its configured pipeline position.

classmethod get_config_class() → type[TransformConfig]

Get the configuration class for the transform.

This is used to validate the configuration of the transform.

maybe_restore(
_cm: CachedSequenceInterface,
factory: ModelFactory,
shared_config: SharedConfig,
transform_index: int,
) → Module | None

Return a cached module for this transform point, or None on a miss.
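The hit-or-miss contract can be illustrated with a toy snapshot cache. This is not the `PipelineCache` implementation: `ToySnapshotCache`, its file naming, and the use of `pickle` (instead of the torch-save mechanism the real transform is described as using) are assumptions made to keep the sketch dependency-free.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


class ToySnapshotCache:
    """Hypothetical stand-in: one snapshot file per transform position."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, transform_index: int) -> Path:
        return self.root / f"transform_{transform_index}.pkl"

    def snapshot(self, transform_index: int, model: Any) -> None:
        """Persist the model state reached at this pipeline position."""
        with self._path(transform_index).open("wb") as f:
            pickle.dump(model, f)

    def maybe_restore(self, transform_index: int) -> Optional[Any]:
        """Return the cached object for this position, or None on a miss."""
        path = self._path(transform_index)
        if not path.exists():
            return None
        with path.open("rb") as f:
            return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    cache = ToySnapshotCache(Path(tmp) / "pipeline_cache")
    miss = cache.maybe_restore(3)              # nothing stored yet
    cache.snapshot(3, {"weights": [1, 2, 3]})
    hit = cache.maybe_restore(3)               # restored from disk

print(miss, hit)
```

Returning `None` on a miss lets the caller fall through to running the remaining transforms normally, without treating an empty cache as an error.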

YAML configuration

The fields below can be set under this transform’s entry in the AutoDeploy config YAML.

pydantic model tensorrt_llm._torch.auto_deploy.transform.pipeline_cache.pipeline_cache.PipelineCacheConfig

Bases: TransformConfig

Configuration for the torch-save pipeline cache transform.

Show JSON schema
{
   "title": "PipelineCacheConfig",
   "description": "Configuration for the torch-save pipeline cache transform.",
   "type": "object",
   "properties": {
      "stage": {
         "$ref": "#/$defs/Stages",
         "description": "The stage of the transformation pipeline where this transform should run."
      },
      "enabled": {
         "default": false,
         "description": "Whether to enable the torch-save pipeline cache transform.",
         "title": "Enabled",
         "type": "boolean"
      },
      "root": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Cache root. Defaults to ~/.cache/tensorrt_llm/auto_deploy/pipeline_cache when the transform is enabled.",
         "title": "Root"
      }
   },
   "$defs": {
      "Stages": {
         "description": "Enumerated (ordered!) stages of the transformation pipeline.\n\nThis is used to classify and pre-order transforms.",
         "enum": [
            "factory",
            "export",
            "post_export",
            "pattern_matcher",
            "sharding",
            "weight_load",
            "post_load_fusion",
            "cache_init",
            "visualize",
            "compile"
         ],
         "title": "Stages",
         "type": "string"
      }
   },
   "additionalProperties": false,
   "required": [
      "stage"
   ]
}

Config:
  • extra: str = forbid

Fields:
  • enabled (bool)

  • root (str | None)

Validators:
  • validate_enabled_cache » all fields

field enabled: bool = False

Whether to enable the torch-save pipeline cache transform.

field root: str | None = None

Cache root. Defaults to ~/.cache/tensorrt_llm/auto_deploy/pipeline_cache when the transform is enabled.
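A minimal sketch of the defaulting rule for `root`, under the assumption that a disabled transform needs no root at all. The helper name `resolve_cache_root` is hypothetical; only the default path string comes from the field description.

```python
import os
from pathlib import Path
from typing import Optional

# Default location taken from the field description.
DEFAULT_ROOT = "~/.cache/tensorrt_llm/auto_deploy/pipeline_cache"


def resolve_cache_root(enabled: bool, root: Optional[str]) -> Optional[Path]:
    """Pick the cache directory: explicit root if given, else the default."""
    if not enabled:
        return None  # assumption: no root is resolved when the cache is off
    chosen = root if root is not None else DEFAULT_ROOT
    # Expand a leading "~" to the current user's home directory.
    return Path(os.path.expanduser(chosen))


print(resolve_cache_root(False, None))
print(resolve_cache_root(True, "/tmp/my_pipeline_cache"))
print(resolve_cache_root(True, None))
```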

validator validate_enabled_cache  »  all fields
debug_visualize_dir: ClassVar[str | None] = None
expect_mem_change: ClassVar[bool] = False
requires_clean_graph: ClassVar[bool] = False
requires_shape_prop: ClassVar[bool] = False
run_graph_cleanup: ClassVar[bool] = False
run_per_gm: ClassVar[bool] = False
run_shape_prop: ClassVar[bool] = False
skip_on_error: ClassVar[bool] = True