## Incorporating `auto_deploy` into your own workflow

AutoDeploy can be seamlessly integrated into existing workflows using TRT-LLM's LLM high-level API. This section provides an example of configuring and invoking AutoDeploy in a custom application.

The following example demonstrates how to build an LLM object with AutoDeploy integration:

```python
from tensorrt_llm._torch.auto_deploy import LLM


# Construct the LLM high-level interface object with AutoDeploy as the backend
llm = LLM(
    model=<HF_MODEL_CARD_OR_DIR>,
    world_size=<DESIRED_WORLD_SIZE>,
    model_factory="AutoModelForCausalLM",  # choose the appropriate model factory
    model_kwargs={"num_hidden_layers": 2},  # test with a smaller model configuration
    transforms={
        "insert_cached_attention": {"backend": "flashinfer"},  # or "triton"
        "insert_cached_mla_attention": {"backend": "MultiHeadLatentAttention"},
        "resize_kv_cache": {"free_mem_ratio": 0.8},
        "compile_model": {"backend": "torch-compile"},
        "detect_sharding": {"simple_shard_only": False},
    },
    attn_page_size=64,  # page size for attention
    skip_loading_weights=False,
    max_seq_len=<MAX_SEQ_LEN>,
    max_batch_size=<MAX_BATCH_SIZE>,
)
```
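
Once constructed, the `llm` object can be used like any other TRT-LLM `LLM` instance. The snippet below is a minimal usage sketch assuming the standard `LLM.generate` / `SamplingParams` interface; the prompts and sampling values are purely illustrative.

```python
from tensorrt_llm import SamplingParams

# Illustrative sampling settings; tune these for your application.
sampling_params = SamplingParams(max_tokens=64, temperature=0.8)

# Submit a batch of prompts to the AutoDeploy-backed LLM.
outputs = llm.generate(
    ["What is the capital of France?", "Explain KV caching in one sentence."],
    sampling_params,
)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Completion: {output.outputs[0].text!r}")
```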

For more information about configuring AutoDeploy via the LLM API using `**kwargs`, see the AutoDeploy `LLM` API in `tensorrt_llm._torch.auto_deploy.llm` and the `AutoDeployConfig` class in `tensorrt_llm._torch.auto_deploy.llm_args`.
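
If you prefer to discover the available options programmatically rather than reading the source, the sketch below assumes `AutoDeployConfig` is a pydantic v2 model (as TRT-LLM's LLM argument classes generally are) and simply lists its declared fields; verify against the class definition in your installed version.

```python
from tensorrt_llm._torch.auto_deploy.llm_args import AutoDeployConfig

# Assumption: AutoDeployConfig is a pydantic v2 model. If your version is
# built on pydantic v1, inspect AutoDeployConfig.__fields__ instead.
for name, field in AutoDeployConfig.model_fields.items():
    print(f"{name}: {field.annotation} (default={field.default!r})")
```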