LLM API Introduction#

The LLM API is a high-level Python API designed to streamline LLM inference workflows.

It supports a broad range of use cases, from single-GPU setups to multi-GPU and multi-node deployments, with built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA Dynamo and the Triton Inference Server.

While the LLM API simplifies inference workflows with a high-level interface, it is also designed with flexibility in mind. Under the hood, it uses a PyTorch-native and modular backend, making it easy to customize, extend, or experiment with the runtime.

Supported Models#

  • DeepSeek variants

  • Llama (including variants Mistral, Mixtral, InternLM)

  • GPT (including variants Starcoder-1/2, Santacoder)

  • Gemma-1/2/3

  • Phi-1/2/3/4

  • ChatGLM (including variants glm-10b, chatglm, chatglm2, chatglm3, glm4)

  • QWen-1/1.5/2/3

  • Falcon

  • Baichuan-1/2

  • GPT-J

  • Mamba-1/2

Note: For the most up-to-date list of supported models, you may refer to the TensorRT-LLM model definitions.

Quick Start Example#

A simple inference example with TinyLlama using the LLM API:
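The following is a minimal sketch of such a script. It assumes that LLM and SamplingParams are importable from the top-level tensorrt_llm package and that llm.generate() returns outputs exposing the prompt and generated text; adjust the names to match your installed version:

# Minimal sketch: offline inference with the LLM API (names assumed as noted above).
from tensorrt_llm import LLM, SamplingParams


def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # Downloads the checkpoint from the Hugging Face Hub on first use.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


# Guarding the entry point is required for multi-GPU runs (see Tips and Troubleshooting).
if __name__ == "__main__":
    main()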

More examples can be found here.

Model Input#

The LLM() constructor accepts either a Hugging Face model ID or a local model path as input.

1. Using a Model from the Hugging Face Hub#

To load a model directly from the Hugging Face Model Hub, simply pass its model ID (i.e., repository name) to the LLM constructor. The model will be automatically downloaded:

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

You can also use quantized checkpoints (FP4, FP8, etc.) of popular models provided by NVIDIA in the same way.
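For example, a quantized checkpoint can be loaded by its Hub repository ID exactly as above (the repository name below is illustrative, not an exact ID):

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8")  # illustrative quantized checkpoint ID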

2. Using a Local Hugging Face Model#

To use a model from local storage, first download it manually:

git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B

Then, load the model by specifying a local directory path:

llm = LLM(model=<local_path_to_model>)
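For example, with the checkpoint cloned above (assuming the clone sits in the current working directory):

llm = LLM(model="./Meta-Llama-3.1-8B")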

Note: Some models require accepting specific license agreements. Make sure you have agreed to the terms and authenticated with Hugging Face before downloading.
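For example, you can authenticate from the command line before downloading gated models (assuming the Hugging Face Hub CLI is installed):

huggingface-cli login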

Tips and Troubleshooting#

The following tips typically help new LLM API users who are already familiar with other TensorRT-LLM APIs:

  • RuntimeError: only rank 0 can start multi-node session, got 1

    There is no need to add an mpirun prefix for launching single node multi-GPU inference with the LLM API.

    For example, you can run python llm_inference_distributed.py to perform multi-GPU inference on a single node.

  • Hang issue on Slurm Node

    If you experience a hang or other issue on a node managed with Slurm, prepend mpirun -n 1 --oversubscribe --allow-run-as-root to your launch command.

    For example, try mpirun -n 1 --oversubscribe --allow-run-as-root python llm_inference_distributed.py.

  • MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.

    Because the LLM API relies on the mpi4py library, wrap your use of the LLM class in a function and protect the program's main entry point with an if __name__ == "__main__": guard to prevent mpi4py from recursively spawning processes (see the sketch after this list).

    This limitation applies to multi-GPU inference only.

  • Cannot quit after generation

    The LLM instance manages threads and processes, which may prevent its reference count from reaching zero. To address this issue, there are two common solutions:

    1. Wrap the LLM instance in a function, as demonstrated in the quickstart guide. When the function returns, the reference count can drop to zero and the shutdown process is triggered.

    2. Use LLM as a context manager (with LLM(...) as llm: ...); the shutdown method is invoked automatically once execution leaves the with-statement block (see the sketch below).
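The following sketch combines the last two tips: the __main__ guard keeps mpi4py from recursively spawning processes, and the context manager shuts the LLM instance down cleanly. It assumes llm.generate() and the output attributes shown below; adjust the names to match your installed version:

# Minimal sketch: clean startup and shutdown of the LLM instance.
from tensorrt_llm import LLM


def main():
    # The context manager invokes the shutdown method automatically when
    # the with-block exits, so the process can terminate after generation.
    with LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0") as llm:
        outputs = llm.generate(["Hello, my name is"])
        for output in outputs:
            print(output.outputs[0].text)


# The guard prevents mpi4py from recursively spawning processes when this
# script is re-imported by worker ranks during multi-GPU inference.
if __name__ == "__main__":
    main()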