TensorRT-LLM Deployment

Note

Read the TensorRT-LLM checkpoint workflow before going through this section.

The ModelOpt toolkit supports automatic conversion of a ModelOpt-exported LLM to a TensorRT-LLM checkpoint, and from there to TensorRT-LLM engines for accelerated inference.

This conversion takes two steps:

Converting Hugging Face, NeMo, and ModelOpt-exported checkpoints to the TensorRT-LLM checkpoint format.
Building the TensorRT-LLM engine from the TensorRT-LLM checkpoint.

Export Quantized Model

After the model is quantized, it can be exported to the TensorRT-LLM checkpoint format, which is stored as:

A single JSON file recording the model structure and metadata (config.json).
A group of safetensors files, one per GPU rank, each holding that rank's shard of the calibrated model (model weights and scaling factors).
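To make the layout above concrete, the following minimal sketch reads an export directory and reports what it contains. The helper name summarize_checkpoint is hypothetical and not part of ModelOpt, and the sketch assumes config.json sits at the top level of the directory next to the safetensors files.

```python
import json
from pathlib import Path


def summarize_checkpoint(export_dir):
    """Return (config, weight_files) for an exported checkpoint directory.

    Hypothetical helper: it assumes config.json and the per-rank
    *.safetensors files all sit at the top level of export_dir.
    """
    root = Path(export_dir)
    # config.json records the model structure and metadata.
    config = json.loads((root / "config.json").read_text())
    # One safetensors file is expected per GPU rank.
    weight_files = sorted(p.name for p in root.glob("*.safetensors"))
    return config, weight_files
```

Calling summarize_checkpoint(export_dir) after an export returns the parsed config as a dict and the names of the weight files, which is a quick way to confirm that one file was written per rank.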

The export API (export_tensorrt_llm_checkpoint) can be used as follows:

import torch
from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The model type as a str, e.g. gpt, gptj, llama.
        dtype,  # The weight data type used to export the unquantized layers.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel,  # The number of GPUs used for tensor parallelism at inference time.
        inference_pipeline_parallel,  # The number of GPUs used for pipeline parallelism at inference time.
    )

If the export_tensorrt_llm_checkpoint call succeeds, the TensorRT-LLM checkpoint is saved to export_dir. Otherwise (for example, when the decoder_type is not supported), a torch state_dict checkpoint is saved instead.
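Because a failed export falls back to a different output, it can help to check what was actually written before moving on to the engine build. The sketch below is one way to do that, not ModelOpt functionality: looks_like_trtllm_checkpoint is a hypothetical helper, and the rule that the number of safetensors files equals the tensor-parallel size times the pipeline-parallel size is an assumption based on the one-file-per-rank layout.

```python
from pathlib import Path


def looks_like_trtllm_checkpoint(export_dir, expected_ranks=None):
    """Heuristically check that export_dir holds a TensorRT-LLM checkpoint.

    Hypothetical helper. expected_ranks is assumed to be
    inference_tensor_parallel * inference_pipeline_parallel.
    """
    root = Path(export_dir)
    # A TensorRT-LLM checkpoint always carries a config.json.
    if not (root / "config.json").is_file():
        return False
    num_weight_files = len(list(root.glob("*.safetensors")))
    if expected_ranks is None:
        # Without a rank count, require at least one weight file.
        return num_weight_files > 0
    return num_weight_files == expected_ranks
```

For an export with 2-way tensor parallelism and no pipeline parallelism, looks_like_trtllm_checkpoint(export_dir, expected_ranks=2) should return True only when config.json and exactly two safetensors files are present.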

Model support matrix for the TensorRT-LLM checkpoint export
Model / Quantization	FP16 / BF16	FP8	INT8_SQ	INT4_AWQ
GPT2	Yes	Yes	Yes	No
GPTJ	Yes	Yes	Yes	Yes
LLAMA 2	Yes	Yes	Yes	Yes
LLAMA 3	Yes	Yes	No	Yes
Mistral	Yes	Yes	Yes	Yes
Mixtral 8x7B	Yes	Yes	No	Yes
Falcon 40B, 180B	Yes	Yes	Yes	Yes
Falcon 7B	Yes	Yes	Yes	No
MPT 7B, 30B	Yes	Yes	Yes	Yes
Baichuan 1, 2	Yes	Yes	Yes	Yes
ChatGLM2, 3 6B	Yes	No	No	Yes
Bloom	Yes	Yes	Yes	Yes
Phi-1, 2, 3	Yes	Yes	Yes	Yes
Nemotron 8	Yes	Yes	No	Yes
Gemma 2B, 7B	Yes	Yes	No	Yes
Recurrent Gemma	Yes	Yes	Yes	Yes
StarCoder 2	Yes	Yes	Yes	Yes
Qwen-1, 1.5	Yes	Yes	Yes	Yes
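
The matrix above can also be encoded as a lookup table for a pre-flight check before quantizing and exporting. This is only an illustrative sketch: SUPPORT_MATRIX and is_supported are hypothetical names, and the values are transcribed from the table rather than queried from ModelOpt.

```python
Y, N = True, False

# Column order matches the support matrix header.
QUANT_FORMATS = ("FP16 / BF16", "FP8", "INT8_SQ", "INT4_AWQ")

# Rows transcribed from the support matrix; keys match the table labels.
SUPPORT_MATRIX = {
    "GPT2": (Y, Y, Y, N),
    "GPTJ": (Y, Y, Y, Y),
    "LLAMA 2": (Y, Y, Y, Y),
    "LLAMA 3": (Y, Y, N, Y),
    "Mistral": (Y, Y, Y, Y),
    "Mixtral 8x7B": (Y, Y, N, Y),
    "Falcon 40B, 180B": (Y, Y, Y, Y),
    "Falcon 7B": (Y, Y, Y, N),
    "MPT 7B, 30B": (Y, Y, Y, Y),
    "Baichuan 1, 2": (Y, Y, Y, Y),
    "ChatGLM2, 3 6B": (Y, N, N, Y),
    "Bloom": (Y, Y, Y, Y),
    "Phi-1, 2, 3": (Y, Y, Y, Y),
    "Nemotron 8": (Y, Y, N, Y),
    "Gemma 2B, 7B": (Y, Y, N, Y),
    "Recurrent Gemma": (Y, Y, Y, Y),
    "StarCoder 2": (Y, Y, Y, Y),
    "Qwen-1, 1.5": (Y, Y, Y, Y),
}


def is_supported(model, quant_format):
    """Return True if the table lists quant_format as supported for model."""
    row = SUPPORT_MATRIX[model]
    return row[QUANT_FORMATS.index(quant_format)]
```

For example, is_supported("LLAMA 3", "INT8_SQ") returns False, matching the "No" entry in the table.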

Convert to TensorRT-LLM

Once the TensorRT-LLM checkpoint is available, follow the TensorRT-LLM build API to build and deploy the quantized LLM.
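
As a rough illustration of the build step, the sketch below assembles an invocation of the trtllm-build command-line tool that ships with TensorRT-LLM, pointing it at the exported checkpoint directory. The helper name trtllm_build_command and the directory arguments are hypothetical, only the two basic flags are shown, and the TensorRT-LLM documentation remains the reference for the options a given model needs.

```python
def trtllm_build_command(checkpoint_dir, engine_dir):
    """Assemble a basic trtllm-build invocation as an argument list.

    Hypothetical helper: it only builds the command and does not run it.
    """
    return [
        "trtllm-build",
        "--checkpoint_dir", str(checkpoint_dir),  # Exported TensorRT-LLM checkpoint.
        "--output_dir", str(engine_dir),  # Where the built engine is written.
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where TensorRT-LLM is installed.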