TensorRT-LLM Deployment

Note

Please read the TensorRT-LLM checkpoint workflow before going through this section.

The ModelOpt toolkit supports automatic conversion of a ModelOpt-exported LLM to a TensorRT-LLM checkpoint and engine for accelerated inference.

This conversion is achieved by:

  1. Converting Hugging Face, NeMo, and ModelOpt exported checkpoints to the TensorRT-LLM checkpoint format.

  2. Building the TensorRT-LLM engine from the TensorRT-LLM checkpoint.

Export Quantized Model

After the model is quantized, it can be exported to the TensorRT-LLM checkpoint format, which is stored as:

  1. A single JSON file recording the model structure and metadata (config.json)

  2. A group of safetensors files, each recording the local calibrated model on a single GPU rank (model weights and scaling factors for that rank).
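
The exported directory can be inspected as in the sketch below. The directory name is a placeholder, and the exact config.json fields depend on the model and the ModelOpt version:

import json
from pathlib import Path

export_dir = Path("exported_ckpt")  # hypothetical export directory

# config.json records the model structure and metadata.
config = json.loads((export_dir / "config.json").read_text())
print(config.get("architecture"), config.get("dtype"))

# One safetensors file per inference GPU rank, e.g. rank0.safetensors.
for shard in sorted(export_dir.glob("*.safetensors")):
    print(shard.name)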

The export API (export_tensorrt_llm_checkpoint) can be used as follows:

import torch

from modelopt.torch.export import export_tensorrt_llm_checkpoint

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # The quantized model.
        decoder_type,  # The type of the model as str, e.g. gpt, gptj, llama.
        dtype,  # The weights data type used to export the unquantized layers.
        export_dir,  # The directory where the exported files will be stored.
        inference_tensor_parallel,  # The number of GPUs used for tensor parallelism at inference time.
        inference_pipeline_parallel,  # The number of GPUs used for pipeline parallelism at inference time.
    )
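
For example, a concrete call for a quantized LLaMA model targeting a 2-GPU tensor-parallel deployment might look like the following sketch, where the model variable and the export directory are placeholders:

with torch.inference_mode():
    export_tensorrt_llm_checkpoint(
        model,  # A LLaMA model quantized with ModelOpt.
        decoder_type="llama",
        dtype=torch.float16,
        export_dir="exported_ckpt",
        inference_tensor_parallel=2,
        inference_pipeline_parallel=1,
    )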

If the export_tensorrt_llm_checkpoint call succeeds, the TensorRT-LLM checkpoint is saved. Otherwise, e.g. if the decoder_type is not supported, a torch state_dict checkpoint is saved instead.

Model support matrix for the TensorRT-LLM checkpoint export

Model / Quantization    FP16 / BF16    FP8    INT8_SQ    INT4_AWQ
GPT2                    Yes            Yes    Yes        No
GPTJ                    Yes            Yes    Yes        Yes
LLAMA 2                 Yes            Yes    Yes        Yes
LLAMA 3                 Yes            Yes    No         Yes
Mistral                 Yes            Yes    Yes        Yes
Mixtral 8x7B            Yes            Yes    No         Yes
Falcon 40B, 180B        Yes            Yes    Yes        Yes
Falcon 7B               Yes            Yes    Yes        No
MPT 7B, 30B             Yes            Yes    Yes        Yes
Baichuan 1, 2           Yes            Yes    Yes        Yes
ChatGLM2, 3 6B          Yes            No     No         Yes
Bloom                   Yes            Yes    Yes        Yes
Phi-1, 2, 3             Yes            Yes    Yes        Yes
Nemotron 8              Yes            Yes    No         Yes
Gemma 2B, 7B            Yes            Yes    No         Yes
Recurrent Gemma         Yes            Yes    Yes        Yes
StarCoder 2             Yes            Yes    Yes        Yes
Qwen-1, 1.5             Yes            Yes    Yes        Yes

Convert to TensorRT-LLM

Once the TensorRT-LLM checkpoint is available, please follow the TensorRT-LLM build API to build and deploy the quantized LLM.
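
A minimal sketch of the build step is shown below, assuming the checkpoint was exported with decoder_type="llama". The directory names and build limits are placeholders, and the exact Python API surface may differ across TensorRT-LLM versions:

from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import LLaMAForCausalLM

# Load the TensorRT-LLM checkpoint produced by export_tensorrt_llm_checkpoint.
model = LLaMAForCausalLM.from_checkpoint("exported_ckpt")

# Build the engine; the batch and input-length limits here are illustrative.
engine = build(model, BuildConfig(max_batch_size=8, max_input_len=1024))

# Save the serialized engine for deployment.
engine.save("trt_engines")

The trtllm-build command-line tool provides the same functionality and can consume the checkpoint directory directly, e.g. trtllm-build --checkpoint_dir exported_ckpt --output_dir trt_engines.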