==========================
TensorRT-LLM
==========================

.. note::

    Please read the `TensorRT-LLM checkpoint workflow`_ first before going through this section.

The ModelOpt toolkit supports automatic conversion of ModelOpt-exported LLMs to the TensorRT-LLM checkpoint format and the engines for accelerated inference. This conversion is achieved by:

#. Converting Hugging Face, NeMo, and ModelOpt exported checkpoints to the TensorRT-LLM checkpoint.
#. Building the TensorRT-LLM engine from the TensorRT-LLM checkpoint.

Export Quantized Model
======================

After the model is quantized, it can be exported to the TensorRT-LLM checkpoint format, which is stored as:

#. A single JSON file recording the model structure and metadata (``config.json``).
#. A group of safetensors files, each recording the local calibrated model on a single GPU rank (model weights and scaling factors per GPU).

The export API (:meth:`export_tensorrt_llm_checkpoint`) can be used as follows:

.. code-block:: python

    from modelopt.torch.export import export_tensorrt_llm_checkpoint

    with torch.inference_mode():
        export_tensorrt_llm_checkpoint(
            model,  # The quantized model.
            decoder_type,  # The type of the model as str, e.g. gpt, gptj, llama.
            dtype,  # The weights data type to export the unquantized layers.
            export_dir,  # The directory where the exported files will be stored.
            inference_tensor_parallel,  # The number of GPUs used for tensor parallelism at inference time.
            inference_pipeline_parallel,  # The number of GPUs used for pipeline parallelism at inference time.
        )

If the :meth:`export_tensorrt_llm_checkpoint` call is successful, the TensorRT-LLM checkpoint will be saved. Otherwise, e.g. if the ``decoder_type`` is not supported, a torch ``state_dict`` checkpoint will be saved instead.

.. list-table:: Model support matrix for the TensorRT-LLM checkpoint export
    :header-rows: 1

    * - Model / Quantization
      - FP16 / BF16
      - FP8
      - INT8_SQ
      - INT4_AWQ
    * - GPT2
      - Yes
      - Yes
      - Yes
      - No
    * - GPTJ
      - Yes
      - Yes
      - Yes
      - Yes
    * - LLAMA 2
      - Yes
      - Yes
      - Yes
      - Yes
    * - LLAMA 3
      - Yes
      - Yes
      - No
      - Yes
    * - Mistral
      - Yes
      - Yes
      - Yes
      - Yes
    * - Mixtral 8x7B
      - Yes
      - Yes
      - No
      - Yes
    * - Falcon 40B, 180B
      - Yes
      - Yes
      - Yes
      - Yes
    * - Falcon 7B
      - Yes
      - Yes
      - Yes
      - No
    * - MPT 7B, 30B
      - Yes
      - Yes
      - Yes
      - Yes
    * - Baichuan 1, 2
      - Yes
      - Yes
      - Yes
      - Yes
    * - ChatGLM2, 3 6B
      - Yes
      - No
      - No
      - Yes
    * - Bloom
      - Yes
      - Yes
      - Yes
      - Yes
    * - Phi-1, 2, 3
      - Yes
      - Yes
      - Yes
      - Yes
    * - Nemotron 8
      - Yes
      - Yes
      - No
      - Yes
    * - Gemma 2B, 7B
      - Yes
      - Yes
      - No
      - Yes
    * - Recurrent Gemma
      - Yes
      - Yes
      - Yes
      - Yes
    * - StarCoder 2
      - Yes
      - Yes
      - Yes
      - Yes
    * - Qwen-1, 1.5
      - Yes
      - Yes
      - Yes
      - Yes

Convert to TensorRT-LLM
=======================

Once the TensorRT-LLM checkpoint is available, please follow the `TensorRT-LLM build API`_ to build and deploy the quantized LLM.
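
As a quick reference, below is a minimal end-to-end sketch (not part of the official workflow above) that quantizes a Hugging Face model to FP8 with ModelOpt and then exports the TensorRT-LLM checkpoint. The model name, calibration prompts, export directory, and parallelism settings are illustrative placeholders; adapt them to your own model and deployment.

.. code-block:: python

    # Minimal sketch: FP8 quantization followed by TensorRT-LLM checkpoint export.
    # Assumptions: a LLaMA-style Hugging Face checkpoint and a tiny placeholder
    # calibration loop; replace both with your real model and calibration data.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_tensorrt_llm_checkpoint

    model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model name
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16
    ).cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name)


    def calibrate(m):
        # Placeholder calibration loop: run a few representative prompts through
        # the model so that activation scaling factors can be collected.
        for prompt in ["Hello, world!", "TensorRT-LLM accelerates inference."]:
            inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
            m(**inputs)


    # Quantize the model using the default ModelOpt FP8 configuration.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, calibrate)

    # Export the TensorRT-LLM checkpoint (config.json + safetensors shards).
    with torch.inference_mode():
        export_tensorrt_llm_checkpoint(
            model,
            decoder_type="llama",
            dtype=torch.float16,
            export_dir="/tmp/trtllm_ckpt",  # placeholder export directory
            inference_tensor_parallel=1,
            inference_pipeline_parallel=1,
        )

The resulting export directory contains the ``config.json`` and safetensors shards described above and can then be passed to the TensorRT-LLM build step (for example, via the ``trtllm-build`` command-line tool) as described in the build API documentation.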