model_config_export

Code that export optimized models to the TensorRT-LLM checkpoint.

Functions

`export_tensorrt_llm_checkpoint`	Exports the torch model to the TensorRT-LLM checkpoint and save to the export_dir.
`torch_to_tensorrt_llm_checkpoint`	Converts the torch model to the TensorRT-LLM checkpoint per GPU rank.

export_tensorrt_llm_checkpoint(model, decoder_type, dtype=None, export_dir='/tmp', inference_tensor_parallel=0, inference_pipeline_parallel=1, use_nfs_workspace=False)

Exports the torch model to the TensorRT-LLM checkpoint and save to the export_dir.

Parameters:

model (Module) – the torch model.
decoder_type (str) – the type of the decoder, e.g. gpt, gptj, llama. Please see the model_utils.py for the supported models.
dtype (dtype | None) – the weights data type to export the unquantized layers or the default model data type if None.
export_dir (Path | str) – the target export path.
inference_tensor_parallel (int) – The target inference time tensor parallel. We will merge or split the calibration tensor parallelism to inference. Default is 0, meaning using the calibration without manual config merge or split.
inference_pipeline_parallel (int) – The target inference time pipeline parallel. We will merge or split the calibration pipeline parallelism to inference. Default is 1, meaning no pipeline parallelism.
inference_pipeline_parallel – The target inference time pipeline parallel.
use_nfs_workspace (bool) – if True, the an NFS workspace will be created under the export_dir and used as a shared memory for cross process/node communication.

For tensorrt_llm deployment, save the representation under export_dir. We will save the model_config as two files:

.json: The nested dict that maps to the PretrainedConfig in TensorRT-LLM.
https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/modeling_utils.py.

.safetensors: The file for the list of weights as safetensors. Unique for each rank.

torch_to_tensorrt_llm_checkpoint(model, decoder_type, dtype=None, inference_tensor_parallel=0, inference_pipeline_parallel=1, workspace_path=None)

Converts the torch model to the TensorRT-LLM checkpoint per GPU rank.

TensorRT-LLM checkpoint is the LLM model format that can be used by the TensorRT-LLM build API. for the engine building process. https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md

Parameters:

model (Module) – the torch model.
decoder_type (str) – the type of the decoder, e.g. gpt, gptj, llama. Please see modelopt.torch.export.model_utils for the supported models.
dtype (dtype | None) – the weights data type to export the unquantized layers or the default model data type if None.
inference_tensor_parallel (int) – The target inference time tensor parallel. We will merge or split the calibration tensor parallelism to inference. Default is 0, meaning using the calibration without manual config merge or split.
inference_pipeline_parallel (int) – The target inference time pipeline parallel. We will merge or split the calibration pipeline parallelism to inference. Default is 1, meaning no pipeline parallelism.
workspace_path (Path | str | None) – the path to the NFS directory for postprocess cross rank communication.

Yields:

A tuple of: tensorrt_llm_config: A dict that maps to the PretrainedConfig in TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/modeling_utils.py weights: A dict that stores all model weights and scaling factors for each rank. per_layer_quantization: A dict that contains layer-wise quantization information for all quantized layers for mixed_precision, empty dictionary otherwise.

Return type:

Iterator[tuple[dict[str, Any], dict[str, Tensor], dict[str, Any]]]