model_config_export
Code that exports optimized models to the TensorRT-LLM checkpoint format.
Functions
- export_tensorrt_llm_checkpoint – Exports the torch model to the TensorRT-LLM checkpoint and saves it to the export_dir.
- torch_to_tensorrt_llm_checkpoint – Converts the torch model to the TensorRT-LLM checkpoint per GPU rank.
- export_tensorrt_llm_checkpoint(model, decoder_type, dtype=torch.float16, export_dir='/tmp', inference_tensor_parallel=0, inference_pipeline_parallel=1, export_npz=False, naive_fp8_quantization=False, use_nfs_workspace=False)
Exports the torch model to the TensorRT-LLM checkpoint and saves it to the export_dir.
- Parameters:
model (Module) – the torch model.
decoder_type (str) – the type of the decoder, e.g. gpt2, gptj, llama or gptnext.
dtype (dtype) – the weight data type used to export the unquantized layers.
export_dir (Path | str) – the target export path.
inference_tensor_parallel (int) – the target inference-time tensor parallelism. The calibration tensor parallelism is merged or split to match this value. Default is 0, meaning the calibration parallelism is used without manual config merge or split.
inference_pipeline_parallel (int) – the target inference-time pipeline parallelism. The calibration pipeline parallelism is merged or split to match this value. Default is 1, meaning no pipeline parallelism.
export_npz (bool) – Whether or not to export the model_config to the old NPZ format for backward compatibility.
naive_fp8_quantization (bool) – Quantize the model naively to FP8 without calibration. All scaling factors are set to 1.
use_nfs_workspace (bool) – if True, an NFS workspace will be created under the export_dir and used as shared storage for cross-process/node communication.
For tensorrt_llm deployment, the representation is saved under export_dir. The model_config is saved as two files:
- .json: the nested dict that maps to the PretrainedConfig in TensorRT-LLM (https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/modeling_utils.py).
- .safetensors: the file for the list of weights as safetensors. Unique for each rank.
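A minimal usage sketch is shown below. The import path, the Hugging Face model, and the export settings are illustrative assumptions, not part of this API reference; adjust them to your installation.

```python
import torch
from transformers import AutoModelForCausalLM  # assumption: any decoder torch model works here

# Assumption: import from wherever the model_config_export module is packaged
# in your installation.
from model_config_export import export_tensorrt_llm_checkpoint

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

# Writes the .json config and the per-rank .safetensors weight files under
# export_dir, splitting the weights for 2-way tensor parallelism at inference.
export_tensorrt_llm_checkpoint(
    model,
    decoder_type="llama",
    dtype=torch.float16,
    export_dir="/tmp/trtllm_ckpt",
    inference_tensor_parallel=2,
)
```

The resulting directory can then be passed to the TensorRT-LLM build step that consumes checkpoints in this format.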
- torch_to_tensorrt_llm_checkpoint(model, decoder_type, dtype=torch.float16, inference_tensor_parallel=0, inference_pipeline_parallel=1, export_npz=False, naive_fp8_quantization=False, workspace_path=None)
Converts the torch model to the TensorRT-LLM checkpoint per GPU rank.
The TensorRT-LLM checkpoint is the LLM model format consumed by the TensorRT-LLM build API for the engine building process. See https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md.
- Parameters:
model (Module) – the torch model.
decoder_type (str) – the type of the decoder, e.g. gpt2, gptj, llama or gptnext.
dtype (dtype) – the weight data type used to export the unquantized layers.
inference_tensor_parallel (int) – the target inference-time tensor parallelism. The calibration tensor parallelism is merged or split to match this value. Default is 0, meaning the calibration parallelism is used without manual config merge or split.
inference_pipeline_parallel (int) – the target inference-time pipeline parallelism. The calibration pipeline parallelism is merged or split to match this value. Default is 1, meaning no pipeline parallelism.
export_npz (bool) – Whether or not to export the model_config to the old NPZ format for backward compatibility.
naive_fp8_quantization (bool) – Quantize the model naively to FP8 without calibration. All scaling factors are set to 1.
workspace_path (Path | str | None) – the path to the NFS directory used for post-process cross-rank communication.
- Yields:
- A tuple of:
  - tensorrt_llm_config: a dict that maps to the PretrainedConfig in TensorRT-LLM (https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/modeling_utils.py).
  - weights: a dict that stores all model weights and scaling factors for each rank.
  - per_layer_quantization: a dict that contains layer-wise quantization information for all quantized layers when exporting in mixed precision; an empty dictionary otherwise.
- Return type:
Iterator[Tuple[Dict[str, Any], Dict[str, Tensor], Dict[str, Any]]]
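A minimal sketch of consuming the per-rank generator follows. The import path, the model loading, the assumption that ranks are yielded in order, and the way the yielded dicts are written to disk (JSON-serializable config, safetensors per rank) are all illustrative assumptions rather than documented behavior.

```python
import json
import os

import torch
from safetensors.torch import save_file
from transformers import AutoModelForCausalLM  # assumption: an HF decoder model

# Assumption: import from wherever the model_config_export module is packaged.
from model_config_export import torch_to_tensorrt_llm_checkpoint

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)

export_dir = "/tmp/trtllm_ckpt"
os.makedirs(export_dir, exist_ok=True)

# Assumption: ranks are yielded in order, so the loop index is used as the rank id.
for rank, (tensorrt_llm_config, weights, per_layer_quantization) in enumerate(
    torch_to_tensorrt_llm_checkpoint(
        model,
        decoder_type="llama",
        dtype=torch.float16,
        inference_tensor_parallel=2,
    )
):
    if rank == 0:
        # The config maps to TensorRT-LLM's PretrainedConfig; writing it once
        # is enough for this sketch.
        with open(os.path.join(export_dir, "config.json"), "w") as f:
            json.dump(tensorrt_llm_config, f, indent=2)
    # Each yielded weights dict holds this rank's tensors and scaling factors.
    save_file(
        {name: t.contiguous().cpu() for name, t in weights.items()},
        os.path.join(export_dir, f"rank{rank}.safetensors"),
    )
```

This mirrors, at a lower level, what export_tensorrt_llm_checkpoint does when it writes the checkpoint directory itself.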