model_config_export

Code that exports optimized models to the TensorRT-LLM checkpoint format.

Functions

export_tensorrt_llm_checkpoint

Exports the torch model to a TensorRT-LLM checkpoint and saves it to export_dir.

torch_to_tensorrt_llm_checkpoint

Converts the torch model to the TensorRT-LLM checkpoint per GPU rank.

export_tensorrt_llm_checkpoint(model, decoder_type, dtype=torch.float16, export_dir='/tmp', inference_tensor_parallel=0, inference_pipeline_parallel=1, export_npz=False, naive_fp8_quantization=False, use_nfs_workspace=False)

Exports the torch model to a TensorRT-LLM checkpoint and saves it to export_dir.

Parameters:
  • model (Module) – the torch model.

  • decoder_type (str) – the type of the decoder, e.g. gpt2, gptj, llama or gptnext.

  • dtype (dtype) – the data type used to export the weights of unquantized layers.

  • export_dir (Path | str) – the target export path.

  • inference_tensor_parallel (int) – The target inference-time tensor parallelism. The calibration tensor parallelism will be merged or split to match this value. Default is 0, meaning the calibration parallelism is used as-is, without any manual config merge or split.

  • inference_pipeline_parallel (int) – The target inference-time pipeline parallelism. The calibration pipeline parallelism will be merged or split to match this value. Default is 1, meaning no pipeline parallelism.

  • export_npz (bool) – Whether or not to export the model_config to the old NPZ format for backward compatibility.

  • naive_fp8_quantization (bool) – Quantize the model naively to FP8 without calibration. All scaling factors are set to 1.

  • use_nfs_workspace (bool) – if True, an NFS workspace will be created under export_dir and used as shared storage for cross-process/node communication.

For tensorrt_llm deployment, the representation is saved under export_dir. We will save the model_config as two files: a JSON file that maps to the PretrainedConfig in TensorRT-LLM, and a safetensors file containing the weights and scaling factors for each rank.
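
A minimal usage sketch. It assumes a Hugging Face LLaMA checkpoint, that the model has already been quantized/calibrated upstream (that step is omitted here), and that the function is importable from modelopt.torch.export; adapt the model name and paths to your setup.

    # Sketch: export an already-quantized LLaMA model to a TensorRT-LLM checkpoint.
    import torch
    from transformers import AutoModelForCausalLM

    from modelopt.torch.export import export_tensorrt_llm_checkpoint

    # Assumed: a causal LM that has already been quantized/calibrated upstream.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
    ).cuda()

    with torch.inference_mode():
        export_tensorrt_llm_checkpoint(
            model,
            decoder_type="llama",            # matches the model architecture
            dtype=torch.float16,             # dtype for the unquantized layers
            export_dir="/tmp/trtllm_ckpt",   # checkpoint output directory
            inference_tensor_parallel=2,     # reshard for 2-way TP at inference
            inference_pipeline_parallel=1,   # no pipeline parallelism
        )

The exported directory can then be passed to the TensorRT-LLM build step as the checkpoint directory.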

torch_to_tensorrt_llm_checkpoint(model, decoder_type, dtype=torch.float16, inference_tensor_parallel=0, inference_pipeline_parallel=1, export_npz=False, naive_fp8_quantization=False, workspace_path=None)

Converts the torch model to the TensorRT-LLM checkpoint per GPU rank.

The TensorRT-LLM checkpoint is the LLM model format consumed by the TensorRT-LLM build API for the engine building process. https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/architecture/checkpoint.md

Parameters:
  • model (Module) – the torch model.

  • decoder_type (str) – the type of the decoder, e.g. gpt2, gptj, llama or gptnext.

  • dtype (dtype) – the data type used to export the weights of unquantized layers.

  • inference_tensor_parallel (int) – The target inference-time tensor parallelism. The calibration tensor parallelism will be merged or split to match this value. Default is 0, meaning the calibration parallelism is used as-is, without any manual config merge or split.

  • inference_pipeline_parallel (int) – The target inference-time pipeline parallelism. The calibration pipeline parallelism will be merged or split to match this value. Default is 1, meaning no pipeline parallelism.

  • export_npz (bool) – Whether or not to export the model_config to the old NPZ format for backward compatibility.

  • naive_fp8_quantization (bool) – Quantize the model naively to FP8 without calibration. All scaling factors are set to 1.

  • workspace_path (Path | str | None) – the path to the NFS directory used for cross-rank communication during postprocessing.

Yields:
A tuple of

  • tensorrt_llm_config – A dict that maps to the PretrainedConfig in TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/models/modeling_utils.py

  • weights – A dict that stores all model weights and scaling factors for each rank.

  • per_layer_quantization – A dict that contains layer-wise quantization information for all quantized layers for mixed_precision; an empty dictionary otherwise.

Return type:

Iterator[Tuple[Dict[str, Any], Dict[str, Tensor], Dict[str, Any]]]
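
A hedged sketch of consuming the per-rank generator directly, for example to inspect or custom-save each rank's output. The model name, import path modelopt.torch.export.model_config_export, and the config.json/rank{N}.safetensors file layout are assumptions, and the quantization/calibration step is again omitted.

    # Sketch: iterate the per-rank checkpoint generator and save each rank yourself.
    import json
    from pathlib import Path

    import torch
    from safetensors.torch import save_file
    from transformers import AutoModelForCausalLM

    from modelopt.torch.export.model_config_export import torch_to_tensorrt_llm_checkpoint

    # Assumed: an already-quantized/calibrated causal LM.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
    ).cuda()

    export_dir = Path("/tmp/trtllm_ckpt")
    export_dir.mkdir(parents=True, exist_ok=True)

    for rank, (tensorrt_llm_config, weights, per_layer_quantization) in enumerate(
        torch_to_tensorrt_llm_checkpoint(
            model,
            decoder_type="llama",
            dtype=torch.float16,
            inference_tensor_parallel=2,
        )
    ):
        # tensorrt_llm_config maps to TensorRT-LLM's PretrainedConfig;
        # assumed JSON-serializable, rewritten on each iteration for simplicity.
        (export_dir / "config.json").write_text(json.dumps(tensorrt_llm_config, indent=2))
        # weights holds this rank's tensors and scaling factors.
        save_file(weights, str(export_dir / f"rank{rank}.safetensors"))
        # per_layer_quantization is non-empty only for mixed_precision exports.

In most cases export_tensorrt_llm_checkpoint above is the simpler entry point; iterating the generator is mainly useful when you need per-rank control over how the checkpoint is written.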