Model Optimizer Changelog
0.13 (2024-06-14)
Backward Breaking Changes
- PTQ examples have been upgraded to use TensorRT-LLM 0.10.
New Features
- Added TensorRT-LLM checkpoint export support for Medusa decoding (official `MedusaModel` and Megatron Core `GPTModel`).
- Enabled support for Mixtral, RecurrentGemma, StarCoder, and Qwen in PTQ examples.
- Added TensorRT-LLM checkpoint export and engine building support for sparse models.
- Import scales from the TensorRT calibration cache and use them for quantization.
- (Experimental) Enabled low-GPU-memory FP8 calibration for Hugging Face models when the original model does not fit into GPU memory (a minimal calibration sketch follows this list).
- (Experimental) Support exporting an FP8-calibrated model for vLLM deployment.
- (Experimental) Added Python 3.12 support.
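For orientation, a typical FP8 PTQ calibration call in `modelopt.torch.quantization` looks like the sketch below. The model name, prompts, and the choice of the `FP8_DEFAULT_CFG` recipe are illustrative assumptions, not part of this release's changes:

```python
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; any Hugging Face causal LM supported by ModelOpt works.
model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

def forward_loop(model):
    # Run a few representative prompts so the FP8 scales can be calibrated.
    for prompt in ["Hello, world!", "Post-training quantization example."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# FP8_DEFAULT_CFG selects the default FP8 quantization recipe.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```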
0.11 (2024-05-07)
Backward Breaking Changes
- [!!!] The package was renamed from `ammo` to `modelopt`. The new full product name is NVIDIA TensorRT Model Optimizer. PLEASE CHANGE ALL YOUR REFERENCES FROM `ammo` to `modelopt`, including any paths and links!
- The default installation, `pip install nvidia-modelopt`, now installs only the minimal core dependencies. The following optional dependencies are available depending on the features being used: `[deploy]`, `[onnx]`, `[torch]`, `[hf]`. To install all dependencies, use `pip install "nvidia-modelopt[all]"`.
- Deprecated the `inference_gpus` argument in `modelopt.torch.export.model_config_export.torch_to_tensorrt_llm_checkpoint`; use `inference_tensor_parallel` instead (see the migration sketch after this list).
- The experimental `modelopt.torch.deploy` module is now available as `modelopt.torch._deploy`.
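A minimal migration sketch for the deprecated argument is shown below. Only the module path and the two argument names come from this changelog; the model object, `decoder_type`, and `export_dir` values are placeholders:

```python
from modelopt.torch.export.model_config_export import torch_to_tensorrt_llm_checkpoint

model = ...  # a (quantized) PyTorch model, obtained elsewhere

# Before (deprecated):
#   torch_to_tensorrt_llm_checkpoint(model, decoder_type="llama",
#                                    export_dir="ckpt_dir", inference_gpus=2)

# After: the inference-time tensor parallelism is now named explicitly.
torch_to_tensorrt_llm_checkpoint(
    model,
    decoder_type="llama",         # illustrative decoder type
    export_dir="ckpt_dir",        # placeholder output directory
    inference_tensor_parallel=2,  # replaces the deprecated inference_gpus
)
```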
New Features
- `modelopt.torch.sparsity` now supports sparsity-aware training (SAT). Both SAT and post-training sparsification support chaining with other modes, e.g., SAT + QAT.
- `modelopt.torch.quantization` natively supports distributed data and tensor parallelism while estimating quantization parameters. The data and tensor parallel groups need to be registered with the `modelopt.torch.utils.distributed.set_data_parallel_group` and `modelopt.torch.utils.distributed.set_tensor_parallel_group` APIs. By default, the data parallel group is set to the default distributed group and the tensor parallel group is disabled (see the sketch after this list).
- `modelopt.torch.opt` now supports chaining multiple optimization techniques that each require modifications to the same model, e.g., you can now sparsify and quantize a model at the same time.
- `modelopt.onnx.quantization` supports the FLOAT8 quantization format with the Distribution calibration algorithm.
- Native support of `modelopt.torch.opt` with FSDP (Fully Sharded Data Parallel) for `torch>=2.1`. This includes sparsity, quantization, and any other model modification & optimization.
- Added FP8 ONNX quantization support in `modelopt.onnx.quantization`.
- Added Windows (`win_amd64`) support for ModelOpt released wheels. Currently supported for the `modelopt.onnx` submodule only.
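The parallel-group registration mentioned above happens once at startup. Below is a minimal sketch assuming the ranks are arranged into 2-way tensor parallel groups; the group layout itself is an illustrative assumption, and only the two `set_*_group` APIs come from this release:

```python
import torch.distributed as dist
from modelopt.torch.utils import distributed as modelopt_dist

dist.init_process_group(backend="nccl")
world_size, rank = dist.get_world_size(), dist.get_rank()
tp_size = 2  # illustrative 2-way tensor parallelism

# torch.distributed requires every rank to call new_group for every group,
# so build all groups and keep the ones this rank belongs to.
tp_group = dp_group = None
for start in range(0, world_size, tp_size):
    ranks = list(range(start, start + tp_size))
    group = dist.new_group(ranks=ranks)
    if rank in ranks:
        tp_group = group
for offset in range(tp_size):
    ranks = list(range(offset, world_size, tp_size))
    group = dist.new_group(ranks=ranks)
    if rank in ranks:
        dp_group = group

# Register the groups so quantization-parameter estimation reduces across
# the right ranks (default: DP = default group, TP disabled).
modelopt_dist.set_data_parallel_group(dp_group)
modelopt_dist.set_tensor_parallel_group(tp_group)
```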
Bug Fixes
- Fixed a compatibility issue of `modelopt.torch.sparsity` with FSDP.
- Fixed an issue with dynamic dimension handling in `modelopt.onnx.quantization` with random calibration data.
- Fixed a graph node naming issue after the opset conversion operation.
- Fixed an issue with negative dimension handling, such as dynamic dimensions, in `modelopt.onnx.quantization` with random calibration data.
- Fixed input file handling to accept `.pb` files.
- Fixed an issue with copying extra data to a temporary folder during ONNX PTQ.