Model Optimizer Changelog

0.13 (2024-06-14)

New Features

  • Added TensorRT-LLM checkpoint export support for Medusa decoding (official MedusaModel and Megatron Core GPTModel).

  • Enabled support for Mixtral, RecurrentGemma, StarCoder, and Qwen in the PTQ examples.

  • Added TensorRT-LLM checkpoint export and engine-building support for sparse models.

  • Added support for importing scales from the TensorRT calibration cache and using them for quantization.

  • (Experimental) Enabled low-GPU-memory FP8 calibration for Hugging Face models when the original model does not fit into GPU memory (see the FP8 calibration sketch after this list).

  • (Experimental) Added support for exporting an FP8-calibrated model for vLLM deployment.

  • (Experimental) Added Python 3.12 support.
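
As an illustration of the experimental FP8 calibration flow above, here is a minimal sketch using the modelopt.torch.quantization PTQ API; the config name FP8_DEFAULT_CFG and the calibration dataloader are assumptions based on the public API, not part of this release note:

    import modelopt.torch.quantization as mtq

    def forward_loop(model):
        # Run a few calibration batches through the model so the quantizers
        # can collect activation statistics (amax values / scales).
        for batch in calib_dataloader:  # hypothetical calibration dataloader
            model(batch)

    # Replace supported layers with quantized equivalents and calibrate in FP8.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)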

0.11 (2024-05-07)

Backward Breaking Changes

  • [!!!] The package was renamed from ammo to modelopt. The new full product name is NVIDIA TensorRT Model Optimizer. Please update all references from ammo to modelopt, including any paths and links!

  • The default installation pip install nvidia-modelopt now installs only the minimal core dependencies. The following optional dependencies are available, depending on the features being used: [deploy], [onnx], [torch], [hf]. To install all dependencies, use pip install "nvidia-modelopt[all]".

  • Deprecated the inference_gpus argument in modelopt.torch.export.model_config_export.torch_to_tensorrt_llm_checkpoint; use inference_tensor_parallel instead (see the migration sketch after this list).

  • The experimental modelopt.torch.deploy module is now available as modelopt.torch._deploy.
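
A migration sketch for the deprecated inference_gpus argument; the model, decoder type, and export directory below are placeholders, and argument names other than inference_tensor_parallel and inference_gpus are assumed from the public export API:

    from modelopt.torch.export.model_config_export import torch_to_tensorrt_llm_checkpoint

    # Before (deprecated):
    # torch_to_tensorrt_llm_checkpoint(
    #     model, decoder_type="llama", export_dir=export_dir, inference_gpus=2)

    # After: express the deployment parallelism via inference_tensor_parallel.
    torch_to_tensorrt_llm_checkpoint(
        model, decoder_type="llama", export_dir=export_dir, inference_tensor_parallel=2)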

New Features

  • modelopt.torch.sparsity now supports sparsity-aware training (SAT). Both SAT and post-training sparsification support chaining with other modes, e.g. SAT + QAT.

  • modelopt.torch.quantization natively supports distributed data and tensor parallelism while estimating quantization parameters. The data and tensor parallel groups need to be registered via the modelopt.torch.utils.distributed.set_data_parallel_group and modelopt.torch.utils.distributed.set_tensor_parallel_group APIs (see the first sketch after this list). By default, the data parallel group is the default distributed group and the tensor parallel group is disabled.

  • modelopt.torch.opt now supports chaining multiple optimization techniques that each require modifications to the same model; e.g., you can now sparsify and quantize a model at the same time (see the chaining sketch after this list).

  • modelopt.onnx.quantization supports the FLOAT8 quantization format with the Distribution calibration algorithm.

  • Added native support for FSDP (Fully Sharded Data Parallel) in modelopt.torch.opt for torch>=2.1. This includes sparsity, quantization, and any other model modification & optimization.

  • Added FP8 ONNX quantization support in modelopt.onnx.quantization.

  • Added Windows (win_amd64) support for released ModelOpt wheels. Currently supported for the modelopt.onnx submodule only.
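
A minimal sketch of registering the parallel groups mentioned above before calibration; the process groups themselves are hypothetical and would be created by your training framework:

    import torch.distributed as dist
    from modelopt.torch.utils.distributed import (
        set_data_parallel_group,
        set_tensor_parallel_group,
    )

    # Hypothetical process groups created elsewhere, e.g. via dist.new_group(...).
    set_data_parallel_group(data_parallel_group)
    set_tensor_parallel_group(tensor_parallel_group)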
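
And a sketch of chaining sparsification with quantization on the same model; the mode name "sparse_magnitude", the INT8 config, and the calibration dataloader are assumptions, not confirmed by this release note:

    import modelopt.torch.quantization as mtq
    import modelopt.torch.sparsity as mts

    def forward_loop(model):
        # Calibration pass, as in the FP8 sketch above.
        for batch in calib_dataloader:  # hypothetical calibration dataloader
            model(batch)

    # Sparsify first, then quantize; modelopt.torch.opt tracks both
    # modifications on the same model.
    model = mts.sparsify(model, mode="sparse_magnitude")  # assumed mode name
    model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)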

Bug Fixes

  • Fixed a compatibility issue between modelopt.torch.sparsity and FSDP.

  • Fixed an issue with dynamic dimension handling in modelopt.onnx.quantization when using random calibration data.

  • Fixed a graph-node naming issue after the opset conversion operation.

  • Fixed an issue with negative dimension handling (similar to dynamic dimensions) in modelopt.onnx.quantization when using random calibration data.

  • Fixed input handling so that .pb files are accepted as input.

  • Fixed an issue with copying extra data to the temporary folder for ONNX PTQ.