Model Optimizer Changelog (Linux)
0.39 (2025-11-07)
New Features
- Add flag `op_types_to_exclude_fp16` in ONNX quantization to exclude ops from being converted to FP16/BF16. Alternatively, for custom TensorRT ops, this can also be done by indicating `'fp32'` precision in `trt_plugins_precision`.
- Add LoRA mode support for MCore in a new peft submodule: `modelopt.torch.peft.update_model(model, LORA_CFG)`. See the sketch after this list.
- Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See `examples/vllm_serve` for more details.
- Add support for `nemotron-post-training-dataset-v2` and `nemotron-post-training-dataset-v1` in `examples/llm_ptq`. Default to a mix of `cnn_dailymail` and `nemotron-post-training-dataset-v2` (gated dataset accessed using the `HF_TOKEN` environment variable) if no dataset is specified.
- Allow specifying `calib_seq` in `examples/llm_ptq` to set the maximum sequence length for calibration.
- Add support for MCore MoE PTQ/QAT/QAD.
- Add support for multi-node PTQ and export with FSDP2 in `examples/llm_ptq/multinode_ptq.py`. See `examples/llm_ptq/README.md` for more details.
- Add support for Nemotron Nano VL v1 & v2 models in the FP8/NVFP4 PTQ workflow.
- Add flags `nodes_to_include` and `op_types_to_include` in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.
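As a rough illustration of the new `modelopt.torch.peft` entry point, the sketch below attaches LoRA adapters to an MCore model. The `build_mcore_gpt_model` helper and the exact keys of `LORA_CFG` are assumptions for illustration; only the `update_model(model, LORA_CFG)` call shape is taken from this entry.

```python
# Hedged sketch: enable LoRA mode on an MCore model via the new peft submodule.
# `build_mcore_gpt_model` is a hypothetical stand-in for however you construct
# your Megatron Core model, and the LORA_CFG keys below are illustrative only.
import modelopt.torch.peft as mtpeft

model = build_mcore_gpt_model()  # hypothetical MCore GPT model builder

LORA_CFG = {
    # Illustrative adapter settings: which modules get adapters and their rank.
    "adapter_cfg": {"*linear*": {"rank": 32, "enable": True}},
}

mtpeft.update_model(model, LORA_CFG)  # adapters are attached in place
```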
Documentation
- Add general guidelines for Minitron pruning and distillation. See `examples/pruning/README.md` for more details.
0.37 (2025-10-08)
Deprecations
- Deprecated ModelOpt’s custom docker images. Please use the PyTorch, TensorRT-LLM, or TensorRT docker images directly, or refer to the installation guide for more details.
- Deprecated the `quantize_mode` argument in `examples/onnx_ptq/evaluate.py` in favor of strong typing. Use `engine_precision` instead.
- Deprecated TRT-LLM’s TRT backend in `examples/llm_ptq` and `examples/vlm_ptq`. The `build` and `benchmark` tasks are removed and replaced with `quant`. `engine_dir` is replaced with `checkpoint_dir` in `examples/llm_ptq` and `examples/vlm_ptq`. For performance evaluation, please use `trtllm-bench` directly.
- Removed the `--export_fmt` flag in `examples/llm_ptq`. By default we export to the unified Hugging Face checkpoint format.
- Deprecated `examples/vlm_eval` as it depends on the deprecated TRT-LLM TRT backend.
New Features
- `high_precision_dtype` now defaults to fp16 in ONNX quantization, i.e. the quantized output model weights are now FP16 by default. See the sketch after this list.
- Upgrade TensorRT-LLM dependency to 1.1.0rc2.
- Support Phi-4-multimodal and Qwen2.5-VL quantized HF checkpoint export in `examples/vlm_ptq`.
- Support storing and restoring Minitron pruning activations and scores for re-pruning without running the forward loop again.
- Add Minitron pruning example for the Megatron-LM framework. See `examples/megatron-lm` for more details.
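A minimal sketch of what the FP16 default means for an ONNX PTQ call follows. The keyword names shown (`onnx_path`, `calibration_data`, `high_precision_dtype`, `output_path`) are assumptions based on this entry and may not match the exact signature in your installed version.

```python
# Hedged sketch: ONNX PTQ where the non-quantized (high-precision) tensors now
# default to FP16. Keyword names are assumptions; adjust to the installed API.
import numpy as np
from modelopt.onnx.quantization import quantize

# A tiny, illustrative calibration feed keyed by the model's input name.
calibration_data = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}

quantize(
    onnx_path="model.onnx",            # FP32 source model (placeholder path)
    calibration_data=calibration_data,
    high_precision_dtype="fp16",       # now the default; shown only for clarity
    output_path="model.quant.onnx",    # placeholder output path
)
```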
0.35 (2025-09-04)
Deprecations
- Deprecate `torch<2.6` support.
- Deprecate NeMo 1.0 model support.
Bug Fixes
- Fix attention head ranking logic for pruning Megatron Core GPT models.
New Features
- ModelOpt now supports PTQ and QAT for GPT-OSS models. See `examples/gpt_oss` for an end-to-end PTQ/QAT example.
- Add support for QAT with HuggingFace + DeepSpeed. See `examples/gpt_oss` for an example.
- Add support for QAT with LoRA. The LoRA adapters can be folded into the base model after QAT and deployed just like a regular PTQ model. See `examples/gpt_oss` for an example.
- ModelOpt provides convenient trainers such as `QATTrainer`, `QADTrainer`, `KDTrainer`, and `QATSFTTrainer`, which inherit from the Huggingface trainers. ModelOpt trainers can be used as drop-in replacements for the corresponding Huggingface trainer. See usage examples in `examples/gpt_oss`, `examples/llm_qat`, or `examples/llm_distill`, and the sketch after this list.
- (Experimental) Add quantization support for custom TensorRT ops in ONNX models.
- Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
- Add tree decoding support for Megatron Eagle models.
- For most VLMs, quantization is now explicitly disabled for the vision part, and those modules are added to the `excluded_modules` during HF export.
- Add support for `mamba_num_heads`, `mamba_head_dim`, `hidden_size`, and `num_layers` pruning for Megatron Core Mamba or Hybrid Transformer Mamba models in `mcore_minitron` (previously `mcore_gpt_minitron`) mode.
- Add example for QAT/QAD training with LLaMA Factory. See `examples/llm_qat/llama_factory` for more details.
- Upgrade TensorRT-LLM dependency to 1.0.0rc6.
- Add unified HuggingFace model export support for quantized NVFP4 GPT-OSS models.
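The trainer classes above wrap ModelOpt's standard QAT recipe: quantize first, then fine-tune. A hedged sketch of that underlying flow using only public APIs follows; the tiny model, dataset slice, and FP8 config are placeholders, and the ModelOpt trainers bundle these steps (plus distillation for QAD/KD) for you.

```python
# Hedged sketch of the QAT flow the trainers above automate:
# insert quantizers with mtq.quantize, then fine-tune with a regular HF Trainer.
import modelopt.torch.quantization as mtq
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "sshleifer/tiny-gpt2"  # tiny placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
data = data.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                remove_columns=data.column_names)

def forward_loop(m):
    # Minimal calibration pass; real workflows feed representative batches.
    m(**tokenizer("calibration text", return_tensors="pt"))

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)  # insert quantizers

trainer = Trainer(  # a ModelOpt QAT trainer would bundle the steps above
    model=model,
    args=TrainingArguments(output_dir="qat_out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```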
0.33 (2025-07-14)
Backward Breaking Changes
- PyTorch dependencies for `modelopt.torch` features are no longer optional, and `pip install nvidia-modelopt` is now the same as `pip install nvidia-modelopt[torch]`.
New Features
- Upgrade TensorRT-LLM dependency to 0.20.
- Add new CNN QAT example to demonstrate how to use ModelOpt for QAT.
- Add support for ONNX models with custom TensorRT ops in AutoCast.
- Add quantization-aware distillation (QAD) support in the `llm_qat` example.
- Add support for BF16 in ONNX quantization.
- Add per-node calibration support in ONNX quantization.
- ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires `transformers>=4.52.0`. See the sketch after this list.
- Support quantization of FSDP2-wrapped models and add FSDP2 support in the `llm_qat` example.
- Add NeMo 2 Simplified Flow examples for quantization-aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.
- Fix a Qwen3 MoE model export issue.
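A minimal PTQ sketch for the tensor-parallel sharded Hugging Face support mentioned above; the model id, the `tp_plan="auto"` loading (a `transformers>=4.52` feature), and the one-prompt calibration loop are placeholders, not the exact recipe from `examples/llm_qat` or `examples/llm_ptq`.

```python
# Hedged sketch: PTQ of a tensor-parallel sharded HF model via mtq.quantize.
# Model id, sharding setup, and calibration loop are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq

model_id = "Qwen/Qwen2.5-7B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # tensor-parallel sharding; requires transformers>=4.52
)

def forward_loop(m):
    # Calibration: run a few representative prompts through the sharded model.
    batch = tokenizer("The quick brown fox jumps over the lazy dog.",
                      return_tensors="pt").to(m.device)
    m(**batch)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```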
0.31 (2025-06-04)
Backward Breaking Changes
- NeMo and Megatron-LM distributed checkpoints (`torch-dist`) stored with a legacy version can no longer be loaded. The remedy is to load the legacy distributed checkpoint with 0.29, store a `torch` checkpoint, and resume with 0.31 to convert it to the new format. The following changes only apply to storing and resuming distributed checkpoints:
  - `quantizer_state` of `TensorQuantizer` is now stored in `extra_state` of `QuantModule`, where it used to be stored in the sharded `modelopt_state`.
  - The dtype and shape of `amax` and `pre_quant_scale` stored in the distributed checkpoint are now restored. Previously, some dtypes and shapes were changed so that all decoder layers had a homogeneous structure in the checkpoint. Together with megatron.core-0.13, quantized models will store and resume distributed checkpoints in a heterogeneous format.
- The `auto_quantize` API now accepts a list of quantization config dicts as the list of quantization choices. This API previously accepted a list of strings of quantization format names and was therefore limited to pre-defined quantization formats unless hacks were used. With this change, users can easily use their own custom quantization formats for `auto_quantize`; see the sketch after this list. In addition, `quantization_formats` now excludes `None` (indicating "do not quantize") as a valid format, because `auto_quantize` internally always adds "do not quantize" as an option anyway.
- Model export config is refactored. The quant config in `hf_quant_config.json` is converted and saved to `config.json`. `hf_quant_config.json` will be deprecated soon.
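To illustrate the new calling convention, a hedged `auto_quantize` sketch follows: the quantization choices are now config dicts (pre-defined `mtq.*_CFG` dicts or your own) rather than format-name strings. The constraint value, data loader, and forward helper are placeholders, and the exact keyword names may differ between releases.

```python
# Hedged sketch: auto_quantize now takes quantization config dicts instead of
# format-name strings. All concrete values below are placeholders.
import modelopt.torch.quantization as mtq

def forward_step(model, batch):
    # One forward pass on a calibration batch; used by the format search.
    return model(**batch)

model, search_state = mtq.auto_quantize(
    model,                                # an already-loaded HF/MCore model
    constraints={"effective_bits": 4.8},  # search budget (placeholder value)
    quantization_formats=[
        mtq.NVFP4_DEFAULT_CFG,            # pre-defined config dict
        mtq.FP8_DEFAULT_CFG,              # custom dicts are now accepted too
    ],
    data_loader=calib_dataloader,         # placeholder calibration loader
    forward_step=forward_step,
)
```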
Deprecations
- Deprecate Python 3.9 support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.19.
- Add new model support in the `llm_ptq` example: Qwen3 MoE.
- ModelOpt now supports advanced quantization algorithms such as AWQ, SVDQuant, and SmoothQuant for CPU-offloaded Huggingface models.
- Add AutoCast tool to convert ONNX models to FP16 or BF16.
- Add `--low_memory_mode` flag in the llm_ptq example to initialize HF models with compressed weights and reduce the peak memory of PTQ and quantized checkpoint export.
- Support `NemotronHForCausalLM`, `Qwen3ForCausalLM`, `Qwen3MoeForCausalLM` Megatron Core model import/export (from/to HuggingFace).
0.29 (2025-05-08)
Backward Breaking Changes
- Refactor `SequentialQuantizer` to improve its implementation and maintainability while preserving its functionality.
Deprecations
- Deprecate `torch<2.4` support.
New Features
- Upgrade LLM examples to use TensorRT-LLM 0.18.
- Add new model support in the `llm_ptq` example: Gemma-3, Llama-Nemotron.
- Add INT8 real quantization support.
- Add an FP8 GEMM per-tensor quantization kernel for real quantization. After PTQ, you can leverage the `mtq.compress` API to accelerate evaluation of quantized models; see the sketch after this list.
- Use the shape of PyTorch parameters and buffers of `TensorQuantizer` to initialize them during restore. This makes quantized model restoring more robust.
- Support adding new custom quantization calibration algorithms. Please refer to `mtq.calibrate` or the custom calibration algorithm documentation for more details.
- Add EAGLE3 (`LlamaForCausalLMEagle3`) training and unified ModelOpt checkpoint export support for Megatron-LM.
- Add support for the `--override_shapes` flag in ONNX quantization: `--calibration_shapes` is reserved for the input shapes used during calibration, while `--override_shapes` is used to override the input shapes of the model with static shapes.
- Add support for UNet ONNX quantization.
- Enable the `concat_elimination` pass by default to improve the performance of quantized ONNX models.
- Enable the redundant Cast elimination pass by default in `moq.quantize`.
- Add new attribute `parallel_state` to `DynamicModule` to support distributed parallelism such as data parallel and tensor parallel.
- Add MXFP8 and NVFP4 quantized ONNX export support.
- Add new example for torch quantization to ONNX for MXFP8 and NVFP4 precision.
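A short sketch of the `mtq.compress` flow referenced above follows; the config choice and the calibration `forward_loop` are placeholders.

```python
# Hedged sketch: run PTQ, then compress the fake-quantized weights so that
# evaluation can use real low-precision kernels (e.g. the FP8 per-tensor GEMM).
import modelopt.torch.quantization as mtq

def forward_loop(m):
    # Placeholder calibration loop; feed a few representative batches here.
    for batch in calib_batches:
        m(**batch)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)  # PTQ
mtq.compress(model)  # pack weights into their real quantized representation
# The model can now be evaluated as usual with reduced memory and faster GEMMs.
```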
0.27 (2025-04-03)
Deprecations
- Deprecate real quantization configs; please use the `mtq.compress` API for model compression after quantization.
New Features
- Add new model support in the `llm_ptq` example: OpenAI Whisper. Experimental support: Llama4, QwQ, Qwen MoE.
- Add blockwise FP8 quantization support in unified model export.
- Add quantization support to the Transformer Engine Linear module.
- Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
- Store `modelopt_state` in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) differently to support distributed checkpoint resume with expert parallelism (EP). The legacy `modelopt_state` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
- Add a triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
- Add a new API, `mtq.compress`, for compressing model weights after quantization.
- Add an option to simplify the ONNX model before quantization is performed.
- Add FP4 KV cache support for unified HF and TensorRT-LLM export.
- Add speculative decoding support for Multi-Token Prediction (MTP) in Megatron Core models.
- (Experimental) Improve support for ONNX models with custom TensorRT ops:
  - Add support for the `--calibration_shapes` flag.
  - Add automatic type and shape tensor propagation for full ORT support with the TensorRT EP.
Known Issues
- Quantization of T5 models is broken. Please use `nvidia-modelopt==0.25.0` with `transformers<4.50` in the meantime.
0.25 (2025-03-03)
Deprecations
- Deprecate Torch 2.1 support.
- Deprecate the `humaneval` benchmark in the `llm_eval` examples. Please use the newly added `simple_eval` instead.
- Deprecate the `fp8_naive` quantization format in the `llm_ptq` examples. Please use `fp8` instead.
New Features
- Support fast Hadamard transform in `TensorQuantizer`. It can be used for rotation-based quantization methods, e.g. QuaRot. Users need to install the fast_hadamard_transform package to use this feature.
- Add affine quantization support for the KV cache, resolving the low accuracy issue in models such as Qwen2.5 and Phi-3/3.5.
- Add FSDP2 support. FSDP2 can now be used for QAT.
- Add LiveCodeBench and Simple Evals to the `llm_eval` examples.
- Saving the ModelOpt state in the unified HF export APIs is now disabled by default, i.e., a `save_modelopt_state` flag was added to the `export_hf_checkpoint` API and defaults to False. See the sketch after this list.
- Add FP8 and NVFP4 real quantization support with an LLM QLoRA example.
- `modelopt.deploy.llm.LLM` now supports the `tensorrt_llm._torch.LLM` backend for quantized HuggingFace checkpoints.
- Add an end-to-end AutoDeploy example for AutoQuant LLM models.
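A hedged sketch of the export change above; the `export_dir` argument name is an assumption, while the `save_modelopt_state` flag and its default come directly from this entry.

```python
# Hedged sketch: unified HF export after quantization. save_modelopt_state now
# defaults to False; set it to True only if you need to restore the ModelOpt
# quantizer state later instead of loading the checkpoint as a plain HF model.
from modelopt.torch.export import export_hf_checkpoint

export_hf_checkpoint(
    model,                        # a quantized HF model, e.g. after mtq.quantize
    export_dir="exported_ckpt",   # output directory (argument name assumed)
    save_modelopt_state=False,    # new flag; False is the default
)
```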
0.23 (2025-01-29)
Backward Breaking Changes
- Support TensorRT-LLM 0.17. Examples (e.g. the benchmark task in llm_ptq) may not be fully compatible with TensorRT-LLM 0.15.
- Nvidia TensorRT Model Optimizer has changed its LICENSE from NVIDIA Proprietary (library wheel) and MIT (examples) to Apache 2.0 in this first full OSS release.
- Deprecate Python 3.8, Torch 2.0, and CUDA 11.x support.
- ONNX Runtime dependency upgraded to 1.20, which no longer supports Python 3.9.
- In the Huggingface examples, `trust_remote_code` is now set to false by default and requires users to explicitly turn it on with the `--trust_remote_code` flag.
New Features
- Added OCP Microscaling Formats (MX) fake quantization support, including FP8 (E5M2, E4M3), FP6 (E3M2, E2M3), FP4, and INT8.
- Added NVFP4 quantization support for NVIDIA Blackwell GPUs along with updated examples.
- Allow exporting TensorRT-LLM checkpoints with a quantized lm_head. Quantizing lm_head can benefit smaller models at a potential cost of additional accuracy loss.
- TensorRT-LLM now supports MoE FP8 and w4a8_awq inference on SM89 (Ada) GPUs.
- New model support in the `llm_ptq` example: Llama 3.3, Phi 4.
- Added Minitron pruning support for NeMo 2.0 GPT models.
- Excluded modules in TensorRT-LLM export configs are now specified as wildcards.
- The unified Llama 3.1 FP8 Hugging Face checkpoints can be deployed on SGLang.
0.21 (2024-12-03)
Backward Breaking Changes
- Support TensorRT-LLM 0.15. Examples (e.g. the benchmark task in llm_ptq) may not be fully compatible with TensorRT-LLM 0.14.
- Remove the deprecated arg `export_npz` from the `mt.export.export_tensorrt_llm_checkpoint` API.
- Deprecate the `mt.export.export_to_vllm` API in favor of `mt.export.export_hf_checkpoint`.
- Rename decoder type `gptnext` to `gpt` in `llm_ptq` to align with the TensorRT-LLM model definition.
New Features
- Added new tutorial notebooks for Minitron pruning and distillation in the NVIDIA NeMo framework.
- New model support in the `llm_ptq` example: Minitron, Phi3.5 MOE.
- New model support in the `vlm_ptq` example: Llama3.2 (Mllama).
- `mt.export.export_tensorrt_llm_checkpoint` and `mt.export.export_hf_checkpoint` no longer require the `dtype` arg.
- Added an example to deploy and run a quantized FP8 Llama 3.1 8B Instruct model from the HuggingFace modelopt model hub on both TensorRT and vLLM.
Bug Fixes
- Improve Minitron pruning quality by avoiding a possible bf16 overflow in the importance calculation and by a minor change in the `hidden_size` importance ranking.
Misc
- Added deprecation warnings for Python 3.8, torch 2.0, and CUDA 11.x. Support will be dropped in the next release.
0.19 (2024-10-23)
Backward Breaking Changes
- Deprecated the summarize task in the `llm_ptq` example.
- Deprecated the `type` flag in huggingface_example.sh.
- Deprecated Python plugin support in ONNX.
- Support TensorRT-LLM 0.13. Examples are not compatible with TensorRT-LLM 0.12.
- The `mtq.auto_quantize` API has been updated. The API now accepts `forward_step` and `forward_backward_step` as arguments instead of `loss_func` and `collect_func`. Please see the API documentation for more details.
New Features
- ModelOpt is now compatible with SBSA aarch64 (e.g. GH200), except that ONNX PTQ with plugins is not supported.
- Add `effective_bits` as a constraint for `mtq.auto_quantize`.
- `lm_evaluation_harness` is fully integrated into ModelOpt, backed by TensorRT-LLM. `lm_evaluation_harness` benchmarks are now available in the examples for LLM accuracy evaluation.
- A new `--perf` flag is introduced in the `modelopt_to_tensorrt_llm.py` example to build engines with max perf.
- Users can choose the execution provider to run the calibration in ONNX quantization.
- Added automatic detection of custom ops in ONNX models using TensorRT plugins. This requires the `tensorrt` python package to be installed.
- Replaced `jax` with `cupy` for faster INT4 ONNX quantization.
- `mtq.auto_quantize` now supports search-based automatic quantization for NeMo & MCore models (in addition to HuggingFace models).
- Add `num_layers` and `hidden_size` pruning support for NeMo / Megatron-core models.
0.17 (2024-09-11)
Backward Breaking Changes
- Deprecated `torch<2.0` support.
- `modelopt.torch.utils.dataset_utils.get_dataset_dataloader()` now returns a key-value pair instead of the tensor.
New Features
- New APIs and examples: `modelopt.torch.prune` for pruning Conv, Linear, and Attention heads for NVIDIA Megatron-core GPT-style models (e.g. Llama 3), PyTorch Computer Vision models, and HuggingFace Bert/GPT-J models.
- New API: `modelopt.torch.distill` for knowledge distillation, along with guides and example.
- New example: HF BERT Prune, Distill & Quantize, showcasing how to chain pruning, distillation, and quantization to achieve the best performance on a given model.
- Added INT8/FP8 DQ-only support for ONNX models.
- New API: `modelopt.torch.speculative` for end-to-end support of Medusa models.
- Added Medusa QAT and end-to-end examples.
- ModelOpt now supports automatic save/restore of `modelopt_state` with the `.save_pretrained` and `.from_pretrained` APIs from Huggingface libraries, such as `transformers` and `diffusers`. This feature can be enabled by calling `mto.enable_huggingface_checkpointing()`; see the sketch after this list.
- ONNX FP8 quantization support with amax calibration.
- TensorRT-LLM dependency upgraded to 0.12.0. Huggingface tokenizer files are now also stored in the engine dir.
- The unified model export API `modelopt.torch.export.export_hf_checkpoint` supports exporting `fp8` and `int4_awq` quantized checkpoints with packed weights for Hugging Face models, with namings aligned with the original checkpoints. The exported `fp8` checkpoints can be deployed with both TensorRT-LLM and vLLM.
- Add int8 and fp8 quantization support for the FLUX.1-dev model.
- Add a Python-friendly TensorRT inference pipeline for diffusion models.
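The checkpointing feature above boils down to one extra call before using the usual Hugging Face save/load APIs. The sketch below uses a tiny placeholder model and a trivial calibration loop; only `mto.enable_huggingface_checkpointing()` and the save/restore behaviour are taken from this entry.

```python
# Hedged sketch: automatic modelopt_state save/restore through HF APIs.
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.opt as mto
import modelopt.torch.quantization as mtq

mto.enable_huggingface_checkpointing()  # patches save_pretrained/from_pretrained

model_id = "sshleifer/tiny-gpt2"  # tiny placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def forward_loop(m):
    m(**tokenizer("calibration text", return_tensors="pt"))  # placeholder calibration

model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
model.save_pretrained("quantized_ckpt")  # modelopt_state is stored alongside

# Restoring later re-applies the ModelOpt modifications automatically.
restored = AutoModelForCausalLM.from_pretrained("quantized_ckpt")
```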
Misc
- Added deprecation warning for `set_data_parallel_group` and `set_tensor_parallel_group`. These APIs are no longer needed for supporting distributed data and tensor parallelism in quantization. They will be removed in a future release.
0.15 (2024-07-25)
Backward Breaking Changes
- Deprecated `QuantDescriptor`. Use `QuantizerAttributeConfig` to configure `TensorQuantizer`. `set_from_attribute_config` can be used to set the quantizer attributes from the config class or attribute dictionary; see the sketch after this list. This change applies only to backend APIs; it is backward compatible if you are using only the `mtq.quantize` API.
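For the backend-API change above, a hedged sketch follows. The import paths and the `num_bits`/`axis` fields mirror the old `QuantDescriptor` options and are assumptions; only the class names and `set_from_attribute_config` are taken from this entry.

```python
# Hedged sketch: configure a TensorQuantizer with QuantizerAttributeConfig
# instead of the deprecated QuantDescriptor. Import paths and field names are
# assumptions; this only matters if you use the backend APIs directly.
from modelopt.torch.quantization.config import QuantizerAttributeConfig
from modelopt.torch.quantization.nn import TensorQuantizer

attr_cfg = QuantizerAttributeConfig(num_bits=8, axis=None)  # per-tensor 8-bit (assumed fields)
quantizer = TensorQuantizer(attr_cfg)

# Attributes can also be (re)set later from a config object or a plain dict:
quantizer.set_from_attribute_config({"num_bits": 8, "axis": None})
```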
New Features
- Added quantization support for the torch `RNN`, `LSTM`, and `GRU` modules. Only available for `torch>=2.0`.
- `modelopt.torch.quantization` now supports module-class-based quantizer attribute setting for the `mtq.quantize` API.
- Added new LLM PTQ example for the DBRX model.
- Added new LLM (Gemma 2) PTQ and TensorRT-LLM checkpoint export support.
- Added new LLM QAT example for the NVIDIA NeMo framework.
- TensorRT-LLM dependency upgraded to 0.11.0.
- (Experimental) `mtq.auto_quantize` API, which quantizes a model by searching for the best per-layer quantization formats.
- (Experimental) Added new LLM QLoRA example with NF4 and INT4_AWQ quantization.
- (Experimental) `modelopt.torch.export` now supports exporting quantized checkpoints with packed weights for Hugging Face models, with namings aligned with the original checkpoints.
- (Experimental) Added support for quantization of ONNX models with TensorRT plugin.
Misc
- Added deprecation warning for `torch<2.0`. Support will be dropped in the next release.
0.13 (2024-06-14)
Backward Breaking Changes
- PTQ examples have been upgraded to use TensorRT-LLM 0.10.
New Features
- Added TensorRT-LLM checkpoint export support for Medusa decoding (official `MedusaModel` and Megatron Core `GPTModel`).
- Enable support for mixtral, recurrentgemma, starcoder, and qwen in PTQ examples.
- Added TensorRT-LLM checkpoint export and engine building support for sparse models.
- Import scales from the TensorRT calibration cache and use them for quantization.
- (Experimental) Enable low-GPU-memory FP8 calibration for Hugging Face models when the original model size does not fit into GPU memory.
- (Experimental) Support exporting FP8 calibrated models to vLLM deployment.
- (Experimental) Python 3.12 support added.
0.11 (2024-05-07)
Backward Breaking Changes
- [!!!] The package was renamed from `ammo` to `modelopt`. The new full product name is Nvidia TensorRT Model Optimizer. PLEASE CHANGE ALL YOUR REFERENCES FROM `ammo` to `modelopt`, including any paths and links!
- The default installation `pip install nvidia-modelopt` will now only install minimal core dependencies. The following optional dependencies are available depending on the features that are being used: `[deploy]`, `[onnx]`, `[torch]`, `[hf]`. To install all dependencies, use `pip install "nvidia-modelopt[all]"`.
- Deprecated the `inference_gpus` arg in `modelopt.torch.export.model_config_export.torch_to_tensorrt_llm_checkpoint`. Users should use `inference_tensor_parallel` instead.
- The experimental `modelopt.torch.deploy` module is now available as `modelopt.torch._deploy`.
New Features
- `modelopt.torch.sparsity` now supports sparsity-aware training (SAT). Both SAT and post-training sparsification support chaining with other modes, e.g. SAT + QAT.
- `modelopt.torch.quantization` natively supports distributed data and tensor parallelism while estimating quantization parameters. The data and tensor parallel groups need to be registered with the `modelopt.torch.utils.distributed.set_data_parallel_group` and `modelopt.torch.utils.distributed.set_tensor_parallel_group` APIs. By default, the data parallel group is set as the default distributed group and the tensor parallel group is disabled.
- `modelopt.torch.opt` now supports chaining multiple optimization techniques that each require modifications to the same model, e.g., you can now sparsify and quantize a model at the same time.
- `modelopt.onnx.quantization` supports the FLOAT8 quantization format with the Distribution calibration algorithm.
- Native support of `modelopt.torch.opt` with FSDP (Fully Sharded Data Parallel) for `torch>=2.1`. This includes sparsity, quantization, and any other model modification & optimization.
- Added FP8 ONNX quantization support in `modelopt.onnx.quantization`.
- Added Windows (`win_amd64`) support for ModelOpt released wheels. Currently supported for the `modelopt.onnx` submodule only.
Bug Fixes
- Fixed the compatibility issue of `modelopt.torch.sparsity` with FSDP.
- Fixed an issue in dynamic dim handling in `modelopt.onnx.quantization` with random calibration data.
- Fixed a graph node naming issue after the opset conversion operation.
- Fixed an issue in handling negative dims, like dynamic dims, in `modelopt.onnx.quantization` with random calibration data.
- Fixed accepting `.pb` files as the input file.
- Fixed an issue with copying extra data to the tmp folder for ONNX PTQ.