Overview
NVIDIA TensorRT Model Optimizer
Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size. The NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library comprising state-of-the-art model optimization techniques, including quantization and sparsity, to compress models. It accepts a torch or ONNX model as input and provides Python APIs that let users easily stack different model optimization techniques to produce optimized and quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated by Model Optimizer is ready for deployment in downstream inference frameworks like TensorRT-LLM or TensorRT. Further integrations are planned for NVIDIA NeMo and Megatron-LM for training-in-the-loop optimization techniques. For enterprise users, 8-bit quantization with Stable Diffusion is also available on NVIDIA NIM.
Model Optimizer is available free of charge to all developers on NVIDIA PyPI. Visit the TensorRT Model Optimizer GitHub repository for end-to-end example scripts and recipes optimized for NVIDIA GPUs.
Techniques
Quantization
Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including FP8, INT8, and INT4, and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization with easy-to-use Python APIs. Both post-training quantization (PTQ) and quantization-aware training (QAT) are supported. Visit the Quantization Formats page for the list of supported formats.
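In practice, PTQ reduces to a short calibration loop plus a single quantize call. The following minimal sketch assumes a torch model; mtq.quantize() and the predefined mtq.INT8_SMOOTHQUANT_CFG config follow the library's documented usage, while get_model() and get_calibration_dataloader() are placeholder helpers.

    import modelopt.torch.quantization as mtq

    # Placeholders: any torch.nn.Module and a small calibration data loader.
    model = get_model()
    calib_loader = get_calibration_dataloader()

    # Calibration loop: run a few representative batches through the model
    # so that activation statistics can be collected for the chosen format.
    def forward_loop(model):
        for batch in calib_loader:
            model(batch)

    # Apply INT8 SmoothQuant PTQ; other predefined configs
    # (e.g. FP8 or INT4 AWQ) can be swapped in via the config argument.
    model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)

The returned model can then be fine-tuned further (QAT) or exported as a quantized checkpoint for deployment.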
Sparsity
Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate inference. Model Optimizer provides the Python API mts.sparsify() to automatically apply weight sparsity to a given model. The mts.sparsify() API supports the NVIDIA 2:4 sparsity pattern and various sparsification methods, such as NVIDIA ASP and SparseGPT. Both post-training sparsity (PTS) and sparsity-aware training (SAT) are supported; the latter workflow is recommended to minimize accuracy degradation.
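As an illustration, post-training sparsity is a one-call transformation. The sketch below uses a magnitude-based method; the "sparse_magnitude" mode string is an assumption for illustration, and get_model() is a placeholder (data-driven methods such as SparseGPT additionally take calibration data through a config argument).

    import modelopt.torch.sparsity as mts

    model = get_model()  # placeholder: any supported torch.nn.Module

    # Post-training sparsity (PTS): enforce the NVIDIA 2:4 pattern on
    # eligible weights using a magnitude-based criterion. The mode string
    # is an assumption; see the ModelOpt docs for the exact names.
    model = mts.sparsify(model, mode="sparse_magnitude")

For sparsity-aware training (SAT), the sparsified model can subsequently be fine-tuned so the remaining weights compensate for those that were removed.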
Distillation
Knowledge Distillation is the use of an existing pretrained “teacher” model to train a smaller, more efficient “student” model. It can increase accuracy and/or convergence speed over traditional training. The teacher’s feature maps and logits serve as the targets, and the student’s as the predictions, for a user-specified loss. Model Optimizer allows for minimally invasive integration of teacher-student Knowledge Distillation into an existing training pipeline using the mtd.convert() API.
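The sketch below shows one way such an integration might look. mtd.convert() is the API named above; the "kd_loss" mode, the config keys, LogitsDistillationLoss, and compute_kd_loss() reflect one plausible setup and should be treated as assumptions, with get_student_model() and get_teacher_model() as placeholders.

    import modelopt.torch.distill as mtd

    student = get_student_model()  # placeholder: model being trained
    teacher = get_teacher_model()  # placeholder: frozen pretrained model

    # Wrap the student so that teacher forward passes and the
    # distillation loss are handled inside the returned model.
    kd_config = {
        "teacher_model": teacher,
        "criterion": mtd.LogitsDistillationLoss(),  # soft-target loss on logits
    }
    distill_model = mtd.convert(student, mode=[("kd_loss", kd_config)])

    # Inside the existing training loop, the task loss is combined with
    # the distillation loss before backpropagation:
    #   loss = distill_model.compute_kd_loss(student_loss=task_loss)
    #   loss.backward()

Because the wrapper forwards to the student, the rest of the training pipeline can typically remain unchanged.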
Pruning
Pruning is a technique to reduce model size and accelerate inference by removing unnecessary weights. Model Optimizer provides the Python API mtp.prune() to prune Linear and Conv layers, as well as Transformer attention heads, MLP width, and model depth, using various state-of-the-art algorithms.
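A pruning call might look like the following minimal sketch. mtp.prune() is the API named above; the "fastnas" mode, the FLOPs constraint, and the score_func config key are assumptions for illustration, and get_model(), get_dummy_input(), and evaluate() are placeholders.

    import modelopt.torch.prune as mtp

    model = get_model()              # placeholder: torch.nn.Module
    dummy_input = get_dummy_input()  # placeholder: sample input for tracing

    def score_func(candidate):
        # Placeholder: return a validation metric (e.g. accuracy) used to
        # rank candidate subnetworks during the pruning search.
        return evaluate(candidate)

    # Search for a pruned subnetwork that meets a FLOPs budget. The mode
    # name, constraint key, and config entry are assumptions.
    pruned_model, _ = mtp.prune(
        model,
        mode="fastnas",
        constraints={"flops": "50%"},
        dummy_input=dummy_input,
        config={"score_func": score_func},
    )

After pruning, fine-tuning is typically required to recover accuracy.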