Overview

NVIDIA TensorRT Model Optimizer

Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size. The NVIDIA TensorRT Model Optimizer (referred to as Model Optimizer, or ModelOpt) is a library of state-of-the-art model optimization techniques, including quantization and sparsity, for compressing models. It accepts a torch or ONNX model as input and provides Python APIs that let users stack different model optimization techniques to produce a quantized checkpoint. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated by Model Optimizer is ready for deployment in downstream inference frameworks such as TensorRT-LLM or TensorRT. Further integrations are planned for NVIDIA NeMo and Megatron-LM for training-in-the-loop optimization techniques. For enterprise users, 8-bit quantization with Stable Diffusion is also available on NVIDIA NIM.

Model Optimizer is freely available to all developers on NVIDIA PyPI. Visit the /NVIDIA/TensorRT-Model-Optimizer repository for end-to-end example scripts and recipes optimized for NVIDIA GPUs.

Techniques

Quantization

Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats such as FP8, INT8, and INT4, and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization through easy-to-use Python APIs. Both post-training quantization (PTQ) and quantization-aware training (QAT) are supported. See the Quantization Format page for the list of supported formats.
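To make the size reduction concrete, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization. It is an illustration of the general idea only, not Model Optimizer's implementation or API: the helper names (`quantize_int8`, `dequantize`, `compression_ratio`) and the sample weights are hypothetical, and the 2x and 4x ratios assume a 16-bit baseline, which the text above does not state.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integer codes in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


def compression_ratio(source_bits, target_bits):
    """Storage ratio when every weight shrinks from source_bits to target_bits."""
    return source_bits / target_bits


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.53, 0.98, -1.27, 0.004, 0.61]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the per-weight error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print("max error:", max_error, "half step:", scale / 2)
print("16-bit -> 8-bit:", compression_ratio(16, 8))
print("16-bit -> 4-bit:", compression_ratio(16, 4))
```

Algorithms such as SmoothQuant and AWQ refine how scales are chosen so that this rounding error costs less accuracy. The sketch uses the simplest possible choice, a single scale derived from the largest absolute weight.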

Sparsity

Sparsity is a technique that further reduces the memory footprint of deep learning models and accelerates inference. Model Optimizer provides the Python API mts.sparsify() to apply weight sparsity to a given model. The mts.sparsify() API supports the NVIDIA 2:4 sparsity pattern and various sparsification methods, such as NVIDIA ASP and SparseGPT. It supports both post-training sparsity and sparsity with fine-tuning. The latter workflow is recommended to minimize accuracy degradation.
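The 2:4 pattern means that in every group of four consecutive weights, at most two are non-zero. The following pure-Python sketch shows that pattern using a plain magnitude criterion (keep the two largest-magnitude weights per group). It is an assumption-laden illustration, not the mts.sparsify() implementation: the function name `prune_2_to_4` and the sample row are hypothetical, and SparseGPT in particular selects weights differently.

```python
def prune_2_to_4(weights):
    """Zero the two smallest-magnitude weights in every group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest-magnitude weights in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return pruned


# Hypothetical weight row, chosen only for illustration.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
sparse_row = prune_2_to_4(row)
print(sparse_row)
```

Because half of each group is guaranteed to be zero, the surviving values can be stored compactly alongside small position indices. Fine-tuning after pruning, as recommended above, gives the remaining weights a chance to compensate for the removed ones.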