.. _Support_Matrix:

==============
Support Matrix
==============

Feature Support Matrix
======================

.. tab:: Linux

    .. list-table::
        :widths: 20 40 20 20
        :header-rows: 1
        :stub-columns: 1

        * - Quantization Format
          - Details
          - Supported Model Formats
          - Deployment
        * - FP4
          - * Per-Block FP4 Weights & Activations
            * GPUs: Blackwell and Later
          - PyTorch
          - TensorRT, TensorRT-LLM
        * - FP8
          - * Per-Tensor FP8 Weights & Activations
            * GPUs: Ada and Later
          - PyTorch, ONNX*
          - TensorRT*, TensorRT-LLM
        * - INT8
          - * Per-Channel INT8 Weights, Per-Tensor INT8 Activations
            * Uses SmoothQuant Algorithm
            * GPUs: Ampere and Later
          - PyTorch, ONNX*
          - TensorRT*, TensorRT-LLM
        * - W4A16 (INT4 Weights Only)
          - * Block-wise INT4 Weights, FP16 Activations
            * Uses AWQ Algorithm
            * GPUs: Ampere and Later
          - PyTorch, ONNX
          - TensorRT, TensorRT-LLM
        * - W4A8 (INT4 Weights, FP8 Activations)
          - * Block-wise INT4 Weights, Per-Tensor FP8 Activations
            * Uses AWQ Algorithm
            * GPUs: Ada and Later
          - PyTorch*, ONNX*
          - TensorRT-LLM

.. tab:: Windows

    .. list-table::
        :widths: 20 40 20 20
        :header-rows: 1
        :stub-columns: 1

        * - Quantization Format
          - Details
          - Supported Model Formats
          - Deployment
        * - W4A16 (INT4 Weights Only)
          - * Block-wise INT4 Weights, FP16 Activations
            * Uses AWQ Algorithm
            * GPUs: Ampere and Later
          - PyTorch*, ONNX
          - ORT-DML, ORT-CUDA, ORT-TRT-RTX, TensorRT*, TensorRT-LLM*
        * - W4A8 (INT4 Weights, FP8 Activations)
          - * Block-wise INT4 Weights, Per-Tensor FP8 Activations
            * Uses AWQ Algorithm
            * GPUs: Ada and Later
          - PyTorch*
          - TensorRT-LLM*
        * - FP8
          - * Per-Tensor FP8 Weights & Activations (PyTorch)
            * Per-Tensor Activations and Per-Channel Weights quantization (ONNX)
            * Uses Max calibration
            * GPUs: Ada and Later
          - PyTorch*, ONNX
          - TensorRT*, TensorRT-LLM*, ORT-CUDA
        * - INT8
          - * Per-Channel INT8 Weights, Per-Tensor INT8 Activations
            * Uses SmoothQuant (PyTorch)*, Max calibration (ONNX)
            * GPUs: Ada and Later
          - PyTorch*, ONNX
          - TensorRT*, TensorRT-LLM*, ORT-CUDA

.. note::

    - Features marked with an asterisk (*) are considered experimental.
    - ``ORT-CUDA``, ``ORT-DML``, and ``ORT-TRT-RTX`` are ONNX Runtime Execution Providers (EPs) for CUDA, DirectML, and TensorRT-RTX, respectively. Support for different deployment backends can vary across models.

Model Support Matrix
====================

.. tab:: Linux

    Please check out the model support matrix `here `_.

.. tab:: Windows

    Please check out the model support matrix `details `_.