.. _Support_Matrix:

==============
Support Matrix
==============

Feature Support Matrix
======================

.. tab:: Linux

    .. list-table::
        :widths: 20 40 20 20
        :header-rows: 1
        :stub-columns: 1

        * - Quantization Format
          - Details
          - Supported Model Formats
          - Deployment
        * - FP4
          - * Per-Block FP4 Weights & Activations
            * GPUs: Blackwell and Later
          - PyTorch
          - TensorRT, TensorRT-LLM
        * - FP8
          - * Per-Tensor FP8 Weights & Activations
            * GPUs: Ada and Later
          - PyTorch, ONNX*
          - TensorRT*, TensorRT-LLM
        * - INT8
          - * Per-Channel INT8 Weights, Per-Tensor INT8 Activations
            * Uses SmoothQuant Algorithm
            * GPUs: Ampere and Later
          - PyTorch, ONNX*
          - TensorRT*, TensorRT-LLM
        * - W4A16 (INT4 Weights Only)
          - * Block-wise INT4 Weights, FP16 Activations
            * Uses AWQ Algorithm
            * GPUs: Ampere and Later
          - PyTorch, ONNX
          - TensorRT, TensorRT-LLM
        * - W4A8 (INT4 Weights, FP8 Activations)
          - * Block-wise INT4 Weights, Per-Tensor FP8 Activations
            * Uses AWQ Algorithm
            * GPUs: Ada and Later
          - PyTorch*, ONNX*
          - TensorRT-LLM

.. tab:: Windows

    .. list-table::
        :widths: 20 40 20 20
        :header-rows: 1
        :stub-columns: 1

        * - Quantization Format
          - Details
          - Supported Model Formats
          - Deployment
        * - W4A16 (INT4 Weights Only)
          - * Block-wise INT4 Weights, FP16 Activations
            * Uses AWQ Algorithm
            * GPUs: Ampere and Later
          - PyTorch*, ONNX
          - ORT-DirectML, TensorRT*, TensorRT-LLM*
        * - W4A8 (INT4 Weights, FP8 Activations)
          - * Block-wise INT4 Weights, Per-Tensor FP8 Activations
            * Uses AWQ Algorithm
            * GPUs: Ada and Later
          - PyTorch*
          - TensorRT-LLM*
        * - FP8
          - * Per-Tensor FP8 Weights & Activations (PyTorch)
            * Per-Tensor Activation and Per-Channel Weight Quantization (ONNX)
            * Uses Max Calibration
            * GPUs: Ada and Later
          - PyTorch*, ONNX
          - TensorRT*, TensorRT-LLM*, ORT-CUDA
        * - INT8
          - * Per-Channel INT8 Weights, Per-Tensor INT8 Activations
            * Uses SmoothQuant (PyTorch)*, Max Calibration (ONNX)
            * GPUs: Ada and Later
          - PyTorch*, ONNX
          - TensorRT*, TensorRT-LLM*, ORT-CUDA

.. note:: Features marked with an asterisk (*) are considered experimental.

Model Support Matrix
====================

.. tab:: Linux

    Please check out the model support matrix `here `_.

.. tab:: Windows

    .. list-table::
        :header-rows: 1

        * - Model
          - ONNX INT4 AWQ (W4A16)
          - ONNX INT8 Max (W8A8)
          - ONNX FP8 Max (W8A8)
        * - Llama3.1-8B-Instruct
          - Yes
          - No
          - No
        * - Phi3.5-mini-Instruct
          - Yes
          - No
          - No
        * - Mistral-7B-Instruct-v0.3
          - Yes
          - No
          - No
        * - Llama3.2-3B-Instruct
          - Yes
          - No
          - No
        * - Gemma-2b-it
          - Yes
          - No
          - No
        * - Gemma-2-2b
          - Yes
          - No
          - No
        * - Gemma-2-9b
          - Yes
          - No
          - No
        * - Nemotron Mini 4B Instruct
          - Yes
          - No
          - No
        * - Qwen2.5-7B-Instruct
          - Yes
          - No
          - No
        * - DeepSeek-R1-Distill-Llama-8B
          - Yes
          - No
          - No
        * - DeepSeek-R1-Distill-Qwen-1.5B
          - Yes
          - No
          - No
        * - DeepSeek-R1-Distill-Qwen-7B
          - Yes
          - No
          - No
        * - DeepSeek-R1-Distill-Qwen-14B
          - Yes
          - No
          - No
        * - Mistral-NeMo-Minitron-2B-128k-Instruct
          - Yes
          - No
          - No
        * - Mistral-NeMo-Minitron-4B-128k-Instruct
          - Yes
          - No
          - No
        * - Mistral-NeMo-Minitron-8B-128k-Instruct
          - Yes
          - No
          - No
        * - whisper-large
          - No
          - Yes
          - Yes
        * - sam2-hiera-large
          - No
          - Yes
          - Yes

.. note::

    - ``ONNX INT8 Max`` means INT8 (W8A8) quantization of an ONNX model using Max calibration; the same holds for ``ONNX FP8 Max``.
    - Unless specified otherwise, the LLMs in the table above are `GenAI `_ built LLMs.
    - See the `examples `_ for specific instructions and scripts.
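As a rough illustration of how one of the formats in the feature support matrix above can be applied, the sketch below performs post-training quantization with the ``modelopt.torch.quantization`` API. The toy model, the calibration batches, and the ``FP8_DEFAULT_CFG`` config name used here are illustrative assumptions; consult the examples linked above for the exact configs and workflows supported in your release.

.. code-block:: python

    # Minimal sketch (not an official recipe): simulated FP8 per-tensor
    # quantization of a toy PyTorch model with the nvidia-modelopt package.
    import torch
    import modelopt.torch.quantization as mtq

    # Toy stand-in model and calibration batches -- replace with your own.
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, 8),
    )
    calib_data = [torch.randn(4, 64) for _ in range(8)]

    def forward_loop(m):
        # Run representative inputs through the model so activation ranges
        # can be calibrated (Max calibration for FP8).
        for batch in calib_data:
            m(batch)

    # Assumed config name; other formats in the matrix map to configs such as
    # INT8_SMOOTHQUANT_CFG or INT4_AWQ_CFG in the same module.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

After quantization, the model can be exported to one of the deployment targets listed in the matrix (for example TensorRT or TensorRT-LLM).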