Feature Support Matrix

Quantization Techniques - Windows

| Quantization Format | Details | Supported Model Formats | Deployment |
|---|---|---|---|
| W4A16 (INT4 weights only) | Block-wise INT4 weights, FP16 activations; uses the AWQ algorithm; GPUs: Ampere and later | PyTorch*, ONNX | ORT-DirectML, TensorRT*, TensorRT-LLM* |
| W4A8 (INT4 weights, FP8 activations) | Block-wise INT4 weights, per-tensor FP8 activations; uses the AWQ algorithm; GPUs: Ada and later | PyTorch* | TensorRT-LLM* |
| FP8 | Per-tensor FP8 weights and activations; GPUs: Ada and later | PyTorch*, ONNX* | TensorRT*, TensorRT-LLM* |
| INT8 | Per-channel INT8 weights, per-tensor INT8 activations; uses the SmoothQuant algorithm; GPUs: Ada and later | PyTorch*, ONNX* | TensorRT*, TensorRT-LLM* |
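To make the INT8 row concrete, here is a minimal NumPy sketch of the core SmoothQuant idea: a per-channel scale migrates activation outliers into the weights so that per-tensor INT8 activation quantization loses less information, while the matrix product is mathematically unchanged. The function name, `alpha = 0.5`, and the shapes are illustrative assumptions, not any library's actual API.

```python
# Sketch of SmoothQuant-style scale migration (illustrative, not a library API).
# For Y = X @ W, pick per-channel scales s so that (X / s) @ (diag(s) @ W) == Y:
# channels with activation outliers get a large s, shrinking the activation
# range at the cost of a slightly harder-to-quantize weight matrix.
import numpy as np

def smoothquant_scales(x_absmax, w_absmax, alpha=0.5):
    """Per-channel migration scales s_j = x_max_j**alpha / w_max_j**(1 - alpha)."""
    return (x_absmax ** alpha) / (w_absmax ** (1 - alpha))

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))   # activations: (tokens, channels)
w = rng.standard_normal((8, 3))   # weights: (channels, out_features)
x[:, 0] *= 50.0                   # simulate one outlier activation channel

s = smoothquant_scales(np.abs(x).max(axis=0), np.abs(w).max(axis=1))
x_s = x / s            # smoothed activations, easier to quantize per-tensor
w_s = w * s[:, None]   # compensated weights, absorb the migrated scale
```

After migration, `x_s @ w_s` equals `x @ w` exactly (up to floating-point rounding), but the largest activation magnitude is much smaller, so a single per-tensor INT8 scale covers the activations with less clipping.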

Note: Features marked with an asterisk (*) are considered experimental.
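The W4A16 row above (block-wise INT4 weights with FP16 activations) can be sketched as follows. This is a self-contained NumPy illustration of the block-wise scheme only; the block size of 128, the symmetric rounding, and the function names are assumptions for the example, not the toolkit's implementation or the AWQ algorithm itself.

```python
# Illustrative block-wise INT4 weight-only quantization (the W4A16 idea):
# weights are split into fixed-size blocks, each block stores one scale,
# and every weight is stored as a 4-bit integer.
import numpy as np

def quantize_blockwise_int4(w: np.ndarray, block_size: int = 128):
    """Symmetric block-wise INT4 quantization of a flat weight array."""
    assert w.size % block_size == 0
    blocks = w.reshape(-1, block_size)
    # One scale per block: map the block's max magnitude onto the INT4 range.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_blockwise_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate FP weights; at runtime this feeds FP16 activations."""
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_blockwise_int4(w)
w_hat = dequantize_blockwise_int4(q, s)
# Per-element error is bounded by half a quantization step of that block.
```

Because each block carries its own scale, outliers in one block do not inflate the quantization step for the rest of the tensor, which is why block-wise schemes tolerate 4-bit weights far better than a single per-tensor scale would.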

Supported Models - Windows

| Model | ONNX INT4 AWQ |
|---|---|
| Llama3.1-8B-Instruct | Yes |
| Phi3.5-mini-Instruct | Yes |
| Mistral-7B-Instruct-v0.3 | Yes |
| Llama3.2-3B-Instruct | Yes |
| Gemma-2b-it | Yes |
| Nemotron Mini 4B Instruct | Yes |