# Experimental Quantization Package Design

`experimental.quantization` is a standalone checkpoint quantization package. It converts FP16 HuggingFace checkpoints into quantized HuggingFace-style checkpoints that the checkpoint-based loader can export directly.

For command-line usage, see Quantization.

## Design Goals

- Keep quantization separate from ONNX export.
- Preserve a HuggingFace-compatible checkpoint layout.
- Store quantization metadata in checkpoint config files so `llm_loader` can infer export behavior from the checkpoint.
- Share the same output contract for locally quantized and downloaded pre-quantized checkpoints.
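
For example, the metadata named in the third goal can live in a small sidecar file next to the weights. A minimal sketch, assuming the ModelOpt-style `hf_quant_config.json` convention (the field names here are an assumption, not the verified `llm_loader` schema):

```python
# Hypothetical metadata sidecar written alongside the quantized shards;
# the exact schema llm_loader reads may differ.
import json

metadata = {
    "quantization": {
        "quant_algo": "FP8",           # backbone weight/activation method
        "kv_cache_quant_algo": "FP8",  # marks the KV cache as fp8-quantized
    },
}

with open("hf_quant_config.json", "w") as f:
    json.dump(metadata, f, indent=2)
```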

## Package Layout

| Path | Role |
| --- | --- |
| `experimental/quantization/cli.py` | User-facing `llm` and `draft` commands |
| `experimental/quantization/quantize.py` | Model loading, ModelOpt configuration, calibration, and checkpoint writing |
| `experimental/quantization/utils.py` | Shared checkpoint and config helpers |
| `experimental/llm_loader/config.py` | Reads quantization metadata during ONNX export |
| `experimental/llm_loader/checkpoint/loader.py` | Loads and repacks quantized checkpoint tensors |

## Flow

```
FP16 checkpoint
  -> experimental.quantization
  -> quantized safetensors + quantization metadata
  -> llm_loader
  -> ONNX + runtime sidecars
```

The quantization package stops after writing the checkpoint. ONNX export, TensorRT engine build, and runtime execution remain owned by `llm_loader`, `llm_build`, and `llm_inference`.
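
As an illustration of the quantization step, here is a minimal sketch assuming ModelOpt's standard post-training quantization API (`mtq.quantize` plus `export_hf_checkpoint`); the paths and calibration data are placeholders, and the actual `quantize.py` wiring may differ:

```python
# Sketch only: ModelOpt post-training quantization of an FP16 HF checkpoint.
# "fp16-checkpoint" and "quantized-checkpoint" are placeholder paths.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("fp16-checkpoint", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("fp16-checkpoint")

calib_prompts = ["placeholder calibration text"]  # real runs use a proper calib set

def forward_loop(m):
    # Feed calibration data through the model so ModelOpt can observe
    # activation ranges before fixing quantization scales.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

# FP8_DEFAULT_CFG is one of ModelOpt's stock quantization configs.
mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Write a HuggingFace-style quantized checkpoint (safetensors shards plus
# quantization metadata) in the layout the artifact contract below describes.
export_hf_checkpoint(model, export_dir="quantized-checkpoint")
```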

## Artifact Contract

The output directory must contain:

- `config.json` with model architecture fields.
- Quantization metadata: a `quantization_config` block in `config.json`, a standalone `hf_quant_config.json`, or equivalent fields understood by `llm_loader`.
- One or more `.safetensors` checkpoint shards.
- Tokenizer and processor files needed by the runtime.
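
An illustrative layout that satisfies this contract (file names and shard count are examples only):

```
quantized-checkpoint/
├── config.json                       # architecture fields (+ quantization_config)
├── hf_quant_config.json              # quantization metadata
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── tokenizer.json
└── tokenizer_config.json
```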

`llm_loader` uses this metadata to select quantized linear layers, repack checkpoint tensors, and enable FP8 KV cache when the checkpoint marks KV cache quantization as `fp8`. No separate FP8 KV export flag is required in the `llm_loader` path.
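
A sketch of the kind of check this implies, assuming the ModelOpt-style field names shown earlier (this is not `llm_loader`'s verified parsing, which lives in `experimental/llm_loader/config.py`):

```python
# Hypothetical reader for checkpoint quantization metadata.
import json
from pathlib import Path

def read_quant_metadata(ckpt_dir: str) -> dict | None:
    path = Path(ckpt_dir) / "hf_quant_config.json"
    if not path.exists():
        return None  # unquantized checkpoint: take the plain FP16 export path
    quant = json.loads(path.read_text())["quantization"]
    return {
        "weight_algo": quant.get("quant_algo"),                     # e.g. "FP8"
        "fp8_kv_cache": quant.get("kv_cache_quant_algo") == "FP8",  # no extra flag needed
    }
```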

## Supported Methods

| Component | Methods |
| --- | --- |
| Backbone | `fp8`, `int4_awq`, `nvfp4`, `mxfp8`, `int8_sq` |
| LM head | `fp8`, `int4_awq`, `nvfp4`, `mxfp8` |
| KV cache | `fp8` |

Backbone and LM-head methods can be combined for mixed-precision checkpoints, for example an `int4_awq` backbone with an `fp8` LM head.
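
One way such a combination could be expressed, assuming the package builds on ModelOpt's stock config dicts (the `*lm_head*` override pattern and the FP8 attributes are ModelOpt conventions, not this package's verified interface; `model` and `forward_loop` are as in the Flow sketch above):

```python
# Sketch: int4_awq backbone with an fp8 LM head via a ModelOpt config override.
import copy
import modelopt.torch.quantization as mtq

cfg = copy.deepcopy(mtq.INT4_AWQ_CFG)  # stock int4_awq config for the backbone
# Stock configs typically leave lm_head unquantized; override its pattern.
# num_bits=(4, 3) is ModelOpt's encoding of FP8 E4M3, per-tensor (axis=None).
cfg["quant_cfg"]["*lm_head*"] = {"num_bits": (4, 3), "axis": None}

mtq.quantize(model, cfg, forward_loop)  # model/forward_loop from the Flow sketch
```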

## Limitations

- Audio and visual calibration are not implemented.
- Some model-specific legacy export workarounds are intentionally not included.
- This package produces checkpoints only; it does not build ONNX or TensorRT engines.