Experimental Quantization Package Design#
experimental.quantization is a standalone checkpoint quantization package. It converts FP16 HuggingFace checkpoints into quantized HuggingFace-style checkpoints that the checkpoint-based loader can export directly.
For command-line usage, see Quantization.
Design Goals#
Keep quantization separate from ONNX export.
Preserve a HuggingFace-compatible checkpoint layout.
Store quantization metadata in checkpoint config files so
llm_loadercan infer export behavior from the checkpoint.Share the same output contract for locally quantized and downloaded pre-quantized checkpoints.
Package Layout#
Path |
Role |
|---|---|
|
User-facing |
|
Model loading, ModelOpt configuration, calibration, and checkpoint writing |
|
Shared checkpoint and config helpers |
|
Reads quantization metadata during ONNX export |
|
Loads and repacks quantized checkpoint tensors |
Flow#
FP16 checkpoint
-> experimental.quantization
-> quantized safetensors + quantization metadata
-> llm_loader
-> ONNX + runtime sidecars
The quantization package stops after writing the checkpoint. ONNX export, TensorRT engine build, and runtime execution remain owned by llm_loader, llm_build, and llm_inference.
Artifact Contract#
The output directory must contain:
config.jsonwith model architecture fields.Quantization metadata such as
quantization_config,hf_quant_config.json, or equivalent fields understood byllm_loader.One or more
.safetensorscheckpoint shards.Tokenizer and processor files needed by the runtime.
llm_loader uses this metadata to select quantized linear layers, repack checkpoint tensors, and enable FP8 KV cache when the checkpoint marks KV cache quantization as fp8. No separate FP8 KV export flag is required in the llm_loader path.
Supported Methods#
Component |
Methods |
|---|---|
Backbone |
|
LM head |
|
KV cache |
|
Backbone and LM-head methods can be combined for mixed-precision checkpoints.
Limitations#
Audio and visual calibration are not implemented.
Some model-specific legacy export workarounds are intentionally not included.
This package produces checkpoints only. It does not build ONNX or TensorRT engines.