qdq_utils
Various utils to support inserting Q/DQ nodes.
Functions
| Function | Description |
|---|---|
| `apply_column_major_transformation` | Transpose quantized weights and scales in-place for column-major storage. |
| `cast_initializer_to_dtype` | Casts the initializer to the given dtype. |
| `get_quantized_tensors` | Get the names of all quantized tensors from an ONNX model. |
| `get_tensor_dtype` | Get the appropriate tensor dtype based on precision info and zero point presence. |
| `has_qdq_nodes` | Check if the ONNX graph already has QDQ nodes. |
| `insert_dq_nodes` | Insert new initializers and DQ nodes into graph. |
| `insert_pre_quant_scale_nodes` | Insert new mul nodes into graph. |
| `insert_qdq_nodes` | Insert scales and QDQ nodes into graph. |
| `insert_transpose_nodes_for_column_major` | Add a single Transpose node after each DequantizeLinear for column-major weights. |
| `make_gs_awq_scale` | Create a GraphSurgeon scale tensor from the given numpy array. |
| `make_gs_dequantize_node` | Create a GraphSurgeon Dequantize node. |
| `make_gs_dequantize_output` | Create a GraphSurgeon variable representing the output of a dequantize node. |
| `make_gs_pre_quant_scale_node` | Create a GraphSurgeon pre-quant scale (Mul) node. |
| `make_gs_pre_quant_scale_output` | Create a GraphSurgeon variable representing the output of a pre-quant scale node. |
| `make_gs_quantize_node` | Create a GraphSurgeon Quantize node. |
| `make_gs_quantize_output` | Create a GraphSurgeon variable representing the output of a quantize node. |
| `make_gs_quantized_weight` | Create a GraphSurgeon tensor from a quantized weight tensor. |
| `make_gs_scale` | Create a GraphSurgeon scale tensor from the given numpy array. |
| `make_gs_zp` | Create a GraphSurgeon zero-point tensor of all zeroes with the given shape. |
| `qdq_to_dq` | Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights. |
| `remove_graph_input_q` | Remove Q nodes from the inputs of a quantized ONNX model. |
| `remove_input_dq_and_output_q` | Remove DQ nodes from the input and Q from the output of quantized custom ops for TensorRT compatibility. |
| `replace_scale_values` | Replace scale values in the graph with values from calibration cache. |
| `replace_zero_scale_with_smallest_nonzero` | Replace zero scale values with smallest nonzero fp16 value in the ONNX model. |
| `update_attributes_for_per_channel_nodes` | Get the attributes for per-channel nodes. |
| `use_trt_qdq_ops` | Globally set node names to TRT custom names. |
| `validate_scale_shape_for_per_channel_nodes` | Validate the shape of the scale tensor for per-channel nodes. |
- apply_column_major_transformation(gemm_weights_quantized, scales)
Transpose quantized weights and scales in-place for column-major storage.
Note: After calling this function and inserting DQ nodes with axis=1, you should call insert_transpose_nodes_for_column_major() on the graph.
- Parameters:
gemm_weights_quantized (dict) – Dictionary mapping weight names to quantized weight arrays
scales (dict) – Dictionary mapping weight names to scale arrays
- Return type:
None
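The in-place transposition can be pictured with plain NumPy. This is an illustrative sketch only; the helper name and the 2-D-only handling are assumptions, not the library's actual implementation:

```python
import numpy as np

def transpose_for_column_major(gemm_weights_quantized: dict, scales: dict) -> None:
    # Illustrative: store each 2-D weight (and its matching 2-D scale)
    # transposed, so DQ nodes can later dequantize along axis=1.
    for name, w in gemm_weights_quantized.items():
        if w.ndim == 2:
            gemm_weights_quantized[name] = np.ascontiguousarray(w.T)
            s = scales.get(name)
            if s is not None and s.ndim == 2:
                scales[name] = np.ascontiguousarray(s.T)

weights = {"fc1": np.arange(6, dtype=np.int8).reshape(2, 3)}
scales = {"fc1": np.ones((2, 1), dtype=np.float32)}
transpose_for_column_major(weights, scales)
```

Both dictionaries are mutated in place, mirroring the `None` return type of the documented function.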
- cast_initializer_to_dtype(node, dtype, initializer_map)
Casts the initializer to the given dtype.
- Parameters:
node (NodeProto)
dtype (str)
initializer_map (dict[str, TensorProto])
- get_quantized_tensors(onnx_model)
Get the names of all quantized tensors from an ONNX model.
This function identifies all DequantizeLinear nodes in the ONNX model and extracts the names of tensors being dequantized (the first input of each DequantizeLinear node, excluding scale and zero-point inputs).
- Parameters:
onnx_model (ModelProto) – ONNX model protobuf to analyze
- Returns:
Set of tensor names that are inputs to DequantizeLinear nodes (i.e., the tensors being dequantized)
- Return type:
set[str]
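Conceptually the lookup is a single pass over the graph's nodes. A minimal stand-alone sketch (the `Node` stand-in and the function name are illustrative, not the module's API):

```python
from typing import NamedTuple

class Node(NamedTuple):
    # Minimal stand-in for an ONNX NodeProto: op type plus input names.
    op_type: str
    input: list

def get_quantized_tensor_names(nodes) -> set:
    # Collect the first input (the tensor being dequantized) of every
    # DequantizeLinear node; scale and zero-point inputs are skipped.
    return {n.input[0] for n in nodes if n.op_type == "DequantizeLinear" and n.input}

nodes = [
    Node("DequantizeLinear", ["w_q", "w_scale", "w_zp"]),
    Node("MatMul", ["x", "w_dq"]),
]
quantized = get_quantized_tensor_names(nodes)
```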
- get_tensor_dtype(num_bits=4, has_zero_point=False)
Get the appropriate tensor dtype based on precision info and zero point presence.
- Parameters:
num_bits (int) – Number of bits for quantization
has_zero_point (bool) – Whether the tensor has a zero point
- Returns:
ONNX tensor data type constant
- Return type:
int
- has_qdq_nodes(onnx_model)
Check if the ONNX graph already has QDQ nodes.
- Parameters:
onnx_model (ModelProto)
- insert_dq_nodes(graph, scales, quantized_weights, attributes=None, zero_points=None, layer_info=None)
Insert new initializers and DQ nodes into graph.
- Parameters:
graph (Graph) – The graph to modify.
scales (dict[str, ndarray]) – A map from ONNX initializer name to the desired scale factor for that initializer.
quantized_weights (dict[str, ndarray]) – A map from ONNX initializer name to the quantized weight tensor.
attributes (dict[str, Any] | None) – Optional attributes for the inserted DQ nodes.
zero_points (dict[str, ndarray] | None) – Optional map from ONNX initializer name to the zero-point tensor.
layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to precision (old format) or to a layer configuration dict (new format with precision, block_size, axis).
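The inserted DQ nodes follow the standard DequantizeLinear arithmetic, `(w_q - zero_point) * scale`, broadcast along the quantization axis. A NumPy reference computation (illustrative; the function name is an assumption):

```python
import numpy as np

def dequantize(w_q, scale, zero_point=None):
    # Reference DQ computation: (w_q - zp) * scale, with scale broadcast
    # per output channel (here along the last axis).
    zp = 0 if zero_point is None else zero_point
    return (w_q.astype(np.float32) - zp) * scale

w_q = np.array([[10, -20], [30, 40]], dtype=np.int8)
scale = np.array([0.1, 0.05], dtype=np.float32)  # per-channel scales
w = dequantize(w_q, scale)
```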
- insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)
Insert new mul nodes into graph.
- Parameters:
graph (Graph) – The graph to modify.
input_tensors (dict[str, str]) – A dictionary of weight tensor names mapped to corresponding input tensor names
pre_quant_scale (dict[str, ndarray]) – A map from ONNX input tensor name to corresponding pre-quant scale.
- insert_qdq_nodes(graph, scales, weight_map, layer_info=None)
Insert scales and QDQ nodes into graph.
- Parameters:
graph (Graph) – The graph to modify.
scales (dict[str, ndarray]) – A map from ONNX initializer name to desired scale factor for that initializer.
weight_map (dict[str, Tensor]) – A map from ONNX initializer name to graphsurgeon tensor.
layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to precision (old format) or to layer configuration dict (new format with precision, block_size, axis).
- insert_transpose_nodes_for_column_major(graph)
Add a single Transpose node after each DequantizeLinear for column-major weights.
This implements the simple transformation: A @ B = A @ ((B^T)^T) where B^T is stored in the DequantizeLinear node, and we add a Transpose node after DQ to recover B before the MatMul.
- Graph transformation:
Before: DQ(W) -> MatMul/Gemm
After: DQ(W^T) -> Transpose -> W -> MatMul/Gemm
- Parameters:
graph (Graph) – ONNX GraphSurgeon graph to modify in-place
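The identity behind this transformation is easy to check in NumPy: storing B transposed and re-transposing it after dequantization leaves the MatMul result unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8)).astype(np.float32)
B = rng.standard_normal((8, 16)).astype(np.float32)

B_stored = np.ascontiguousarray(B.T)  # what the DQ initializer holds (W^T)
B_recovered = B_stored.T              # what the inserted Transpose node restores
out = A @ B_recovered                 # the MatMul sees the original B again
```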
- make_gs_awq_scale(name, scale)
Create a GraphSurgeon scale tensor from the given numpy array.
name is the desired _basename_ of the tensor.
- Parameters:
name (str)
scale (ndarray)
- Return type:
Constant
- make_gs_dequantize_node(name, inputs, outputs, attributes=None)
Create a GraphSurgeon Dequantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
inputs (Sequence[Tensor])
outputs (Sequence[Tensor])
attributes (dict[str, Any] | None)
- Return type:
Node
- make_gs_dequantize_output(name, shape, dtype)
Create a GraphSurgeon variable representing the output of a dequantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
shape (Sequence[int])
dtype (dtype)
- Return type:
Variable
- make_gs_pre_quant_scale_node(name, inputs, outputs)
Create a GraphSurgeon pre-quant scale (Mul) node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
inputs (Sequence[Tensor])
outputs (Sequence[Tensor])
- Return type:
Node
- make_gs_pre_quant_scale_output(name, shape, dtype)
Create a GraphSurgeon variable representing the output of a pre-quant scale node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
shape (Sequence[int])
dtype (dtype)
- Return type:
Variable
- make_gs_quantize_node(name, inputs, outputs)
Create a GraphSurgeon Quantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
inputs (Sequence[Tensor])
outputs (Sequence[Tensor])
- Return type:
Node
- make_gs_quantize_output(name, shape, dtype)
Create a GraphSurgeon variable representing the output of a quantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
shape (Sequence[int])
dtype (onnx.TensorProto.DataType)
- Return type:
Variable
- make_gs_quantized_weight(name, wq, dtype)
Create a GraphSurgeon tensor from a quantized weight tensor.
name is the desired _basename_ of the tensor.
- Parameters:
name (str)
wq (ndarray)
- Return type:
Constant
- make_gs_scale(name, scale)
Create a GraphSurgeon scale tensor from the given numpy array.
name is the desired _basename_ of the tensor.
- Parameters:
name (str)
scale (ndarray)
- Return type:
Constant
- make_gs_zp(name, shape, dtype)
Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.
name is the desired _basename_ of the tensor.
- Parameters:
name (str)
shape (Sequence[int])
- Return type:
Constant
- qdq_to_dq(onnx_model)
Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.
This function converts a model with QDQ (QuantizeLinear-DequantizeLinear) nodes to a model with only DQ nodes for weights. It:
1. Converts FP32/FP16 weights to INT8/FP8
2. Updates the graph to maintain proper connections
3. Removes redundant cast nodes in the quantized model (an additional optimization for diffusers)
- Parameters:
onnx_model (ModelProto) – ONNX model protobuf to convert
- Returns:
ONNX model protobuf with only DQ nodes for weights
- Raises:
ValueError – If the model is invalid or conversion fails
RuntimeError – If graph operations fail
- Return type:
ModelProto
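Numerically, this conversion amounts to running each weight's QuantizeLinear step offline and keeping only the DequantizeLinear node in the graph. A hedged NumPy sketch for the INT8, per-tensor-scale case (names are illustrative, not the library's internals):

```python
import numpy as np

def fold_q_into_weight(w_fp, scale):
    # Run QuantizeLinear offline: the INT8 result becomes the stored
    # initializer, so only a DequantizeLinear node remains in the graph.
    return np.clip(np.round(w_fp / scale), -128, 127).astype(np.int8)

w = np.array([0.5, -1.0, 1.2], dtype=np.float32)
scale = np.float32(0.01)
w_q = fold_q_into_weight(w, scale)
w_dq = w_q.astype(np.float32) * scale  # what DQ reproduces at load time
```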
- remove_graph_input_q(onnx_model)
Remove Q nodes from the inputs of a quantized ONNX model.
This supports generating quantized models with low-precision graph I/O.
- Parameters:
onnx_model (ModelProto) – ONNX model protobuf to convert
- Returns:
ONNX model protobuf with only DQ in the inputs whenever possible.
- Raises:
ValueError – If the model is invalid or removal fails
RuntimeError – If graph operations fail
- Return type:
ModelProto
- remove_input_dq_and_output_q(onnx_model, quantizable_custom_ops)
Remove DQ nodes from the input and Q from the output of quantized custom ops for TensorRT compatibility.
TensorRT requires only Q nodes in the inputs and only DQ nodes in the outputs of custom ops. For more information, see https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html#q-dq-interaction-with-plugins
- Parameters:
onnx_model (ModelProto) – ONNX model protobuf to convert
quantizable_custom_ops (dict) – Dictionary of custom ops and the I/O indices at which to delete Q and DQ nodes as needed.
- Returns:
ONNX model protobuf with only Q in the inputs and only DQ in the outputs of custom ops.
- Raises:
ValueError – If the model is invalid or removal fails
RuntimeError – If graph operations fail
- Return type:
ModelProto
- replace_scale_values(graph, act_scales_dict)
Replace scale values in the graph with values from calibration cache.
- Parameters:
graph (GraphProto) – ONNX graph to modify
act_scales_dict (dict[str, float]) – Dictionary mapping scale tensor names to their new values
- Return type:
None
- replace_zero_scale_with_smallest_nonzero(onnx_model)
Replace zero scale values with smallest nonzero fp16 value in the ONNX model.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
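The replacement itself is a simple masked assignment. In this sketch the smallest positive fp16 subnormal is used as the stand-in value; whether the real helper picks the subnormal or the smallest normal fp16 value is an assumption here, and the function name is illustrative:

```python
import numpy as np

def replace_zero_scales(scales: np.ndarray) -> np.ndarray:
    # Replace exact-zero scales so DequantizeLinear never multiplies by zero.
    # Assumed choice: smallest positive fp16 subnormal (~5.96e-08).
    eps = np.finfo(np.float16).smallest_subnormal
    out = scales.copy()
    out[out == 0] = eps
    return out

fixed = replace_zero_scales(np.array([0.0, 0.25], dtype=np.float16))
```

The real function operates on scale initializers inside the `ModelProto` rather than on a bare array.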
- update_attributes_for_per_channel_nodes(attributes=None, num_bits=4)
Get the attributes for per-channel nodes.
- Parameters:
attributes (dict[str, Any] | None)
num_bits (int)
- Return type:
dict[str, Any] | None
- use_trt_qdq_ops()
Globally set node names to TRT custom names.
- validate_scale_shape_for_per_channel_nodes(scale, attrs=None, num_bits=4)
Validate the shape of the scale tensor for per-channel nodes.
- Parameters:
scale (ndarray)
attrs (dict[str, Any] | None)
num_bits (int)