qdq_utils

Various utils to support inserting Q/DQ nodes.

Functions

apply_column_major_transformation

Transpose quantized weights and scales in-place for column-major storage.

cast_initializer_to_dtype

Casts the initializer to the given dtype.

get_quantized_tensors

Get the names of all quantized tensors from an ONNX model.

get_tensor_dtype

Get the appropriate tensor dtype based on precision info and zero point presence.

has_qdq_nodes

Check if the ONNX graph already has QDQ nodes.

insert_dq_nodes

Insert new initializers and DQ nodes into the graph.

insert_pre_quant_scale_nodes

Insert new Mul nodes into the graph.

insert_qdq_nodes

Insert scales and QDQ nodes into the graph.

insert_transpose_nodes_for_column_major

Add a single Transpose node after each DequantizeLinear for column-major weights.

make_gs_awq_scale

Create a GraphSurgeon scale tensor from the given numpy array.

make_gs_dequantize_node

Create a GraphSurgeon Dequantize node.

make_gs_dequantize_output

Create a GraphSurgeon variable representing the output of a dequantize node.

make_gs_pre_quant_scale_node

Create a GraphSurgeon pre-quant scale (Mul) node.

make_gs_pre_quant_scale_output

Create a GraphSurgeon variable representing the output of a pre-quant scale node.

make_gs_quantize_node

Create a GraphSurgeon Quantize node.

make_gs_quantize_output

Create a GraphSurgeon variable representing the output of a quantize node.

make_gs_quantized_weight

Create a GraphSurgeon tensor from a quantized weight tensor.

make_gs_scale

Create a GraphSurgeon scale tensor from the given numpy array.

make_gs_zp

Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.

qdq_to_dq

Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.

remove_graph_input_q

Remove Q nodes from the inputs of a quantized ONNX model.

remove_input_dq_and_output_q

Remove DQ nodes from the inputs and Q nodes from the outputs of quantized custom ops for TensorRT compatibility.

replace_scale_values

Replace scale values in the graph with values from calibration cache.

replace_zero_scale_with_smallest_nonzero

Replace zero scale values with smallest nonzero fp16 value in the ONNX model.

update_attributes_for_per_channel_nodes

Update the attributes for per-channel nodes.

use_trt_qdq_ops

Globally set node names to TRT custom names.

validate_scale_shape_for_per_channel_nodes

Validate the shape of the scale tensor for per-channel nodes.

apply_column_major_transformation(gemm_weights_quantized, scales)

Transpose quantized weights and scales in-place for column-major storage.

Note: After calling this function and inserting DQ nodes with axis=1, you should call insert_transpose_nodes_for_column_major() on the graph.

Parameters:
  • gemm_weights_quantized (dict) – Dictionary mapping weight names to quantized weight arrays

  • scales (dict) – Dictionary mapping weight names to scale arrays

Return type:

None
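
A minimal end-to-end sketch of the column-major flow described in the note above, assuming the dictionaries gemm_weights_quantized and scales are already populated and that this module's functions are in scope (the import path and file names are assumptions):

    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("model.onnx"))  # file name illustrative

    # 1. Transpose each quantized weight and its scales in-place (W -> W^T).
    apply_column_major_transformation(gemm_weights_quantized, scales)

    # 2. Insert DQ nodes over the transposed weights.
    insert_dq_nodes(graph, scales, gemm_weights_quantized)

    # 3. Add a Transpose after each DQ so MatMul/Gemm still consumes W.
    insert_transpose_nodes_for_column_major(graph)

    onnx.save(gs.export_onnx(graph), "model_column_major.onnx")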

cast_initializer_to_dtype(node, dtype, initializer_map)

Casts the initializer to the given dtype.

Parameters:
  • node (NodeProto)

  • dtype (str)

  • initializer_map (dict[str, TensorProto])

get_quantized_tensors(onnx_model)

Get the names of all quantized tensors from an ONNX model.

This function identifies all DequantizeLinear nodes in the ONNX model and extracts the names of tensors being dequantized (the first input of each DequantizeLinear node, excluding scale and zero-point inputs).

Parameters:

onnx_model (ModelProto) – ONNX model protobuf to analyze

Returns:

Set of tensor names that are inputs to DequantizeLinear nodes (i.e., the tensors being dequantized)

Return type:

set[str]
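
A minimal usage sketch (the file name is illustrative):

    import onnx

    model = onnx.load("quantized_model.onnx")
    quantized = get_quantized_tensors(model)  # set[str]
    print(f"{len(quantized)} tensors feed a DequantizeLinear node")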

get_tensor_dtype(num_bits=4, has_zero_point=False)

Get the appropriate tensor dtype based on precision info and zero point presence.

Parameters:
  • num_bits (int) – Number of bits for quantization

  • has_zero_point (bool) – Whether the tensor has a zero point

Returns:

ONNX tensor data type constant

Return type:

int
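
Illustrative calls; the concrete return values depend on the implementation, so the constants in the comments below are assumptions, not guarantees:

    # Plausible ONNX TensorProto constants for each configuration.
    dtype_4bit = get_tensor_dtype(num_bits=4)                       # e.g. onnx.TensorProto.INT4
    dtype_8bit = get_tensor_dtype(num_bits=8, has_zero_point=True)  # e.g. onnx.TensorProto.UINT8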

has_qdq_nodes(onnx_model)

Check if the ONNX graph already has QDQ nodes.

Parameters:

onnx_model (ModelProto)

insert_dq_nodes(graph, scales, quantized_weights, attributes=None, zero_points=None, layer_info=None)

Insert new initializers and DQ nodes into the graph.

Parameters:
  • graph (Graph) – The graph to modify.

  • scales (dict[str, ndarray]) – A map from ONNX initializer name to the desired scale factor for that initializer.

  • quantized_weights (dict[str, ndarray]) – A map from ONNX initializer name to its quantized weight tensor.

  • attributes (dict[str, Any] | None) – Optional attributes to set on the inserted DQ nodes.

  • zero_points (dict[str, ndarray] | None) – Optional map from ONNX initializer name to its zero-point tensor.

  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to precision (old format) or to a layer configuration dict (new format with precision, block_size, axis).
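
A hypothetical sketch that wraps a single per-tensor INT8 initializer in a DQ node; the tensor names, shapes, and values are illustrative, and insert_dq_nodes is assumed to be in scope:

    import numpy as np
    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("model.onnx"))

    # One scale and one quantized weight array per initializer name.
    scales = {"fc1.weight": np.array(0.02, dtype=np.float32)}
    quantized_weights = {"fc1.weight": np.zeros((128, 64), dtype=np.int8)}

    insert_dq_nodes(graph, scales, quantized_weights)
    graph.cleanup().toposort()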

insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)

Insert new Mul nodes into the graph.

Parameters:
  • graph (Graph) – The graph to modify.

  • input_tensors (dict[str, str]) – A dictionary of weight tensor names mapped to corresponding input tensor names

  • pre_quant_scale (dict[str, ndarray]) – A map from ONNX input tensor name to corresponding pre-quant scale.
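
A hypothetical sketch, continuing with the graph from the previous example; the mapping keys and the scale shape are illustrative:

    import numpy as np

    # Weight tensor name -> name of the activation tensor feeding its consumer.
    input_tensors = {"fc1.weight": "fc1_input"}
    # Activation tensor name -> pre-quant scale to multiply it by.
    pre_quant_scale = {"fc1_input": np.full((64,), 0.5, dtype=np.float32)}

    insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)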

insert_qdq_nodes(graph, scales, weight_map, layer_info=None)

Insert scales and QDQ nodes into the graph.

Parameters:
  • graph (Graph) – The graph to modify.

  • scales (dict[str, ndarray]) – A map from ONNX initializer name to desired scale factor for that initializer.

  • weight_map (dict[str, Tensor]) – A map from ONNX initializer name to graphsurgeon tensor.

  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to precision (old format) or to layer configuration dict (new format with precision, block_size, axis).

insert_transpose_nodes_for_column_major(graph)

Add a single Transpose node after each DequantizeLinear for column-major weights.

This implements the simple transformation: A @ B = A @ ((B^T)^T) where B^T is stored in the DequantizeLinear node, and we add a Transpose node after DQ to recover B before the MatMul.

Graph transformation:

Before: DQ(W) -> MatMul/Gemm
After:  DQ(W^T) -> Transpose (recovers W) -> MatMul/Gemm

Parameters:

graph (Graph) – ONNX GraphSurgeon graph to modify in-place

make_gs_awq_scale(name, scale)

Create a GraphSurgeon scale tensor from the given numpy array.

name is the desired _basename_ of the tensor.

Parameters:
  • name (str)

  • scale (ndarray)

Return type:

Constant

make_gs_dequantize_node(name, inputs, outputs, attributes=None)

Create a GraphSurgeon Dequantize node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • inputs (Sequence[Tensor])

  • outputs (Sequence[Tensor])

  • attributes (dict[str, Any] | None)

Return type:

Node

make_gs_dequantize_output(name, shape, dtype)

Create a GraphSurgeon variable representing the output of a dequantize node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • shape (Sequence[int])

  • dtype (dtype)

Return type:

Variable

make_gs_pre_quant_scale_node(name, inputs, outputs)

Create a GraphSurgeon pre-quant scale (Mul) node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • inputs (Sequence[Tensor])

  • outputs (Sequence[Tensor])

Return type:

Node

make_gs_pre_quant_scale_output(name, shape, dtype)

Create a GraphSurgeon variable representing the output of a pre-quant scale node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • shape (Sequence[int])

  • dtype (dtype)

Return type:

Variable

make_gs_quantize_node(name, inputs, outputs)

Create a GraphSurgeon Quantize node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • inputs (Sequence[Tensor])

  • outputs (Sequence[Tensor])

Return type:

Node

make_gs_quantize_output(name, shape, dtype)

Create a GraphSurgeon variable representing the output of a quantize node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • shape (Sequence[int])

  • dtype (onnx.TensorProto.DataType)

Return type:

Variable

make_gs_quantized_weight(name, wq, dtype)

Create a GraphSurgeon tensor from a quantized weight tensor.

name is the desired _basename_ of the tensor.

Parameters:
  • name (str)

  • wq (ndarray)

  • dtype – Data type of the quantized weight tensor.

Return type:

Constant

make_gs_scale(name, scale)

Create a GraphSurgeon scale tensor from the given numpy array.

name is the desired _basename_ of the tensor.

Parameters:
  • name (str)

  • scale (ndarray)

Return type:

Constant

make_gs_zp(name, shape, dtype)

Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.

name is the desired _basename_ of the tensor.

Parameters:
  • name (str)

  • shape (Sequence[int])

  • dtype – Data type of the zero-point tensor.

Return type:

Constant

qdq_to_dq(onnx_model)

Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.

This function converts a model with QDQ (QuantizeLinear-DequantizeLinear) nodes to a model with only DQ nodes for weights. It:

  1. Converts FP32/FP16 weights to INT8/FP8
  2. Updates the graph to maintain proper connections
  3. Removes redundant Cast nodes in the quantized model (an additional optimization for diffusers)

Parameters:

onnx_model (ModelProto) – ONNX model protobuf to convert

Returns:

ONNX model protobuf with only DQ nodes for weights

Raises:
  • ValueError – If the model is invalid or conversion fails

  • RuntimeError – If graph operations fail

Return type:

ModelProto
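
A minimal usage sketch (file names are illustrative):

    import onnx

    model = onnx.load("model_qdq.onnx")
    dq_only_model = qdq_to_dq(model)
    onnx.save(dq_only_model, "model_dq_only.onnx")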

remove_graph_input_q(onnx_model)

Remove Q nodes from the inputs of a quantized ONNX model.

This supports generating quantized models with low-precision graph I/O.

Parameters:

onnx_model (ModelProto) – ONNX model protobuf to convert

Returns:

ONNX model protobuf with only DQ nodes at the inputs whenever possible.

Raises:
  • ValueError – If the model is invalid or removal fails

  • RuntimeError – If graph operations fail

Return type:

ModelProto

remove_input_dq_and_output_q(onnx_model, quantizable_custom_ops)

Remove DQ nodes from the inputs and Q nodes from the outputs of quantized custom ops for TensorRT compatibility.

TensorRT requires only Q nodes in the inputs and only DQ nodes in the outputs of custom ops. For more information, see https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html#q-dq-interaction-with-plugins

Parameters:
  • onnx_model (ModelProto) – ONNX model protobuf to convert

  • quantizable_custom_ops (dict) – Dictionary of custom ops and the I/O indices at which to perform Q and DQ deletions as needed.

Returns:

ONNX model protobuf with only Q nodes at the inputs and only DQ nodes at the outputs of custom ops.

Raises:
  • ValueError – If the model is invalid or removal fails

  • RuntimeError – If graph operations fail

Return type:

ModelProto
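
A hypothetical call, continuing from the previous sketch; the exact schema of quantizable_custom_ops (custom op type mapped to the input/output indices to clean up) is an assumption based on the parameter description:

    # Schema assumed for illustration, not taken from the source.
    quantizable_custom_ops = {"MyCustomPlugin": {"inputs": [0], "outputs": [0]}}
    model = remove_input_dq_and_output_q(model, quantizable_custom_ops)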

replace_scale_values(graph, act_scales_dict)

Replace scale values in the graph with values from calibration cache.

Parameters:
  • graph (GraphProto) – ONNX graph to modify

  • act_scales_dict (dict[str, float]) – Dictionary mapping scale tensor names to their new values

Return type:

None
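
A minimal sketch, assuming model is a loaded ModelProto; the scale tensor names are illustrative and must match scale initializers already present in the graph:

    # Scale tensor name -> calibrated scale value from the calibration cache.
    act_scales_dict = {"fc1_input_scale": 0.0123, "fc2_input_scale": 0.0456}
    replace_scale_values(model.graph, act_scales_dict)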

replace_zero_scale_with_smallest_nonzero(onnx_model)

Replace zero scale values with smallest nonzero fp16 value in the ONNX model.

Parameters:

onnx_model (ModelProto)

Return type:

ModelProto
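
For reference, the smallest nonzero fp16 value is the subnormal 2**-24, which can be checked with numpy:

    import numpy as np

    print(np.finfo(np.float16).smallest_subnormal)  # 6e-08, i.e. 2**-24

    model = replace_zero_scale_with_smallest_nonzero(model)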

update_attributes_for_per_channel_nodes(attributes=None, num_bits=4)

Update the attributes for per-channel nodes.

Parameters:
  • attributes (dict[str, Any] | None)

  • num_bits (int)

Return type:

dict[str, Any] | None

use_trt_qdq_ops()

Globally set node names to TRT custom names.
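
A short sketch of where this fits in the flow; the exact custom op names are TensorRT-specific, and graph, scales, and weight_map are assumed from the earlier examples:

    # Switch this module's node factories to TRT custom Q/DQ op names
    # before any Q/DQ nodes are created.
    use_trt_qdq_ops()
    insert_qdq_nodes(graph, scales, weight_map)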

validate_scale_shape_for_per_channel_nodes(scale, attrs=None, num_bits=4)

Validate the shape of the scale tensor for per-channel nodes.

Parameters:
  • scale (ndarray)

  • attrs (dict[str, Any] | None)

  • num_bits (int)