qdq_utils
Various utils to support inserting Q/DQ nodes.
Functions
- cast_initializer_to_dtype – Casts the initializer to the given dtype.
- get_tensor_dtype – Get the appropriate tensor dtype based on precision info and zero point presence.
- has_qdq_nodes – Check if the onnx graph already has QDQ nodes.
- insert_dq_nodes – Insert new initializers and DQ nodes into graph.
- insert_pre_quant_scale_nodes – Insert new mul nodes into graph.
- insert_qdq_nodes – Insert scales and QDQ nodes into graph.
- make_gs_awq_scale – Create a GraphSurgeon scale tensor from the given numpy array.
- make_gs_dequantize_node – Create a GraphSurgeon Dequantize node.
- make_gs_dequantize_output – Create a GraphSurgeon variable representing the output of a dequantize node.
- make_gs_pre_quant_scale_node – Create a GraphSurgeon pre-quant scale (Mul) node.
- make_gs_pre_quant_scale_output – Create a GraphSurgeon variable representing the output of a pre-quant scale node.
- make_gs_quantize_node – Create a GraphSurgeon Quantize node.
- make_gs_quantize_output – Create a GraphSurgeon variable representing the output of a quantize node.
- make_gs_quantized_weight – Create a GraphSurgeon tensor from a quantized weight tensor.
- make_gs_scale – Create a GraphSurgeon scale tensor from the given numpy array.
- make_gs_zp – Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.
- qdq_to_dq – Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.
- remove_graph_input_q – Remove Q nodes from the inputs of a quantized ONNX model.
- remove_input_dq_and_output_q – Remove DQ nodes from the inputs and Q nodes from the outputs of quantized custom ops for TensorRT compatibility.
- replace_scale_values – Replace scale values in the graph with values from the calibration cache.
- replace_zero_scale_with_smallest_nonzero – Replace zero scale values with the smallest nonzero fp16 value in the ONNX model.
- update_attributes_for_per_channel_nodes – Get the attributes for per-channel nodes.
- use_trt_qdq_ops – Globally set node names to TRT custom names.
- validate_scale_shape_for_per_channel_nodes – Validate the shape of the scale tensor for per-channel nodes.
- cast_initializer_to_dtype(node, dtype, initializer_map)
Casts the initializer to the given dtype.
- Parameters:
node (NodeProto)
dtype (str)
initializer_map (dict[str, TensorProto])
- get_tensor_dtype(num_bits=4, has_zero_point=False)
Get the appropriate tensor dtype based on precision info and zero point presence.
- Parameters:
num_bits (int) – Number of bits for quantization
has_zero_point (bool) – Whether the tensor has a zero point
- Returns:
ONNX tensor data type constant
- Return type:
int
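A minimal usage sketch, assuming the module is importable as modelopt.onnx.quantization.qdq_utils (adjust the import path to your installation) and that the returned constant is an onnx.TensorProto data type value as documented:

```python
import onnx

from modelopt.onnx.quantization.qdq_utils import get_tensor_dtype

# 4-bit weights without a zero point vs. 8-bit weights with one.
dtype_4bit = get_tensor_dtype(num_bits=4, has_zero_point=False)
dtype_8bit = get_tensor_dtype(num_bits=8, has_zero_point=True)

# ONNX tensor data types are plain integer enum constants.
print(onnx.TensorProto.DataType.Name(dtype_4bit))
print(onnx.TensorProto.DataType.Name(dtype_8bit))
```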
- has_qdq_nodes(onnx_model)
Check if the onnx graph already has QDQ nodes.
- Parameters:
onnx_model (ModelProto)
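A short guard sketch, using the same assumed import path and a placeholder model file:

```python
import onnx

from modelopt.onnx.quantization.qdq_utils import has_qdq_nodes

model = onnx.load("model.onnx")  # placeholder path
if has_qdq_nodes(model):
    print("Model already contains Q/DQ nodes; skipping insertion.")
```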
- insert_dq_nodes(graph, scales, quantized_weights, attributes=None, zero_points=None, layer_info=None)
Insert new initializers and DQ nodes into graph.
- Parameters:
graph (Graph) – The graph to modify.
scales (dict[str, ndarray]) – A map from ONNX initializer name to desired scale factor for that initializer.
quantized_weights (dict[str, ndarray]) – A map from ONNX initializer name to its quantized weight tensor.
attributes (dict[str, Any] | None) – Optional attributes for the inserted DQ nodes.
zero_points (dict[str, ndarray] | None) – Optional map from ONNX initializer name to its zero-point tensor.
layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to precision (old format) or to layer configuration dict (new format with precision, block_size, axis).
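A sketch of a DQ-only insertion flow; the tensor name, scale, and quantized weight values are placeholders:

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

from modelopt.onnx.quantization.qdq_utils import insert_dq_nodes

graph = gs.import_onnx(onnx.load("model.onnx"))

# Per-initializer scale factors and the corresponding pre-quantized weights.
scales = {"fc1.weight": np.array(0.02, dtype=np.float32)}
quantized_weights = {"fc1.weight": np.zeros((128, 64), dtype=np.int8)}

insert_dq_nodes(graph, scales, quantized_weights)
onnx.save(gs.export_onnx(graph), "model_dq.onnx")
```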
- insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)
Insert new mul nodes into graph.
- Parameters:
graph (Graph) – The graph to modify.
input_tensors (dict[str, str]) – A dictionary of weight tensor names mapped to corresponding input tensor names.
pre_quant_scale (dict[str, ndarray]) – A map from ONNX input tensor name to corresponding pre-quant scale.
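A sketch under the same assumptions; the tensor names are placeholders, and the mapping ties each weight to the activation tensor feeding it:

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

from modelopt.onnx.quantization.qdq_utils import insert_pre_quant_scale_nodes

graph = gs.import_onnx(onnx.load("model.onnx"))

# Weight tensor name -> name of the input (activation) tensor it consumes.
input_tensors = {"fc1.weight": "fc1_input"}
# Input tensor name -> pre-quant scale applied to that activation via a Mul node.
pre_quant_scale = {"fc1_input": np.ones(64, dtype=np.float32)}

insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)
```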
- insert_qdq_nodes(graph, scales, weight_map, layer_info=None)
Insert scales and QDQ nodes into graph.
- Parameters:
graph (Graph) – The graph to modify.
scales (dict[str, ndarray]) – A map from ONNX initializer name to desired scale factor for that initializer.
weight_map (dict[str, Tensor]) – A map from ONNX initializer name to graphsurgeon tensor.
layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to precision (old format) or to layer configuration dict (new format with precision, block_size, axis).
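A sketch that wraps every weight-like constant in Q/DQ; the name filter and the uniform scale value are placeholders:

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

from modelopt.onnx.quantization.qdq_utils import insert_qdq_nodes

graph = gs.import_onnx(onnx.load("model.onnx"))

# Collect the GraphSurgeon constants to quantize (placeholder name filter).
weight_map = {
    name: tensor
    for name, tensor in graph.tensors().items()
    if isinstance(tensor, gs.Constant) and name.endswith("weight")
}
scales = {name: np.array(0.02, dtype=np.float32) for name in weight_map}

insert_qdq_nodes(graph, scales, weight_map)
onnx.save(gs.export_onnx(graph), "model_qdq.onnx")
```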
- make_gs_awq_scale(name, scale)
Create a GraphSurgeon scale tensor from the given numpy array.
name is the desired _basename_ of the tensor.
- Parameters:
name (str)
scale (ndarray)
- Return type:
Constant
- make_gs_dequantize_node(name, inputs, outputs, attributes=None)
Create a GraphSurgeon Dequantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
inputs (Sequence[Tensor])
outputs (Sequence[Tensor])
attributes (dict[str, Any] | None)
- Return type:
Node
- make_gs_dequantize_output(name, shape, dtype)
Create a GraphSurgeon variable representing the output of a dequantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
shape (Sequence[int])
dtype (dtype)
- Return type:
Variable
- make_gs_pre_quant_scale_node(name, inputs, outputs)
Create a GraphSurgeon pre-quant scale (Mul) node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
inputs (Sequence[Tensor])
outputs (Sequence[Tensor])
- Return type:
Node
- make_gs_pre_quant_scale_output(name, shape, dtype)
Create a GraphSurgeon variable representing the output of a pre-quant scale node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
shape (Sequence[int])
dtype (dtype)
- Return type:
Variable
- make_gs_quantize_node(name, inputs, outputs)
Create a GraphSurgeon Quantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
inputs (Sequence[Tensor])
outputs (Sequence[Tensor])
- Return type:
Node
- make_gs_quantize_output(name, shape, dtype)
Create a GraphSurgeon variable representing the output of a quantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str)
shape (Sequence[int])
dtype (onnx.TensorProto.DataType)
- Return type:
Variable
- make_gs_quantized_weight(name, wq, dtype)
Create a GraphSurgeon tensor from a quantized weight tensor.
name is the desired _basename_ of the tensor.
- Parameters:
name (str)
wq (ndarray)
dtype – Data type of the quantized weight.
- Return type:
Constant
- make_gs_scale(name, scale)
Create a GraphSurgeon scale tensor from the given numpy array.
name is the desired _basename_ of the tensor.
- Parameters:
name (str)
scale (ndarray)
- Return type:
Constant
- make_gs_zp(name, shape, dtype)
Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.
name is the desired _basename_ of the tensor.
- Parameters:
name (str)
shape (Sequence[int])
dtype – Data type of the zero-point tensor.
- Return type:
Constant
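The make_gs_* helpers compose; below is a hedged sketch that assembles the constants and output variable for one dequantize node. The names, shapes, scalar zero-point shape, and the INT8 dtype choice are illustrative assumptions, not prescribed usage:

```python
import numpy as np
import onnx

from modelopt.onnx.quantization.qdq_utils import (
    make_gs_dequantize_node,
    make_gs_dequantize_output,
    make_gs_quantized_weight,
    make_gs_scale,
    make_gs_zp,
)

wq = np.zeros((128, 64), dtype=np.int8)  # placeholder pre-quantized weight

weight = make_gs_quantized_weight("fc1.weight", wq, onnx.TensorProto.INT8)
scale = make_gs_scale("fc1.weight", np.array(0.02, dtype=np.float32))
zp = make_gs_zp("fc1.weight", [], onnx.TensorProto.INT8)  # scalar zero point
out = make_gs_dequantize_output("fc1.weight", list(wq.shape), np.dtype(np.float32))

dq_node = make_gs_dequantize_node("fc1.weight", [weight, scale, zp], [out])
```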
- qdq_to_dq(onnx_model)
Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.
This function converts a model with QDQ (QuantizeLinear-DequantizeLinear) nodes into a model with only DQ nodes for weights. It:
1. Converts FP32/FP16 weights to INT8/FP8
2. Updates the graph to maintain proper connections
3. Removes redundant cast nodes in the quantized model (an additional optimization for diffusers)
- Parameters:
onnx_model (ModelProto) – ONNX model protobuf to convert
- Returns:
ONNX model protobuf with only DQ nodes for weights
- Raises:
ValueError – If the model is invalid or conversion fails
RuntimeError – If graph operations fail
- Return type:
ModelProto
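A round-trip sketch with placeholder file names:

```python
import onnx

from modelopt.onnx.quantization.qdq_utils import qdq_to_dq

model = onnx.load("model_qdq.onnx")
model = qdq_to_dq(model)  # weights stored as INT8/FP8, dequantized via DQ only
onnx.save(model, "model_dq_weights.onnx")
```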
- remove_graph_input_q(onnx_model)
Remove Q nodes from the inputs of a quantized ONNX model.
This supports generating quantized models with low-precision graph I/O.
- Parameters:
onnx_model (ModelProto) – ONNX model protobuf to convert
- Returns:
ONNX model protobuf with only DQ nodes at the inputs whenever possible.
- Raises:
ValueError – If the model is invalid or removal fails
RuntimeError – If graph operations fail
- Return type:
ModelProto
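The same shape of usage, with placeholder paths:

```python
import onnx

from modelopt.onnx.quantization.qdq_utils import remove_graph_input_q

model = onnx.load("model_qdq.onnx")
model = remove_graph_input_q(model)  # inputs now feed DQ nodes directly
onnx.save(model, "model_lowprec_io.onnx")
```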
- remove_input_dq_and_output_q(onnx_model, quantizable_custom_ops)
Remove DQ nodes from the inputs and Q nodes from the outputs of quantized custom ops for TensorRT compatibility.
TensorRT requires only Q nodes in the inputs and only DQ nodes in the outputs of custom ops. For more information, see https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html#q-dq-interaction-with-plugins
- Parameters:
onnx_model (ModelProto) – ONNX model protobuf to convert
quantizable_custom_ops (dict) – Dictionary mapping custom ops to the I/O indices at which Q and DQ deletions should be performed.
- Returns:
ONNX model protobuf with only Q nodes at the inputs and only DQ nodes at the outputs of custom ops.
- Raises:
ValueError – If the model is invalid or removal fails
RuntimeError – If graph operations fail
- Return type:
ModelProto
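A sketch with a hypothetical quantizable_custom_ops layout; the exact schema of this dictionary (op names and I/O index structure) should be taken from the ModelOpt source, not from this example:

```python
import onnx

from modelopt.onnx.quantization.qdq_utils import remove_input_dq_and_output_q

model = onnx.load("model_qdq.onnx")

# Hypothetical structure: custom op type -> I/O indices to clean up.
quantizable_custom_ops = {"MyCustomPlugin": {"inputs": [0], "outputs": [0]}}

model = remove_input_dq_and_output_q(model, quantizable_custom_ops)
onnx.save(model, "model_plugin_ready.onnx")
```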
- replace_scale_values(graph, act_scales_dict)
Replace scale values in the graph with values from the calibration cache.
- Parameters:
graph (GraphProto) – ONNX graph to modify
act_scales_dict (dict[str, float]) – Dictionary mapping scale tensor names to their new values
- Return type:
None
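A sketch with placeholder scale tensor names, e.g. as parsed from a TensorRT calibration cache; note the function operates on the GraphProto, not the ModelProto:

```python
import onnx

from modelopt.onnx.quantization.qdq_utils import replace_scale_values

model = onnx.load("model_qdq.onnx")

# Placeholder names: scale tensor name -> calibrated scale value.
act_scales_dict = {"input_quantizer_scale": 0.0123, "fc1_quantizer_scale": 0.0456}

replace_scale_values(model.graph, act_scales_dict)
onnx.save(model, "model_calibrated.onnx")
```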
- replace_zero_scale_with_smallest_nonzero(onnx_model)
Replace zero scale values with the smallest nonzero fp16 value in the ONNX model.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- update_attributes_for_per_channel_nodes(attributes=None, num_bits=4)
Get the attributes for per-channel nodes.
- Parameters:
attributes (dict[str, Any] | None)
num_bits (int)
- Return type:
dict[str, Any] | None
- use_trt_qdq_ops()
Globally set node names to TRT custom names.
- validate_scale_shape_for_per_channel_nodes(scale, attrs=None, num_bits=4)
Validate the shape of the scale tensor for per-channel nodes.
- Parameters:
scale (ndarray)
attrs (dict[str, Any] | None)
num_bits (int)
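A hedged sketch of the two per-channel helpers used together; the "axis" attribute key, the channel count, and the call pattern are assumptions rather than confirmed API details:

```python
import numpy as np

from modelopt.onnx.quantization.qdq_utils import (
    update_attributes_for_per_channel_nodes,
    validate_scale_shape_for_per_channel_nodes,
)

# Assumed attribute layout: per-channel quantization along output axis 0.
attrs = update_attributes_for_per_channel_nodes({"axis": 0}, num_bits=8)

# One scale per output channel (128 is a placeholder channel count).
scale = np.full(128, 0.02, dtype=np.float32)
validate_scale_shape_for_per_channel_nodes(scale, attrs, num_bits=8)
```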