qdq_utils
Various utils to support inserting Q/DQ nodes.
Functions
| fp4qdq_to_2dq | Convert FP32/FP16 weights of the given ONNX model to FP4 weights and scaling factors. |
| get_tensor_dtype | Get the appropriate tensor dtype based on precision info and zero point presence. |
| has_qdq_nodes | Check if the ONNX graph already has QDQ nodes. |
| insert_dq_nodes | Insert new initializers and DQ nodes into graph. |
| insert_pre_quant_scale_nodes | Insert new mul nodes into graph. |
| insert_qdq_nodes | Insert scales and QDQ nodes into graph. |
| make_gs_awq_scale | Create a GraphSurgeon scale tensor from the given numpy array. |
| make_gs_dequantize_node | Create a GraphSurgeon Dequantize node. |
| make_gs_dequantize_output | Create a GraphSurgeon variable representing the output of a dequantize node. |
| make_gs_pre_quant_scale_node | Create a GraphSurgeon pre-quant scale (Mul) node. |
| make_gs_pre_quant_scale_output | Create a GraphSurgeon variable representing the output of a pre-quant scale node. |
| make_gs_quantize_node | Create a GraphSurgeon Quantize node. |
| make_gs_quantize_output | Create a GraphSurgeon variable representing the output of a quantize node. |
| make_gs_quantized_weight | Create a GraphSurgeon tensor from a quantized weight tensor. |
| make_gs_scale | Create a GraphSurgeon scale tensor from the given numpy array. |
| make_gs_zp | Create a GraphSurgeon zero-point tensor of all zeroes with the given shape. |
| qdq_to_dq | Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights. |
| quantize_weights_to_int4 | Converts ONNX model weights from higher precision to INT4 precision with graph optimization. |
| quantize_weights_to_mxfp8 | Converts the weights to FP8 precision using MXFP8 quantization. |
| remove_graph_input_q | Remove Q nodes from the inputs of a quantized ONNX model. |
| remove_input_dq_and_output_q | Remove DQ nodes from the input and Q from the output of quantized custom ops for TensorRT compatibility. |
| replace_fp4qdq_with_2dq | Replaces the given node in the ONNX graph with a subgraph consisting of two DequantizeLinear nodes. |
| replace_scale_values | Replace scale values in the graph with values from calibration cache. |
| update_attributes_for_per_channel_nodes | Get the attributes for per-channel nodes. |
| use_trt_qdq_ops | Globally set node names to TRT custom names. |
| validate_scale_shape_for_per_channel_nodes | Validate the shape of the scale tensor for per-channel nodes. |
- fp4qdq_to_2dq(onnx_model, verbose=False)
- Convert FP32/FP16 weights of the given ONNX model to FP4 weights and scaling factors. - TRT_FP4QDQ nodes are removed from the weights and replaced by two DQ nodes that carry the converted FP4 weights and scaling factors in the output model. - Parameters:
- onnx_model (ModelProto) – ONNX model protobuf. 
- verbose (bool) 
 
- Returns:
- ONNX model protobuf with DQ nodes for weights and DynQ + DQ nodes for activations. 
- Return type:
- ModelProto 
 
- get_tensor_dtype(num_bits=4, has_zero_point=False)
- Get the appropriate tensor dtype based on precision info and zero point presence. - Parameters:
- num_bits (int) – Number of bits for quantization 
- has_zero_point (bool) – Whether the tensor has a zero point 
 
- Returns:
- ONNX tensor data type constant 
- Return type:
- int 
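As a rough illustration of the kind of mapping described above, the sketch below picks a dtype name from the bit width and the zero-point flag. The rule used here (unsigned types when a zero point is present, signed types otherwise), the supported bit widths, and the helper name `pick_dtype_name` are assumptions made for illustration, not the library's confirmed behavior. The documented function returns an ONNX data type constant (an int) rather than a string.

```python
def pick_dtype_name(num_bits: int = 4, has_zero_point: bool = False) -> str:
    """Hypothetical helper: choose an ONNX dtype name for a quantized tensor."""
    # Assumption: a zero point implies an unsigned (asymmetric) integer type,
    # and no zero point implies a signed (symmetric) one.
    table = {
        (4, False): "INT4",
        (4, True): "UINT4",
        (8, False): "INT8",
        (8, True): "UINT8",
    }
    key = (num_bits, has_zero_point)
    if key not in table:
        raise ValueError(f"unsupported num_bits: {num_bits}")
    return table[key]
```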
 
- has_qdq_nodes(onnx_model)
- Check if the ONNX graph already has QDQ nodes. - Parameters:
- onnx_model (ModelProto) 
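A check of this kind can be sketched as a scan over node op types. The helper `graph_has_qdq` and the stand-in `fake_model` below are hypothetical; the sketch only looks for the standard `QuantizeLinear` and `DequantizeLinear` ops, and whether the library also matches TensorRT custom op names is not stated here.

```python
from types import SimpleNamespace

QDQ_OPS = {"QuantizeLinear", "DequantizeLinear"}

def graph_has_qdq(model) -> bool:
    """Return True if any node in model.graph.node is a Q or DQ op."""
    return any(node.op_type in QDQ_OPS for node in model.graph.node)

def fake_model(*op_types):
    """Stand-in for a ModelProto: provides only the attributes the check reads."""
    nodes = [SimpleNamespace(op_type=t) for t in op_types]
    return SimpleNamespace(graph=SimpleNamespace(node=nodes))

plain = fake_model("Conv", "Relu")
quantized = fake_model("DequantizeLinear", "MatMul")
print(graph_has_qdq(plain), graph_has_qdq(quantized))  # prints: False True
```

Because the helper only reads `model.graph.node` and `node.op_type`, the same function should also accept a real ONNX `ModelProto`.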
 
- insert_dq_nodes(graph, scales, quantized_weights, attributes=None, zero_points=None, layer_info=None)
- Insert new initializers and DQ nodes into graph. - Parameters:
- graph (Graph) – The graph to modify. 
- scales (dict[str, ndarray]) – A map from ONNX initializer name to desired scale factor for that initializer. 
- quantized_weights (dict[str, ndarray]) – A map from ONNX initializer name to quantized weight tensor. 
- attributes (dict[str, Any] | None) 
- zero_points (dict[str, ndarray] | None) 
- layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to precision (old format) or to layer configuration dict (new format with precision, block_size, axis). 
 
 
- insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)
- Insert new mul nodes into graph. - Parameters:
- graph (Graph) – The graph to modify. 
- input_tensors (dict[str, str]) – A dictionary of weight tensor names mapped to corresponding input tensor names 
- pre_quant_scale (dict[str, ndarray]) – A map from ONNX input tensor name to corresponding pre-quant scale. 
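To see why a Mul node in front of a quantized weight can leave the result unchanged, consider the sketch below. It assumes an AWQ-style setup in which the weights have already been divided by the same per-channel scale; that compensation, the helper name `scaled_dot`, and the sample values are illustrative assumptions, not taken from the library.

```python
def scaled_dot(x, w, pre_quant_scale):
    """Dot product with a per-channel Mul applied to the activation first."""
    # The inserted Mul node: activation times pre-quant scale.
    x_scaled = [xi * si for xi, si in zip(x, pre_quant_scale)]
    # Assumed compensation: weights were divided by the same scale offline.
    w_comp = [wi / si for wi, si in zip(w, pre_quant_scale)]
    return sum(a * b for a, b in zip(x_scaled, w_comp))

x = [1.0, 2.0, 4.0]
w = [0.5, 0.25, 2.0]
s = [2.0, 0.5, 4.0]

plain = sum(a * b for a, b in zip(x, w))
print(plain, scaled_dot(x, w, s))  # prints: 9.0 9.0
```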
 
 
- insert_qdq_nodes(graph, scales, weight_map, layer_info=None)
- Insert scales and QDQ nodes into graph. - Parameters:
- graph (Graph) – The graph to modify. 
- scales (dict[str, ndarray]) – A map from ONNX initializer name to desired scale factor for that initializer. 
- weight_map (dict[str, Tensor]) – A map from ONNX initializer name to graphsurgeon tensor. 
- layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to precision (old format) or to layer configuration dict (new format with precision, block_size, axis). 
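The two `layer_info` formats can be handled uniformly by normalizing them first. The sketch below is one possible way to do that; the helper name `normalize_layer_info`, the tensor names, and the precision strings are hypothetical, and only the keys `precision`, `block_size`, and `axis` come from the description above.

```python
def normalize_layer_info(layer_info):
    """Hypothetical helper: bring both layer_info formats to one dict shape."""
    normalized = {}
    for tensor_name, entry in (layer_info or {}).items():
        if isinstance(entry, dict):
            # New format: a configuration dict per tensor.
            normalized[tensor_name] = {
                "precision": entry.get("precision"),
                "block_size": entry.get("block_size"),
                "axis": entry.get("axis"),
            }
        else:
            # Old format: the value is the precision itself.
            normalized[tensor_name] = {
                "precision": entry,
                "block_size": None,
                "axis": None,
            }
    return normalized

old_style = {"layer0.weight": "int8"}  # hypothetical names and values
new_style = {"layer0.weight": {"precision": "int4", "block_size": 128, "axis": 0}}
print(normalize_layer_info(old_style))
print(normalize_layer_info(new_style))
```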
 
 
- make_gs_awq_scale(name, scale)
- Create a GraphSurgeon scale tensor from the given numpy array. - name is the desired _basename_ of the tensor. - Parameters:
- name (str) 
- scale (ndarray) 
 
- Return type:
- Constant 
 
- make_gs_dequantize_node(name, inputs, outputs, attributes=None)
- Create a GraphSurgeon Dequantize node. - name is the desired _basename_ of the node. - Parameters:
- name (str) 
- inputs (Sequence[Tensor]) 
- outputs (Sequence[Tensor]) 
- attributes (dict[str, Any] | None) 
 
- Return type:
- Node 
 
- make_gs_dequantize_output(name, shape, dtype)
- Create a GraphSurgeon variable representing the output of a dequantize node. - name is the desired _basename_ of the node. - Parameters:
- name (str) 
- shape (Sequence[int]) 
- dtype (dtype) 
 
- Return type:
- Variable 
 
- make_gs_pre_quant_scale_node(name, inputs, outputs)
- Create a GraphSurgeon pre-quant scale (Mul) node. - name is the desired _basename_ of the node. - Parameters:
- name (str) 
- inputs (Sequence[Tensor]) 
- outputs (Sequence[Tensor]) 
 
- Return type:
- Node 
 
- make_gs_pre_quant_scale_output(name, shape, dtype)
- Create a GraphSurgeon variable representing the output of a pre-quant scale node. - name is the desired _basename_ of the node. - Parameters:
- name (str) 
- shape (Sequence[int]) 
- dtype (dtype) 
 
- Return type:
- Variable 
 
- make_gs_quantize_node(name, inputs, outputs)
- Create a GraphSurgeon Quantize node. - name is the desired _basename_ of the node. - Parameters:
- name (str) 
- inputs (Sequence[Tensor]) 
- outputs (Sequence[Tensor]) 
 
- Return type:
- Node 
 
- make_gs_quantize_output(name, shape, dtype)
- Create a GraphSurgeon variable representing the output of a quantize node. - name is the desired _basename_ of the node. - Parameters:
- name (str) 
- shape (Sequence[int]) 
- dtype (TensorProto.DataType) 
 
- Return type:
- Variable 
 
- make_gs_quantized_weight(name, wq, dtype)
- Create a GraphSurgeon tensor from a quantized weight tensor. - name is the desired _basename_ of the tensor. - Parameters:
- name (str) 
- wq (ndarray) 
- dtype – Data type of the quantized weight tensor. 
 
- Return type:
- Constant 
 
- make_gs_scale(name, scale)
- Create a GraphSurgeon scale tensor from the given numpy array. - name is the desired _basename_ of the tensor. - Parameters:
- name (str) 
- scale (ndarray) 
 
- Return type:
- Constant 
 
- make_gs_zp(name, shape, dtype)
- Create a GraphSurgeon zero-point tensor of all zeroes with the given shape. - name is the desired _basename_ of the tensor. - Parameters:
- name (str) 
- shape (Sequence[int]) 
- dtype – Data type of the zero-point tensor. 
 
- Return type:
- Constant 
 
- qdq_to_dq(onnx_model)
- Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.
- This function converts a model with QDQ (QuantizeLinear-DequantizeLinear) nodes to a model with only DQ nodes for weights. It:
  1. Converts FP32/FP16 weights to INT8/FP8.
  2. Updates the graph to maintain proper connections.
  3. Removes redundant cast nodes in the quantized model (additional optimization for diffusers).
- Parameters:
- onnx_model (ModelProto) – ONNX model protobuf to convert 
- Returns:
- ONNX model protobuf with only DQ nodes for weights 
- Raises:
- ValueError – If the model is invalid or conversion fails 
- RuntimeError – If graph operations fail 
 
- Return type:
- ModelProto 
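The idea behind folding Q into the weights can be sketched with the standard QuantizeLinear and DequantizeLinear formulas for INT8. The helper names and sample values below are illustrative, and the sketch covers only per-tensor INT8 on a flat list; it does not reflect the library's graph handling.

```python
def quantize_linear(x, scale, zero_point=0, qmin=-128, qmax=127):
    """INT8 QuantizeLinear semantics: round half to even, then saturate."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize_linear(q, scale, zero_point=0):
    """DequantizeLinear semantics: (q - zero_point) * scale."""
    return (q - zero_point) * scale

weights = [0.5, -1.25, 3.0, 100.0]
scale = 0.25

# QDQ form: the graph stores float weights and applies Q then DQ.
qdq_out = [dequantize_linear(quantize_linear(w, scale), scale) for w in weights]

# DQ-only form: Q is applied once offline and the graph stores INT8 values.
stored_int8 = [quantize_linear(w, scale) for w in weights]
dq_out = [dequantize_linear(q, scale) for q in stored_int8]

print(stored_int8)        # prints: [2, -5, 12, 127]
print(qdq_out == dq_out)  # prints: True
```

Both forms produce the same dequantized values, while the DQ-only form stores the weights in the narrower integer type.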
 
- quantize_weights_to_int4(onnx_model)
- Converts ONNX model weights from higher precision to INT4 precision with graph optimization.
- This function performs a comprehensive transformation of quantized weights in an ONNX model:
  1. Identifies DequantizeLinear nodes that represent quantized weights.
  2. Extracts and processes weights and their corresponding scales.
  3. Simplifies the graph by removing unnecessary Reshape/Transpose operations.
  4. Converts weights to INT4 precision while maintaining numerical accuracy.
  5. Updates Cast operations to use float16 instead of float32.
- The transformation optimizes the typical pattern DequantizeLinear -> Reshape -> Transpose -> MatMul/Gemm into the simplified pattern DequantizeLinear -> MatMul/Gemm.
- Parameters:
- onnx_model (onnx.ModelProto) – Input ONNX model containing quantized weights. 
- Returns:
- ONNX model with weights converted to INT4 precision 
- Return type:
- onnx.ModelProto 
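Storing INT4 weights usually means packing two 4-bit values into each byte. The sketch below follows the ONNX 4-bit layout (first element in the low nibble, second in the high nibble); the helper names `pack_int4` and `unpack_int4` are hypothetical, and whether the library packs weights exactly this way is an assumption.

```python
def pack_int4(values):
    """Pack signed 4-bit integers two per byte, first element in the low nibble."""
    if any(not -8 <= v <= 7 for v in values):
        raise ValueError("INT4 values must be in [-8, 7]")
    nibbles = [v & 0xF for v in values]  # two's complement, 4 bits each
    if len(nibbles) % 2:
        nibbles.append(0)  # pad the final byte with a zero nibble
    return bytes(lo | (hi << 4) for lo, hi in zip(nibbles[0::2], nibbles[1::2]))

def unpack_int4(packed, count):
    """Inverse of pack_int4; count is the number of original values."""
    out = []
    for byte in packed:
        for nib in (byte & 0xF, byte >> 4):
            out.append(nib - 16 if nib >= 8 else nib)  # restore the sign
    return out[:count]

packed = pack_int4([1, -1, 7])
print(packed.hex())            # prints: f107
print(unpack_int4(packed, 3))  # prints: [1, -1, 7]
```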
 
- quantize_weights_to_mxfp8(onnx_model)
- Converts the weights to FP8 precision using MXFP8 quantization. - For TRT_MXFP8DynamicQuantize, we update the output type to FP8. For TRT_MXFP8DequantizeLinear, we compute the scales in e8m0 format and save them as a new initializer. We then expand the scale to the same shape as the weight and divide the weight by the scale to get the FP8 weights. - Parameters:
- onnx_model (ModelProto) – ONNX model protobuf. 
 
- Returns:
- ONNX model protobuf with weights quantized to FP8 precision using MXFP8 quantization. 
- Return type:
- ModelProto 
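The scale computation described above can be sketched for a single block of weights. The rule used here (a power-of-two scale chosen as the ceiling of log2 of the block's absolute maximum over the FP8 E4M3 maximum of 448, stored with a bias of 127) is an assumption about a typical MXFP8 scheme, and the helper names are hypothetical. The final rounding of the scaled values to FP8 is omitted.

```python
import math

E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
E8M0_BIAS = 127   # e8m0 stores a biased power-of-two exponent in one byte

def e8m0_scale_for_block(block):
    """Return (stored_exponent_byte, scale) for one block of weights."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        exponent = 0  # assumption: an all-zero block uses a scale of 1.0
    else:
        exponent = math.ceil(math.log2(amax / E4M3_MAX))
    exponent = max(-127, min(127, exponent))
    return exponent + E8M0_BIAS, 2.0 ** exponent

def scale_block(block):
    """Divide each weight by the block scale so it fits the E4M3 range."""
    byte, scale = e8m0_scale_for_block(block)
    return byte, [v / scale for v in block]

print(scale_block([896.0, -1.0]))  # prints: (128, [448.0, -0.5])
```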
 
- remove_graph_input_q(onnx_model)
- Remove Q nodes from the inputs of a quantized ONNX model. - This supports generating quantized models with low-precision graph I/O. - Parameters:
- onnx_model (ModelProto) – ONNX model protobuf to convert 
- Returns:
- ONNX model protobuf with only DQ in the inputs whenever possible. 
- Raises:
- ValueError – If the model is invalid or removal fails 
- RuntimeError – If graph operations fail 
 
- Return type:
- ModelProto 
 
- remove_input_dq_and_output_q(onnx_model, quantizable_custom_ops)
- Remove DQ nodes from the input and Q from the output of quantized custom ops for TensorRT compatibility. - TensorRT requires only Q nodes in the inputs and only DQ nodes in the outputs of custom ops. For more information, see https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html#q-dq-interaction-with-plugins - Parameters:
- onnx_model (ModelProto) – ONNX model protobuf to convert 
- quantizable_custom_ops (dict) – dictionary of custom ops and I/O indices to perform Q and DQ deletions as needed. 
 
- Returns:
- ONNX model protobuf with only Q in the inputs and only DQ in the outputs of custom ops. 
- Raises:
- ValueError – If the model is invalid or removal fails 
- RuntimeError – If graph operations fail 
 
- Return type:
- ModelProto 
 
- replace_fp4qdq_with_2dq(graph, node, initializer_indices, value_info_map, graph_inputs, w_f4, sw_f32_per_tensor, sw_f8_per_block, precision_dtype, block_size)
- Replaces the given node in the ONNX graph with a subgraph consisting of two DequantizeLinear nodes. - Parameters:
- graph (GraphProto) – The ONNX graph containing the node to replace. 
- node (NodeProto) – The node to be replaced. 
- initializer_indices (dict[str, int]) – A dictionary mapping initializer names to their indices in the graph. 
- value_info_map (dict[str, ValueInfoProto]) – A dictionary mapping value info names to their ValueInfoProto objects. 
- graph_inputs (set[str]) – A set of graph input names. 
- w_f4 (ndarray) – NumPy array of FP4 weights. 
- sw_f32_per_tensor (ndarray) – NumPy array holding the per-tensor FP32 scaling factor. 
- sw_f8_per_block (ndarray) – NumPy array of per-block FP8 scaling factors. 
- precision_dtype (str) – The precision of the weights. 
- block_size (int) – Block size used in block quantization. 
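One way to read the two-DequantizeLinear subgraph is as a two-stage dequantization: the per-block scales are first rescaled by the per-tensor scale, and the FP4 weights are then multiplied by the resulting block scales. The sketch below works on plain Python lists under that assumption. The helper names are hypothetical, the per-block scales are passed as already-decoded floats rather than FP8 bytes, and the exact wiring of the real subgraph is not confirmed here.

```python
# Magnitudes representable by FP4 E2M1, indexed by the low three bits.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (bit 3 = sign, bits 0-2 = magnitude index)."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude

def two_stage_dequantize(w_f4_codes, sw_per_block, sw_per_tensor, block_size):
    # First DQ (assumed): per-block scales times the per-tensor scale.
    block_scales = [s * sw_per_tensor for s in sw_per_block]
    # Second DQ (assumed): each weight times the scale of its block.
    return [
        decode_e2m1(code) * block_scales[i // block_size]
        for i, code in enumerate(w_f4_codes)
    ]

codes = [0x2, 0xA, 0x7, 0x1]  # decodes to 1.0, -1.0, 6.0, 0.5
print(two_stage_dequantize(codes, [2.0, 4.0], 0.5, block_size=2))
# prints: [1.0, -1.0, 12.0, 1.0]
```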
 
 
- replace_scale_values(graph, act_scales_dict)
- Replace scale values in the graph with values from calibration cache. - Parameters:
- graph (GraphProto) – ONNX graph to modify 
- act_scales_dict (dict[str, float]) – Dictionary mapping scale tensor names to their new values 
 
- Return type:
- None 
 
- update_attributes_for_per_channel_nodes(attributes=None, num_bits=4)
- Get the attributes for per-channel nodes. - Parameters:
- attributes (dict[str, Any] | None) 
- num_bits (int) 
 
- Return type:
- dict[str, Any] | None 
 
- use_trt_qdq_ops()
- Globally set node names to TRT custom names. 
- validate_scale_shape_for_per_channel_nodes(scale, attrs=None, num_bits=4)
- Validate the shape of the scale tensor for per-channel nodes. - Parameters:
- scale (ndarray) 
- attrs (dict[str, Any] | None) 
- num_bits (int)