qdq_utils

Various utils to support inserting Q/DQ nodes.

Functions

apply_column_major_transformation

Transpose quantized weights and scales in-place for column-major storage.

cast_initializer_to_dtype

Casts the initializer to the given dtype.

get_quantized_tensors

Get the names of all quantized tensors from an ONNX model.

get_tensor_dtype

Get the appropriate tensor dtype based on precision info and zero point presence.

has_qdq_nodes

Check if the ONNX graph already has QDQ nodes.

insert_dq_nodes

Insert new initializers and DQ nodes into the graph.

insert_pre_quant_scale_nodes

Insert new Mul nodes into the graph.

insert_qdq_nodes

Insert scales and QDQ nodes into the graph.

insert_transpose_nodes_for_column_major

Add a single Transpose node after each DequantizeLinear for column-major weights.

make_gs_awq_scale

Create a GraphSurgeon scale tensor from the given numpy array.

make_gs_dequantize_node

Create a GraphSurgeon Dequantize node.

make_gs_dequantize_output

Create a GraphSurgeon variable representing the output of a dequantize node.

make_gs_pre_quant_scale_node

Create a GraphSurgeon pre-quant scale (Mul) node.

make_gs_pre_quant_scale_output

Create a GraphSurgeon variable representing the output of a pre-quant scale node.

make_gs_quantize_node

Create a GraphSurgeon Quantize node.

make_gs_quantize_output

Create a GraphSurgeon variable representing the output of a quantize node.

make_gs_quantized_weight

Create a GraphSurgeon tensor from a quantized weight tensor.

make_gs_scale

Create a GraphSurgeon scale tensor from the given numpy array.

make_gs_zp

Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.

qdq_to_dq

Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.

remove_graph_input_q

Remove Q nodes from the inputs of a quantized ONNX model.

remove_input_dq_and_output_q

Remove DQ nodes from the inputs and Q nodes from the outputs of quantized custom ops for TensorRT compatibility.

replace_scale_values

Replace scale values in the graph with values from calibration cache.

replace_zero_scale_with_smallest_nonzero

Replace zero scale values with smallest nonzero fp16 value in the ONNX model.

update_attributes_for_per_channel_nodes

Update the attributes for per-channel nodes.

use_trt_qdq_ops

Globally set node names to TRT custom names.

validate_scale_shape_for_per_channel_nodes

Validate the shape of the scale tensor for per-channel nodes.

apply_column_major_transformation(gemm_weights_quantized, scales)

Transpose quantized weights and scales in-place for column-major storage.

Note: After calling this function and inserting DQ nodes with axis=1, you should call insert_transpose_nodes_for_column_major() on the graph.

Parameters:
  • gemm_weights_quantized (dict) – Dictionary mapping weight names to quantized weight arrays

  • scales (dict) – Dictionary mapping weight names to scale arrays

Return type:

None
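
A minimal end-to-end sketch of the column-major flow described in the note above, assuming the dictionaries gemm_weights_quantized and scales are already populated and that this module's functions are in scope (the import path and file names are assumptions):

    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("model.onnx"))  # file name illustrative

    # 1. Transpose each quantized weight and its scales in-place (W -> W^T).
    apply_column_major_transformation(gemm_weights_quantized, scales)

    # 2. Insert DQ nodes over the transposed weights.
    insert_dq_nodes(graph, scales, gemm_weights_quantized)

    # 3. Add a Transpose after each DQ so MatMul/Gemm still consumes W.
    insert_transpose_nodes_for_column_major(graph)

    onnx.save(gs.export_onnx(graph), "model_column_major.onnx")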

cast_initializer_to_dtype(node, dtype, initializer_map)

Casts the initializer to the given dtype.

Parameters:
  • node (NodeProto)

  • dtype (str)

  • initializer_map (dict[str, TensorProto])

get_quantized_tensors(onnx_model)

Get the names of all quantized tensors from an ONNX model.

This function identifies all DequantizeLinear nodes in the ONNX model and extracts the names of tensors being dequantized (the first input of each DequantizeLinear node, excluding scale and zero-point inputs).

Parameters:

onnx_model (ModelProto) – ONNX model protobuf to analyze

Returns:

Set of tensor names that are inputs to DequantizeLinear nodes (i.e., the tensors being dequantized)

Return type:

set[str]
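
A minimal usage sketch (the file name is illustrative):

    import onnx

    model = onnx.load("quantized_model.onnx")
    quantized = get_quantized_tensors(model)  # set[str]
    print(f"{len(quantized)} tensors feed a DequantizeLinear node")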

get_tensor_dtype(num_bits=4, has_zero_point=False)

Get the appropriate tensor dtype based on precision info and zero point presence.

Parameters:
  • num_bits (int) – Number of bits for quantization

  • has_zero_point (bool) – Whether the tensor has a zero point

Returns:

ONNX tensor data type constant

Return type:

int
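
Illustrative calls; the concrete return values depend on the implementation, so the constants in the comments below are assumptions, not guarantees:

    # Plausible ONNX TensorProto constants for each configuration.
    dtype_4bit = get_tensor_dtype(num_bits=4)                       # e.g. onnx.TensorProto.INT4
    dtype_8bit = get_tensor_dtype(num_bits=8, has_zero_point=True)  # e.g. onnx.TensorProto.UINT8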

has_qdq_nodes(onnx_model)

Check if the ONNX graph already has QDQ nodes.

Parameters:

onnx_model (ModelProto)

insert_dq_nodes(graph, scales, quantized_weights, attributes=None, zero_points=None, layer_info=None)

Insert new initializers and DQ nodes into the graph.

Parameters:
  • graph (Graph) – The graph to modify.

  • scales (dict[str, ndarray]) – A map from ONNX initializer name to the desired scale factor for that initializer.

  • quantized_weights (dict[str, ndarray]) – A map from ONNX initializer name to its quantized weight tensor.

  • attributes (dict[str, Any] | None) – Optional attributes to set on the inserted DQ nodes.

  • zero_points (dict[str, ndarray] | None) – Optional map from ONNX initializer name to its zero-point tensor.

  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to precision (old format) or to a layer configuration dict (new format with precision, block_size, axis).
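
A hypothetical sketch that wraps a single per-tensor INT8 initializer in a DQ node; the tensor names, shapes, and values are illustrative, and insert_dq_nodes is assumed to be in scope:

    import numpy as np
    import onnx
    import onnx_graphsurgeon as gs

    graph = gs.import_onnx(onnx.load("model.onnx"))

    # One scale and one quantized weight array per initializer name.
    scales = {"fc1.weight": np.array(0.02, dtype=np.float32)}
    quantized_weights = {"fc1.weight": np.zeros((128, 64), dtype=np.int8)}

    insert_dq_nodes(graph, scales, quantized_weights)
    graph.cleanup().toposort()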

insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)

Insert new Mul nodes into the graph.

Parameters:
  • graph (Graph) – The graph to modify.

  • input_tensors (dict[str, str]) – A dictionary of weight tensor names mapped to corresponding input tensor names

  • pre_quant_scale (dict[str, ndarray]) – A map from ONNX input tensor name to corresponding pre-quant scale.
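
A hypothetical sketch, continuing with the graph from the previous example; the mapping keys and the scale shape are illustrative:

    import numpy as np

    # Weight tensor name -> name of the activation tensor feeding its consumer.
    input_tensors = {"fc1.weight": "fc1_input"}
    # Activation tensor name -> pre-quant scale to multiply it by.
    pre_quant_scale = {"fc1_input": np.full((64,), 0.5, dtype=np.float32)}

    insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)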

insert_qdq_nodes(graph, scales, weight_map, layer_info=None)

Insert scales and QDQ nodes into the graph.

Parameters:
  • graph (Graph) – The graph to modify.

  • scales (dict[str, ndarray]) – A map from ONNX initializer name to desired scale factor for that initializer.

  • weight_map (dict[str, Tensor]) – A map from ONNX initializer name to graphsurgeon tensor.

  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to precision (old format) or to layer configuration dict (new format with precision, block_size, axis).

insert_transpose_nodes_for_column_major(graph)

Add a single Transpose node after each DequantizeLinear for column-major weights.

This implements the simple transformation: A @ B = A @ ((B^T)^T) where B^T is stored in the DequantizeLinear node, and we add a Transpose node after DQ to recover B before the MatMul.

Graph transformation:

Before: DQ(W) -> MatMul/Gemm
After:  DQ(W^T) -> Transpose (recovers W) -> MatMul/Gemm

Parameters:

graph (Graph) – ONNX GraphSurgeon graph to modify in-place

make_gs_awq_scale(name, scale)

Create a GraphSurgeon scale tensor from the given numpy array.

name is the desired _basename_ of the tensor.

Parameters:
  • name (str)

  • scale (ndarray)

Return type:

Constant

make_gs_dequantize_node(name, inputs, outputs, attributes=None)

Create a GraphSurgeon Dequantize node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • inputs (Sequence[Tensor])

  • outputs (Sequence[Tensor])

  • attributes (dict[str, Any] | None)

Return type:

Node

make_gs_dequantize_output(name, shape, dtype)

Create a GraphSurgeon variable representing the output of a dequantize node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • shape (Sequence[int])

  • dtype (dtype)

Return type:

Variable

make_gs_pre_quant_scale_node(name, inputs, outputs)

Create a GraphSurgeon pre-quant scale (Mul) node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • inputs (Sequence[Tensor])

  • outputs (Sequence[Tensor])

Return type:

Node

make_gs_pre_quant_scale_output(name, shape, dtype)

Create a GraphSurgeon variable representing the output of a pre-quant scale node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • shape (Sequence[int])

  • dtype (dtype)

Return type:

Variable

make_gs_quantize_node(name, inputs, outputs)

Create a GraphSurgeon Quantize node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • inputs (Sequence[Tensor])

  • outputs (Sequence[Tensor])

Return type:

Node

make_gs_quantize_output(name, shape, dtype)

Create a GraphSurgeon variable representing the output of a quantize node.

name is the desired _basename_ of the node.

Parameters:
  • name (str)

  • shape (Sequence[int])

  • dtype (onnx.TensorProto.DataType)

Return type:

Variable

make_gs_quantized_weight(name, wq, dtype)

Create a GraphSurgeon tensor from a quantized weight tensor.

name is the desired _basename_ of the tensor.

Parameters:
  • name (str)

  • wq (ndarray)

  • dtype – Data type of the quantized weight tensor.

Return type:

Constant

make_gs_scale(name, scale)

Create a GraphSurgeon scale tensor from the given numpy array.

name is the desired _basename_ of the tensor.

Parameters:
  • name (str)

  • scale (ndarray)

Return type:

Constant

make_gs_zp(name, shape, dtype)

Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.

name is the desired _basename_ of the tensor.

Parameters:
  • name (str)

  • shape (Sequence[int])

  • dtype – Data type of the zero-point tensor.

Return type:

Constant

qdq_to_dq(onnx_model)

Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.

This function converts a model with QDQ (QuantizeLinear-DequantizeLinear) nodes to a model with only DQ nodes for weights. It:

  1. Converts FP32/FP16 weights to INT8/FP8
  2. Updates the graph to maintain proper connections
  3. Removes redundant Cast nodes in the quantized model (an additional optimization for diffusers)

Parameters:

onnx_model (ModelProto) – ONNX model protobuf to convert

Returns:

ONNX model protobuf with only DQ nodes for weights

Raises:
  • ValueError – If the model is invalid or conversion fails

  • RuntimeError – If graph operations fail

Return type:

ModelProto
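
A minimal usage sketch (file names are illustrative):

    import onnx

    model = onnx.load("model_qdq.onnx")
    dq_only_model = qdq_to_dq(model)
    onnx.save(dq_only_model, "model_dq_only.onnx")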

remove_graph_input_q(onnx_model)

Remove Q nodes from the inputs of a quantized ONNX model.

This supports generating quantized models with low-precision graph I/O.

Parameters:

onnx_model (ModelProto) – ONNX model protobuf to convert

Returns:

ONNX model protobuf with only DQ nodes at the inputs whenever possible.

Raises:
  • ValueError – If the model is invalid or removal fails

  • RuntimeError – If graph operations fail

Return type:

ModelProto

remove_input_dq_and_output_q(onnx_model, quantizable_custom_ops)

Remove DQ nodes from the inputs and Q nodes from the outputs of quantized custom ops for TensorRT compatibility.

TensorRT requires only Q nodes in the inputs and only DQ nodes in the outputs of custom ops. For more information, see https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html#q-dq-interaction-with-plugins

Parameters:
  • onnx_model (ModelProto) – ONNX model protobuf to convert

  • quantizable_custom_ops (dict) – Dictionary of custom ops and the I/O indices at which to perform Q and DQ deletions as needed.

Returns:

ONNX model protobuf with only Q nodes at the inputs and only DQ nodes at the outputs of custom ops.

Raises:
  • ValueError – If the model is invalid or removal fails

  • RuntimeError – If graph operations fail

Return type:

ModelProto
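
A hypothetical call, continuing from the previous sketch; the exact schema of quantizable_custom_ops (custom op type mapped to the input/output indices to clean up) is an assumption based on the parameter description:

    # Schema assumed for illustration, not taken from the source.
    quantizable_custom_ops = {"MyCustomPlugin": {"inputs": [0], "outputs": [0]}}
    model = remove_input_dq_and_output_q(model, quantizable_custom_ops)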

replace_scale_values(graph, act_scales_dict)

Replace scale values in the graph with values from calibration cache.

Parameters:
  • graph (GraphProto) – ONNX graph to modify

  • act_scales_dict (dict[str, float]) – Dictionary mapping scale tensor names to their new values

Return type:

None
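
A minimal sketch, assuming model is a loaded ModelProto; the scale tensor names are illustrative and must match scale initializers already present in the graph:

    # Scale tensor name -> calibrated scale value from the calibration cache.
    act_scales_dict = {"fc1_input_scale": 0.0123, "fc2_input_scale": 0.0456}
    replace_scale_values(model.graph, act_scales_dict)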

replace_zero_scale_with_smallest_nonzero(onnx_model)

Replace zero scale values with smallest nonzero fp16 value in the ONNX model.

Parameters:

onnx_model (ModelProto)

Return type:

ModelProto
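
For reference, the smallest nonzero fp16 value is the subnormal 2**-24, which can be checked with numpy:

    import numpy as np

    print(np.finfo(np.float16).smallest_subnormal)  # 6e-08, i.e. 2**-24

    model = replace_zero_scale_with_smallest_nonzero(model)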

update_attributes_for_per_channel_nodes(attributes=None, num_bits=4)

Update the attributes for per-channel nodes.

Parameters:
  • attributes (dict[str, Any] | None)

  • num_bits (int)

Return type:

dict[str, Any] | None

use_trt_qdq_ops()

Globally set node names to TRT custom names.
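
A short sketch of where this fits in the flow; the exact custom op names are TensorRT-specific, and graph, scales, and weight_map are assumed from the earlier examples:

    # Switch this module's node factories to TRT custom Q/DQ op names
    # before any Q/DQ nodes are created.
    use_trt_qdq_ops()
    insert_qdq_nodes(graph, scales, weight_map)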

validate_scale_shape_for_per_channel_nodes(scale, attrs=None, num_bits=4)

Validate the shape of the scale tensor for per-channel nodes.

Parameters:
  • scale (ndarray)

  • attrs (dict[str, Any] | None)

  • num_bits (int)