qdq_utils

Various utils to support inserting Q/DQ nodes.

Functions

fp4qdq_to_2dq

Convert FP32/FP16 weights of the given ONNX model to FP4 weights and scaling factors.

insert_dq_nodes

Insert new initializers and DQ nodes into graph.

insert_pre_quant_scale_nodes

Insert new mul nodes into graph.

insert_qdq_nodes

Insert scales and QDQ nodes into graph.

make_gs_awq_scale

Create a GraphSurgeon scale tensor from the given numpy array.

make_gs_dequantize_node

Create a GraphSurgeon Dequantize node.

make_gs_dequantize_output

Create a GraphSurgeon variable representing the output of a dequantize node.

make_gs_pre_quant_scale_node

Create a GraphSurgeon pre-quant scale (Mul) node.

make_gs_pre_quant_scale_output

Create a GraphSurgeon variable representing the output of a pre-quant scale node.

make_gs_quantize_node

Create a GraphSurgeon Quantize node.

make_gs_quantize_output

Create a GraphSurgeon variable representing the output of a quantize node.

make_gs_quantized_weight

Create a GraphSurgeon tensor from a quantized weight tensor.

make_gs_scale

Create a GraphSurgeon scale tensor from the given numpy array.

make_gs_zp

Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.

qdq_to_dq

Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.

replace_fp4qdq_with_2dq

Replaces the given node in the ONNX graph with a subgraph consisting of two DequantizeLinear nodes.

replace_scale_values

Replaces the scale values with those from the calibration cache.

use_trt_qdq_ops

Globally set Q/DQ node types to TRT custom op names.

fp4qdq_to_2dq(onnx_model)

Convert FP32/FP16 weights of the given ONNX model to FP4 weights and scaling factors.

TRT_FP4QDQ nodes are removed, and in the output model each affected weight is instead consumed through two DQ nodes carrying the converted FP4 weights and scaling factors.

Parameters:

onnx_model (ModelProto) – ONNX model protobuf.

Returns:

ONNX model protobuf with DQ nodes for weights and DynQ + DQ nodes for activations.

Return type:

ModelProto
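
For reference, a minimal usage sketch follows. The file names and the qdq_utils import path are illustrative assumptions, not part of this API:

    import onnx

    import qdq_utils  # assumed import path for this module

    # Load a model whose FP4 quantization is expressed with TRT_FP4QDQ nodes.
    model = onnx.load("model_fp4qdq.onnx")

    # Rewrite it so each weight is stored in FP4 alongside its scaling factors
    # and is consumed through a pair of DequantizeLinear (DQ) nodes.
    model_2dq = qdq_utils.fp4qdq_to_2dq(model)
    onnx.save(model_2dq, "model_fp4_2dq.onnx")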

insert_dq_nodes(graph, scales, quantized_weights, attributes=None, zero_points=None)

Insert new initializers and DQ nodes into graph.

Parameters:
  • graph (Graph) – The graph to modify.

  • scales (Dict[str, ndarray]) – A map from ONNX initializer name to desired scale factor for that initializer.

  • quantized_weights (Dict[str, ndarray]) – A map from ONNX initializer name to its quantized weight tensor.

  • attributes (Dict[str, Any]) – Optional attributes to set on the inserted DQ nodes.

  • zero_points (Dict[str, ndarray] | None) – Optional map from ONNX initializer name to its zero-point tensor.
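
A minimal usage sketch, assuming the qdq_utils import path and illustrative tensor names, shapes, and scale values:

    import numpy as np
    import onnx
    import onnx_graphsurgeon as gs

    import qdq_utils  # assumed import path for this module

    graph = gs.import_onnx(onnx.load("model.onnx"))

    # Hypothetical per-initializer data: already-quantized weights and the
    # scale factors needed to dequantize them.
    quantized_weights = {"fc1.weight": np.zeros((128, 64), dtype=np.int8)}
    scales = {"fc1.weight": np.array(0.02, dtype=np.float32)}

    qdq_utils.insert_dq_nodes(graph, scales, quantized_weights)
    onnx.save(gs.export_onnx(graph), "model_dq.onnx")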

insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)

Insert new mul nodes into graph.

Parameters:
  • graph (Graph) – The graph to modify.

  • input_tensors (Dict[str, str]) – A map from weight tensor name to its corresponding input tensor name.

  • pre_quant_scale (Dict[str, ndarray]) – A map from ONNX input tensor name to corresponding pre-quant scale.
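
A minimal sketch of the expected inputs, assuming the qdq_utils import path; the tensor names and scale values are illustrative:

    import numpy as np
    import onnx
    import onnx_graphsurgeon as gs

    import qdq_utils  # assumed import path for this module

    graph = gs.import_onnx(onnx.load("model.onnx"))

    # Map each weight tensor name to the input tensor it pairs with, and give
    # each such input a pre-quant scale; one Mul node is inserted per input.
    input_tensors = {"fc1.weight": "fc1_input"}
    pre_quant_scale = {"fc1_input": np.full((64,), 0.5, dtype=np.float32)}

    qdq_utils.insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)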

insert_qdq_nodes(graph, scales, weight_map)

Insert scales and QDQ nodes into graph.

Parameters:
  • graph (Graph) – The graph to modify.

  • scales (Dict[str, ndarray]) – A map from ONNX initializer name to desired scale factor for that initializer.

  • weight_map (Dict[str, Tensor]) – A map from ONNX initializer name to graphsurgeon tensor.
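
A minimal sketch, assuming the qdq_utils import path; the uniform scale value is illustrative:

    import numpy as np
    import onnx
    import onnx_graphsurgeon as gs

    import qdq_utils  # assumed import path for this module

    graph = gs.import_onnx(onnx.load("model.onnx"))

    # Collect the constant (weight) tensors and assign each one a scale factor.
    weight_map = {
        name: t for name, t in graph.tensors().items() if isinstance(t, gs.Constant)
    }
    scales = {name: np.array(0.01, dtype=np.float32) for name in weight_map}

    qdq_utils.insert_qdq_nodes(graph, scales, weight_map)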

make_gs_awq_scale(name, scale)

Create a GraphSurgeon scale tensor from the given numpy array.

name is the desired basename of the tensor.

Parameters:
  • name (str) –

  • scale (ndarray) –

Return type:

Constant

make_gs_dequantize_node(name, inputs, outputs, attributes=None)

Create a GraphSurgeon Dequantize node.

name is the desired basename of the node.

Parameters:
  • name (str) –

  • inputs (Sequence[Tensor]) –

  • outputs (Sequence[Tensor]) –

  • attributes (Dict[str, Any]) –

Return type:

Node

make_gs_dequantize_output(name, shape, dtype)

Create a GraphSurgeon variable representing the output of a dequantize node.

name is the desired basename of the node.

Parameters:
  • name (str) –

  • shape (Sequence[int]) –

  • dtype (dtype) –

Return type:

Variable

make_gs_pre_quant_scale_node(name, inputs, outputs)

Create a GraphSurgeon pre-quant scale (Mul) node.

name is the desired basename of the node.

Parameters:
  • name (str) –

  • inputs (Sequence[Tensor]) –

  • outputs (Sequence[Tensor]) –

Return type:

Node

make_gs_pre_quant_scale_output(name, shape, dtype)

Create a GraphSurgeon variable representing the output of a pre-quant scale node.

name is the desired basename of the node.

Parameters:
  • name (str) –

  • shape (Sequence[int]) –

  • dtype (dtype) –

Return type:

Variable

make_gs_quantize_node(name, inputs, outputs)

Create a GraphSurgeon Quantize node.

name is the desired basename of the node.

Parameters:
  • name (str) –

  • inputs (Sequence[Tensor]) –

  • outputs (Sequence[Tensor]) –

Return type:

Node

make_gs_quantize_output(name, shape, dtype)

Create a GraphSurgeon variable representing the output of a quantize node.

name is the desired basename of the node.

Parameters:
  • name (str) –

  • shape (Sequence[int]) –

  • dtype (onnx.TensorProto.DataType) –

Return type:

Variable

make_gs_quantized_weight(name, wq, dtype)

Create a GraphSurgeon tensor from a quantized weight tensor.

name is the desired basename of the tensor.

Parameters:
  • name (str) –

  • wq (ndarray) –

  • dtype –

Return type:

Constant

make_gs_scale(name, scale)

Create a GraphSurgeon scale tensor from the given numpy array.

name is the desired basename of the tensor.

Parameters:
  • name (str) –

  • scale (ndarray) –

Return type:

Constant

make_gs_zp(name, shape, dtype)

Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.

name is the desired basename of the tensor.

Parameters:
  • name (str) –

  • shape (Sequence[int]) –

  • dtype –

Return type:

Constant
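
The make_gs_* helpers above compose into a hand-built weight DQ subgraph. The sketch below assumes the qdq_utils import path; the tensor names, shapes, and the choice of ONNX enum versus NumPy dtype per helper follow the signatures listed here and are otherwise assumptions:

    import numpy as np
    import onnx

    import qdq_utils  # assumed import path for this module

    wq = np.zeros((128, 64), dtype=np.int8)  # illustrative quantized weight

    # Constants feeding the DQ node (ONNX tensor-type enum assumed for dtype).
    weight = qdq_utils.make_gs_quantized_weight("fc1.weight", wq, onnx.TensorProto.INT8)
    scale = qdq_utils.make_gs_scale("fc1.weight", np.array(0.02, dtype=np.float32))
    zp = qdq_utils.make_gs_zp("fc1.weight", (1,), onnx.TensorProto.INT8)

    # Output variable (NumPy dtype per the signature above) and the DQ node.
    out = qdq_utils.make_gs_dequantize_output(
        "fc1.weight", list(wq.shape), np.dtype(np.float32)
    )
    dq = qdq_utils.make_gs_dequantize_node(
        "fc1.weight", inputs=[weight, scale, zp], outputs=[out]
    )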

qdq_to_dq(onnx_model, verbose=False)

Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.

Q nodes are removed from the weights, so the output model keeps only DQ nodes that consume the converted INT8/FP8 weights. Dangling Q nodes are also fused into their consumers, updating the consumers' weights.

Parameters:
  • onnx_model (ModelProto) – ONNX model protobuf.

  • verbose (bool) –

Returns:

ONNX model protobuf with only DQ nodes for weights and QDQ nodes for activations.

Return type:

ModelProto
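
A minimal usage sketch, with illustrative file names and an assumed import path:

    import onnx

    import qdq_utils  # assumed import path for this module

    model = onnx.load("model_qdq.onnx")

    # Fold the weight-side Q nodes so that only DQ nodes remain on weights,
    # while activations keep their QDQ pairs.
    model_dq = qdq_utils.qdq_to_dq(model, verbose=True)
    onnx.save(model_dq, "model_dq.onnx")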

replace_fp4qdq_with_2dq(graph, node, initializer_indices, value_info_map, graph_inputs, w_f4, sw_f32_per_tensor, sw_f8_per_block, precision_dtype, block_size)

Replaces the given node in the ONNX graph with a subgraph consisting of two DequantizeLinear nodes.

Parameters:
  • graph (GraphProto) – The ONNX graph containing the node to replace.

  • node (NodeProto) – The node to be replaced.

  • initializer_indices (Dict[str, int]) – A dictionary mapping initializer names to their indices in the graph.

  • value_info_map (Dict[str, ValueInfoProto]) – A dictionary mapping value info names to their ValueInfoProto objects.

  • graph_inputs (Set[str]) – A set of graph input names.

  • w_f4 (ndarray) – NumPy array for w_f4.

  • sw_f32_per_tensor (ndarray) – NumPy array for sw_f32_per_tensor.

  • sw_f8_per_block (ndarray) – NumPy array for sw_f8_per_block.

  • precision_dtype (str) – The precision of the weights.

  • block_size (int) – Block size used in block quantization.
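
The three lookup arguments follow directly from the parameter descriptions above. A sketch of building them from an onnx.GraphProto (the model path is illustrative; the weight and scale arrays are produced by the FP4 conversion flow and omitted here):

    import onnx

    graph = onnx.load("model.onnx").graph

    # Lookup structures expected by replace_fp4qdq_with_2dq, built once per graph.
    initializer_indices = {init.name: idx for idx, init in enumerate(graph.initializer)}
    value_info_map = {vi.name: vi for vi in graph.value_info}
    graph_inputs = {inp.name for inp in graph.input}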

replace_scale_values(graph, act_scales_dict)

Replaces the scale values with those from the calibration cache.

Parameters:
  • graph (GraphProto) –

  • act_scales_dict (Dict[str, float]) –
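
A minimal sketch, assuming the qdq_utils import path; the cache entries are illustrative:

    import onnx

    import qdq_utils  # assumed import path for this module

    model = onnx.load("model_qdq.onnx")

    # Activation scales read from a calibration cache, keyed by tensor name.
    act_scales_dict = {"fc1_input": 0.0123, "fc2_input": 0.0456}

    qdq_utils.replace_scale_values(model.graph, act_scales_dict)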

use_trt_qdq_ops()

Globally set Q/DQ node types to TRT custom op names.
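
Presumably this is called before inserting Q/DQ nodes so that subsequently created nodes use the TRT custom ops; a sketch reusing graph, scales, and weight_map from the insert_qdq_nodes sketch above:

    import qdq_utils  # assumed import path for this module

    # Switch to TRT custom Q/DQ op names, then insert nodes as usual.
    qdq_utils.use_trt_qdq_ops()
    qdq_utils.insert_qdq_nodes(graph, scales, weight_map)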