qdq_utils

Various utils to support inserting Q/DQ nodes.

Functions

- fp4qdq_to_2dq: Convert FP32/FP16 weights of the given ONNX model to FP4 weights and scaling factors.
- insert_dq_nodes: Insert new initializers and DQ nodes into graph.
- insert_pre_quant_scale_nodes: Insert new mul nodes into graph.
- insert_qdq_nodes: Insert scales and QDQ nodes into graph.
- make_gs_awq_scale: Create a GraphSurgeon scale tensor from the given numpy array.
- make_gs_dequantize_node: Create a GraphSurgeon Dequantize node.
- make_gs_dequantize_output: Create a GraphSurgeon variable representing the output of a dequantize node.
- make_gs_pre_quant_scale_node: Create a GraphSurgeon pre-quant scale (Mul) node.
- make_gs_pre_quant_scale_output: Create a GraphSurgeon variable representing the output of a pre-quant scale node.
- make_gs_quantize_node: Create a GraphSurgeon Quantize node.
- make_gs_quantize_output: Create a GraphSurgeon variable representing the output of a quantize node.
- make_gs_quantized_weight: Create a GraphSurgeon tensor from a quantized weight tensor.
- make_gs_scale: Create a GraphSurgeon scale tensor from the given numpy array.
- make_gs_zp: Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.
- qdq_to_dq: Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.
- replace_fp4qdq_with_2dq: Replaces the given node in the ONNX graph with a subgraph consisting of two DequantizeLinear nodes.
- replace_scale_values: Replaces the scale values with those from the calibration cache.
- use_trt_qdq_ops: Globally set node names to TRT custom names.
- fp4qdq_to_2dq(onnx_model)
Convert FP32/FP16 weights of the given ONNX model to FP4 weights and scaling factors.
TRT_FP4QDQ nodes on weights are removed and replaced in the output model by two DQ nodes carrying the converted FP4 weights and scaling factors.
- Parameters:
onnx_model (ModelProto) – ONNX model protobuf.
- Returns:
ONNX model protobuf with DQ nodes for weights and DynQ + DQ nodes for activations.
- Return type:
ModelProto
- insert_dq_nodes(graph, scales, quantized_weights, attributes=None, zero_points=None)
Insert new initializers and DQ nodes into graph.
- Parameters:
graph (Graph) – The graph to modify.
scales (Dict[str, ndarray]) – A map from ONNX initializer name to the desired scale factor for that initializer.
quantized_weights (Dict[str, ndarray]) – A map from ONNX initializer name to the pre-quantized weight tensor.
attributes (Dict[str, Any]) – Optional attributes to set on the inserted DQ nodes.
zero_points (Dict[str, ndarray] | None) – Optional map from ONNX initializer name to zero-point tensor.
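Per the ONNX operator definition, each inserted DequantizeLinear node represents the computation `y = (x - zero_point) * scale`. A pure-Python sketch of that arithmetic (the `dequantize` helper is illustrative and not part of qdq_utils, which instead rewires graph nodes):

```python
# Sketch of the computation a DequantizeLinear node represents,
# per the ONNX operator spec: y = (x - zero_point) * scale.
# The helper name and values here are illustrative only.

def dequantize(quantized, scale, zero_point=0):
    """Recover an approximate real value from a quantized one."""
    return (quantized - zero_point) * scale

# Per-tensor example: an INT8 weight value of 64 with scale 0.05.
print(dequantize(64, 0.05))               # 3.2
# Asymmetric example with a nonzero zero point.
print(dequantize(10, 0.5, zero_point=2))  # 4.0
```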
- insert_pre_quant_scale_nodes(graph, input_tensors, pre_quant_scale)
Insert new mul nodes into graph.
- Parameters:
graph (Graph) – The graph to modify.
input_tensors (Dict[str, str]) – A dictionary of weight tensor names mapped to corresponding input tensor names.
pre_quant_scale (Dict[str, ndarray]) – A map from ONNX input tensor name to corresponding pre-quant scale.
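Conceptually, each inserted Mul node scales the activation elementwise before it reaches the quantization nodes, as in AWQ-style pre-quant scaling (cf. make_gs_awq_scale below). A minimal pure-Python sketch, with illustrative names; the real function rewires GraphSurgeon tensors rather than computing values:

```python
# Sketch of what an inserted pre-quant-scale Mul node computes:
# the activation is multiplied elementwise by a per-channel scale
# before quantization. Names and values are illustrative only.

def apply_pre_quant_scale(activation, pre_quant_scale):
    """Elementwise multiply of an activation by its per-channel scales."""
    return [a * s for a, s in zip(activation, pre_quant_scale)]

x = [1.0, 2.0, 4.0]
s = [0.5, 0.5, 0.25]
print(apply_pre_quant_scale(x, s))  # [0.5, 1.0, 1.0]
```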
- insert_qdq_nodes(graph, scales, weight_map)
Insert scales and QDQ nodes into graph.
- Parameters:
graph (Graph) – The graph to modify.
scales (Dict[str, ndarray]) – A map from ONNX initializer name to desired scale factor for that initializer.
weight_map (Dict[str, Tensor]) – A map from ONNX initializer name to graphsurgeon tensor.
- make_gs_awq_scale(name, scale)
Create a GraphSurgeon scale tensor from the given numpy array.
name is the desired _basename_ of the tensor.
- Parameters:
name (str) –
scale (ndarray) –
- Return type:
Constant
- make_gs_dequantize_node(name, inputs, outputs, attributes=None)
Create a GraphSurgeon Dequantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str) –
inputs (Sequence[Tensor]) –
outputs (Sequence[Tensor]) –
attributes (Dict[str, Any]) –
- Return type:
Node
- make_gs_dequantize_output(name, shape, dtype)
Create a GraphSurgeon variable representing the output of a dequantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str) –
shape (Sequence[int]) –
dtype (dtype) –
- Return type:
Variable
- make_gs_pre_quant_scale_node(name, inputs, outputs)
Create a GraphSurgeon pre-quant scale (Mul) node.
name is the desired _basename_ of the node.
- Parameters:
name (str) –
inputs (Sequence[Tensor]) –
outputs (Sequence[Tensor]) –
- Return type:
Node
- make_gs_pre_quant_scale_output(name, shape, dtype)
Create a GraphSurgeon variable representing the output of a pre-quant scale node.
name is the desired _basename_ of the node.
- Parameters:
name (str) –
shape (Sequence[int]) –
dtype (dtype) –
- Return type:
Variable
- make_gs_quantize_node(name, inputs, outputs)
Create a GraphSurgeon Quantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str) –
inputs (Sequence[Tensor]) –
outputs (Sequence[Tensor]) –
- Return type:
Node
- make_gs_quantize_output(name, shape, dtype)
Create a GraphSurgeon variable representing the output of a quantize node.
name is the desired _basename_ of the node.
- Parameters:
name (str) –
shape (Sequence[int]) –
dtype (onnx.TensorProto.DataType) –
- Return type:
Variable
- make_gs_quantized_weight(name, wq, dtype)
Create a GraphSurgeon tensor from a quantized weight tensor.
name is the desired _basename_ of the tensor.
- Parameters:
name (str) –
wq (ndarray) –
dtype – Data type for the quantized weight tensor.
- Return type:
Constant
- make_gs_scale(name, scale)
Create a GraphSurgeon scale tensor from the given numpy array.
name is the desired _basename_ of the tensor.
- Parameters:
name (str) –
scale (ndarray) –
- Return type:
Constant
- make_gs_zp(name, shape, dtype)
Create a GraphSurgeon zero-point tensor of all zeroes with the given shape.
name is the desired _basename_ of the tensor.
- Parameters:
name (str) –
shape (Sequence[int]) –
dtype – Data type of the zero-point tensor.
- Return type:
Constant
- qdq_to_dq(onnx_model, verbose=False)
Convert FP32/FP16 weights of the given ONNX model to INT8/FP8 weights.
Q nodes on weights are removed, leaving only DQ nodes with the converted INT8/FP8 weights in the output model. Dangling Q nodes are also fused, updating their consumers' weights.
- Parameters:
onnx_model (ModelProto) – ONNX model protobuf.
verbose (bool) –
- Returns:
ONNX model protobuf with only DQ nodes for weights and QDQ nodes for activations.
- Return type:
ModelProto
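The weight transformation behind this pass can be sketched in plain Python: the weight is quantized once offline (folding the Q node away), and only a DQ node remains in the graph. A minimal sketch assuming symmetric INT8 quantization; the helper names are hypothetical:

```python
# Sketch of folding a Q node into a weight (qdq_to_dq idea):
# quantize the FP32 weight offline, keep only the DQ in the graph.
# Symmetric INT8 is assumed here; helper names are illustrative.

def fold_q_into_weight(w_fp, scale):
    """Quantize an FP32 weight value to INT8 once, offline."""
    q = round(w_fp / scale)
    return max(-128, min(127, q))  # saturate to the INT8 range

def dq(w_int8, scale):
    """The DequantizeLinear computation left in the output model."""
    return w_int8 * scale

scale = 0.05
w_q = fold_q_into_weight(3.19, scale)
print(w_q, dq(w_q, scale))  # 64 3.2
```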
- replace_fp4qdq_with_2dq(graph, node, initializer_indices, value_info_map, graph_inputs, w_f4, sw_f32_per_tensor, sw_f8_per_block, precision_dtype, block_size)
Replaces the given node in the ONNX graph with a subgraph consisting of two DequantizeLinear nodes.
- Parameters:
graph (GraphProto) – The ONNX graph containing the node to replace.
node (NodeProto) – The node to be replaced.
initializer_indices (Dict[str, int]) – A dictionary mapping initializer names to their indices in the graph.
value_info_map (Dict[str, ValueInfoProto]) – A dictionary mapping value info names to their ValueInfoProto objects.
graph_inputs (Set[str]) – A set of graph input names.
w_f4 (ndarray) – NumPy array for w_f4.
sw_f32_per_tensor (ndarray) – NumPy array for sw_f32_per_tensor.
sw_f8_per_block (ndarray) – NumPy array for sw_f8_per_block.
precision_dtype (str) – The precision of the weights.
block_size (int) – Block size used in block quantization.
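The two inserted DequantizeLinear nodes realize a two-level scheme: the per-block scales (sw_f8_per_block) are first dequantized by the per-tensor scale (sw_f32_per_tensor), and each FP4 weight is then dequantized by the scale of its block. A sketch with plain Python floats standing in for the FP4/FP8 encodings:

```python
# Sketch of the two-level dequantization performed by the two
# DequantizeLinear nodes that replace a TRT_FP4QDQ node. Plain
# floats stand in for the FP4/FP8 encodings; values are illustrative.

def dequantize_two_level(w_f4, sw_f8_per_block, sw_f32_per_tensor, block_size):
    # DQ #1: per-block scales -> real-valued block scales.
    block_scales = [s * sw_f32_per_tensor for s in sw_f8_per_block]
    # DQ #2: each weight uses the scale of the block it belongs to.
    return [w * block_scales[i // block_size] for i, w in enumerate(w_f4)]

w_f4 = [1.0, 2.0, 3.0, 4.0]   # stand-ins for FP4 codes
sw_f8_per_block = [2.0, 4.0]  # one scale per block of 2
out = dequantize_two_level(w_f4, sw_f8_per_block,
                           sw_f32_per_tensor=0.5, block_size=2)
print(out)  # [1.0, 2.0, 6.0, 8.0]
```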
- replace_scale_values(graph, act_scales_dict)
Replaces the scale values with those from the calibration cache.
- Parameters:
graph (GraphProto) –
act_scales_dict (Dict[str, float]) –
- use_trt_qdq_ops()
Globally set node names to TRT custom names.