graph_utils
Provides ONNX graph-related utils for QDQ placement.
Functions
| Function | Description |
|---|---|
| add_fp16_fp32_cast | Adds cast_to_fp16 nodes to a layer's inputs and cast_to_fp32 nodes to its outputs. |
| build_non_residual_input_map | Builds a map of non-residual Add input names to Add node names from the given graph. |
| classify_partition_nodes | Classifies partition nodes; nodes with inputs from outside the partition are marked for partial quantization. |
| convert_fp16_io | Converts graph I/O to FP16. |
| expand_node_names_from_patterns | Expands node names from the given patterns. |
| filter_quantizable_kgen_heads | Returns the list of kgen head names that follow a CASK partition. |
| find_fp8_mha_partitions | Matches FP8 MHA: Q -> DQ -> BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> Q -> DQ -> BMM2 -> Q -> DQ. |
| find_mha_partitions | Matches MHA: BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> BMM2. |
| find_nodes_from_matmul_to_exclude | Finds MatMul nodes that meet the GEMV condition, to exclude them from quantization. |
| find_nodes_from_mha_to_exclude | Finds MatMul nodes in MHA patterns to exclude. |
| find_nodes_to_exclude | Finds the node names in the ONNX graph that match the user's exclusion patterns. |
| get_fusible_backbone | Returns the linear backbone node for a given node if it matches the pattern. |
| get_tensor_consumer_nodes | Returns a dictionary mapping tensor names to their consumer node objects. |
| get_tensor_producer_nodes | Returns a dictionary mapping tensor names to their producer node objects. |
| has_const_input | Returns whether the given node has any constant input. |
| has_path_type | Checks if the given node is the start/end of a given forward/backward path type. |
| insert_fp8_mha_casts | Inserts three cast ops around each MatMul in the FP8 MHA partitions. |
| insert_matmul_casts | Inserts three cast nodes for a MatMul's two inputs and output. |
| is_const_input | Returns whether the given tensor is an initializer or produced by const-foldable nodes. |
| print_stat | Collects and prints stats of the quantized model. |
| remove_partial_input_qdq | Modifies the ONNX model by removing QDQ nodes from marked inputs, e.g. non-residual inputs. |
- add_fp16_fp32_cast(onnx_path, custom_ops_to_cast_to_fp16)
Adds cast_to_fp16 nodes to the inputs of a layer and cast_to_fp32 to the outputs.
- build_non_residual_input_map(graph)
Builds a map of non-residual Add input name to the Add node name from the given graph.
This assumes that the Add layer only has 2 inputs.
We refer to a subgraph in which a Convolution node's single output is summed (element-wise) with another non-constant input tensor as a “residual-add” subgraph, because this shape occurs in modern convnets that use residual connections (see the sketch after this entry).
- Parameters:
graph (Graph) – ONNX model graph.
- Returns:
Dictionary of Add node names vs their non-residual input name.
- Return type:
Dict[str, str]
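To make the residual-add rule concrete, here is a minimal sketch of the detection idea using plain onnx protos. The function name, the depth-limited backbone walk, and the Conv-only test are illustrative assumptions, not the library's implementation.

```python
import onnx

def non_residual_inputs(model: onnx.ModelProto) -> dict[str, str]:
    """Map each residual-add node name to its non-residual input name."""
    graph = model.graph
    initializers = {init.name for init in graph.initializer}
    producers = {out: node for node in graph.node for out in node.output}

    def traces_to_conv(tensor: str, depth: int = 3) -> bool:
        # Walk producers backwards a few hops looking for a Conv node.
        node = producers.get(tensor)
        while node is not None and depth > 0:
            if node.op_type == "Conv":
                return True
            node = producers.get(node.input[0]) if node.input else None
            depth -= 1
        return False

    result = {}
    for node in graph.node:
        if node.op_type != "Add" or len(node.input) != 2:
            continue
        a, b = node.input
        if a in initializers or b in initializers:
            continue  # Add with a constant operand is not a residual add
        # The input backed by a Conv chain is the residual branch;
        # the other operand is the non-residual input.
        if traces_to_conv(a) and not traces_to_conv(b):
            result[node.name] = b
        elif traces_to_conv(b) and not traces_to_conv(a):
            result[node.name] = a
    return result
```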
- classify_partition_nodes(partitions)
Partition nodes whose inputs come from outside the partition should be partially quantized; this function classifies the partition nodes accordingly (a sketch of the edge rule follows this entry).
- Parameters:
partitions (List[List[Node]]) – Partitions created by modelopt ptq algo.
- Returns:
A tuple of: the list of non-quantizable nodes, the list of quantizable nodes, and the list of partially-quantizable inputs with non-quantizable input info as (src, dst, input_name).
- Return type:
Tuple[List[Node], List[Node], List[Tuple[Node, Node, str]]]
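The third return value can be read as: any edge whose producer lies outside the node's partition is a partially-quantizable input. The helper below is an illustration with plain onnx protos, not the modelopt implementation, and it omits the quantizable/non-quantizable node split.

```python
import onnx

def partial_inputs(graph: onnx.GraphProto, partitions):
    """Collect (src, dst, input_name) edges crossing into each partition."""
    producers = {out: node for node in graph.node for out in node.output}
    no_quantize_inputs = []
    for partition in partitions:
        members = {id(node) for node in partition}
        for node in partition:
            for tensor in node.input:
                src = producers.get(tensor)
                if src is not None and id(src) not in members:
                    no_quantize_inputs.append((src, node, tensor))
    return no_quantize_inputs
```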
- convert_fp16_io(graph)
Convert graph I/O to FP16.
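A minimal sketch of the conversion with plain onnx, assuming only the FP32 graph inputs and outputs need retyping; the real helper may also adjust value_info and initializers.

```python
import onnx

def convert_io_to_fp16(model: onnx.ModelProto) -> None:
    """Retype every FP32 graph input/output as FP16, in place."""
    for value in list(model.graph.input) + list(model.graph.output):
        tensor_type = value.type.tensor_type
        if tensor_type.elem_type == onnx.TensorProto.FLOAT:
            tensor_type.elem_type = onnx.TensorProto.FLOAT16
```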
- expand_node_names_from_patterns(graph, name_patterns)
Expand the node names from the given patterns.
- Parameters:
graph (GraphProto | Graph) –
name_patterns (List[str]) –
- Return type:
List[str]
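The matching rule is not spelled out here; a plausible sketch treats each pattern as a glob and expands it with fnmatch:

```python
from fnmatch import fnmatch

import onnx

def expand_patterns(graph: onnx.GraphProto, patterns: list[str]) -> list[str]:
    """Return the names of all nodes matching any glob-style pattern."""
    return [
        node.name
        for node in graph.node
        if any(fnmatch(node.name, pattern) for pattern in patterns)
    ]
```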
- filter_quantizable_kgen_heads(cask_fusible_partitions, kgen_partitions, quantizable_op_types)
Returns the list of kgen head names if it follows a CASK partition.
- Parameters:
cask_fusible_partitions (List[List[Node]]) –
kgen_partitions (List[List[Node]]) –
quantizable_op_types (List[str]) –
- Return type:
Tuple[List[Node], List[Tuple[Node, Node, str]]]
- find_fp8_mha_partitions(graph)
Match FP8 MHA: Q -> DQ -> BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> Q -> DQ -> BMM2 -> Q -> DQ.
- find_mha_partitions(graph)
Match MHA: BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> BMM2.
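To illustrate the pattern above, here is a rough matcher for the BMM1 head of the chain, where the parenthesized ops are optional. It assumes a consumer map like the one returned by get_tensor_consumer_nodes, and it is a simplification: the real partition finder also collects the matched nodes.

```python
import onnx

def looks_like_mha_bmm1(node: onnx.NodeProto, consumers) -> bool:
    """True if `node` heads BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> BMM2."""
    if node.op_type != "MatMul":
        return False
    optional = {"Mul", "Div", "Add", "Cast"}  # the parenthesized, skippable ops
    required = ["Softmax", "MatMul"]          # must appear in this order
    current, idx = node, 0
    while idx < len(required):
        nexts = consumers.get(current.output[0], [])
        if len(nexts) != 1:  # the chain must not branch
            return False
        current = nexts[0]
        if current.op_type == required[idx]:
            idx += 1
        elif current.op_type not in optional:
            return False
    return True
```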
- find_nodes_from_matmul_to_exclude(onnx_path, use_external_data_format=False, intermediate_generated_files=None, calibration_shapes=None, calibration_eps=['cuda:0', 'cpu', 'trt'], verbose=False)
Find MatMul nodes that meet the GEMV condition, to exclude them from quantization.
If either m or n of a MatMul is 1, the MatMul cannot utilize Tensor Cores, and adding Q/DQ layers to it performs poorly in TRT. Thus, in this case, Q/DQ layers are not added to the MatMul (a sketch of the check follows this entry).
- Parameters:
onnx_path (str) – Path to the onnx model.
use_external_data_format (bool) – If True, the model's weights are stored in a separate external data file.
intermediate_generated_files (List[str]) – List of intermediate generated files that will be deleted after quantization.
calibration_shapes (str) – Model input shapes for inference.
calibration_eps (List[str]) – Priority order for the execution providers (EP) to calibrate the model. Any subset of [‘cuda:x’, ‘cpu’, ‘trt’], where ‘x’ is the device id.
verbose (bool) – If True, print the matmul nodes to exclude.
- Returns:
List of node names to exclude from quantization.
- Return type:
List[str]
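A sketch of the GEMV test described above, with shapes obtained via onnx shape inference (an assumption about how the real function derives m and n):

```python
import onnx

def gemv_matmuls(model: onnx.ModelProto) -> list[str]:
    """Names of MatMul nodes where m or n is statically 1."""
    inferred = onnx.shape_inference.infer_shapes(model)
    graph = inferred.graph
    shapes = {}
    for vi in list(graph.value_info) + list(graph.input) + list(graph.output):
        shapes[vi.name] = [d.dim_value for d in vi.type.tensor_type.shape.dim]

    excluded = []
    for node in graph.node:
        if node.op_type != "MatMul":
            continue
        a = shapes.get(node.input[0])
        b = shapes.get(node.input[1])
        if not a or not b or len(a) < 2 or len(b) < 2:
            continue  # shape unknown: cannot decide statically
        m, n = a[-2], b[-1]  # MatMul computes (..., m, k) x (..., k, n)
        if m == 1 or n == 1:
            excluded.append(node.name)
    return excluded
```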
- find_nodes_from_mha_to_exclude(onnx_path, use_external_data_format=False, nodes_to_exclude=None, disable_mha_qdq=False, quantize_mode='int8', high_precision_dtype=None, intermediate_generated_files=None, calibration_shapes=None, calibration_eps=['cuda:0', 'cpu', 'trt'], verbose=False)
Find MatMul nodes in MHA pattern to exclude.
If disable_mha_qdq is set, Q/DQ layers are not added to the MatMuls in the MHA pattern. Otherwise, when quantize_mode == “fp8” and high_precision_dtype == “fp16”, Q/DQ layers are not added if head_size is not a multiple of 16 or the MHA pattern contains a mask Add. Otherwise, when quantize_mode == “int8”, Q/DQ layers are not added if seq_len > 512. (See the predicate sketch after this entry.)
- Parameters:
onnx_path (str) – Path to the onnx model.
use_external_data_format (bool) – If True, the model's weights are stored in a separate external data file.
nodes_to_exclude (List[str]) – List of Nodes to exclude from quantization.
disable_mha_qdq (bool) – If True, every MHA's BMM1 and BMM2 will be added to nodes_to_exclude. Otherwise, each MHA is checked to decide whether to enable QDQ (when is_fp8fp16 is True).
quantize_mode (str) – Quantization mode. One of ‘int8’ (default), ‘int4’ and ‘fp8’.
high_precision_dtype (str) – High precision data type, one of [‘fp32’, ‘fp16’]. If high_precision_dtype == ‘fp16’, model’s weight and activation will be converted to fp16.
intermediate_generated_files (List[str]) – List of intermediate generated files that will be deleted after quantization.
calibration_shapes (str) – Model input shapes for inference.
calibration_eps (List[str]) – Priority list of execution providers (EP) for calibration.
verbose (bool) – If True, print the matmul nodes to exclude.
- Returns:
List of node names to exclude from quantization.
- Return type:
List[str]
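The rules above condense into a small predicate. In this sketch, head_size, seq_len, and has_mask_add are stand-ins for values the real code derives from the matched MHA pattern:

```python
def should_exclude_mha(
    disable_mha_qdq: bool,
    quantize_mode: str,
    high_precision_dtype: str,
    head_size: int,
    seq_len: int,
    has_mask_add: bool,
) -> bool:
    """Decide whether an MHA's MatMuls should be excluded from Q/DQ insertion."""
    if disable_mha_qdq:
        return True
    if quantize_mode == "fp8" and high_precision_dtype == "fp16":
        return head_size % 16 != 0 or has_mask_add
    if quantize_mode == "int8":
        return seq_len > 512
    return False
```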
- find_nodes_to_exclude(graph, nodes_to_exclude, op_types_to_exclude)
Find the node names from the ONNX graph that match the user's exclusion patterns.
- Parameters:
graph (Graph) –
nodes_to_exclude (List[str]) –
op_types_to_exclude (List[str]) –
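A hypothetical call, combining explicit names, wildcard patterns, and whole op types (the names shown are made up):

```python
excluded = find_nodes_to_exclude(
    graph,
    nodes_to_exclude=["final_matmul", "/encoder/layer.0/attention/*"],
    op_types_to_exclude=["Gemm"],
)
```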
- get_fusible_backbone(node, graph)
Returns the linear backbone node for a given node if it matches the pattern.
TensorRT fuses convolution with BN, Relu, etc. when they appear in certain patterns. This rule tries to match some of those patterns. Note: BiasAdd and ConstMul are optional in the path types.
- Parameters:
node (Node) – Start node of the pattern.
graph (Graph) – ONNX model graph.
- Returns:
Backbone node of the given node, None if not found.
- Return type:
Node | None
- get_tensor_consumer_nodes(graph)
Returns a dictionary of tensor name and their consumer node object mapping.
- Parameters:
graph (GraphProto) – ONNX model graph.
- Returns:
Dictionary mapping each tensor name to the list of its consumer nodes.
- Return type:
Dict[str, List[NodeProto]]
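The consumer map is a one-pass scan over node inputs; a minimal sketch:

```python
from collections import defaultdict

import onnx

def tensor_consumers(graph: onnx.GraphProto) -> dict[str, list[onnx.NodeProto]]:
    """Map each tensor name to the nodes that consume it."""
    consumers: dict[str, list[onnx.NodeProto]] = defaultdict(list)
    for node in graph.node:
        for tensor in node.input:
            consumers[tensor].append(node)
    return dict(consumers)
```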
- get_tensor_producer_nodes(graph)
Returns a dictionary of tensor name and their producer node object mapping.
Note: a special Root-type node is created as the producer of external inputs, for ease of implementation.
- Parameters:
graph (GraphProto) – ONNX model graph.
- Returns:
Dictionary mapping each tensor name to its producer node.
- Return type:
Dict[str, NodeProto]
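A matching sketch for the producer map; following the note above, external graph inputs get a placeholder producer (the exact placeholder node is an assumption):

```python
import onnx

def tensor_producers(graph: onnx.GraphProto) -> dict[str, onnx.NodeProto]:
    """Map each tensor name to the node that produces it."""
    root = onnx.helper.make_node("Root", inputs=[], outputs=[], name="root_0")
    producers = {graph_input.name: root for graph_input in graph.input}
    for node in graph.node:
        for tensor in node.output:
            producers[tensor] = node
    return producers
```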
- has_const_input(node)
Returns whether the given node has any constant input.
- Parameters:
node (Node) –
- Return type:
bool
- has_path_type(node, graph, path_type, is_forward, wild_card_types=[], path_nodes=[])
Checks if the given node is start/end of a given forward/backward path type.
Note: a path can be forward or backward with respect to a node, depending on the next-level nodes. Additionally, this method can handle optional nodes and collect the traversed path.
- Parameters:
node (Node) – Start node of the path.
graph (Graph) – ONNX model graph.
path_type (List[str]) – Path types to match from the given node.
is_forward (bool) – Whether to match forward or backward path.
wild_card_types (List[str]) – Wildcard types; nodes of these types are skipped and not matched against the path_type.
path_nodes (List[Node]) – Accumulated nodes in the matched path.
- Returns:
Whether the given node is the start/end of the given forward/backward path type.
- Return type:
bool
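A hypothetical usage sketch: test whether conv_node heads a Conv -> BatchNormalization -> Relu chain, skipping Cast nodes and collecting the matched nodes (conv_node and graph are assumed to exist):

```python
path_nodes = []
matches = has_path_type(
    conv_node,
    graph,
    path_type=["Conv", "BatchNormalization", "Relu"],
    is_forward=True,
    wild_card_types=["Cast"],
    path_nodes=path_nodes,
)
# On a match, path_nodes holds the traversed Conv/BN/Relu nodes.
```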
- insert_fp8_mha_casts(onnx_model)
Insert three cast ops around each MatMul in the FP8 MHA partitions.
The first cast is added before the MatMul's input0 (FP16 to FP32), the second before input1 (FP16 to FP32), and the third after the output (FP32 back to FP16). Inserting Cast ops in the FP8 MHA part effectively forbids the MHAs from running with FP16 accumulation, because the compiler only has FP32-accumulation kernels for FP8 MHAs.
- insert_matmul_casts(graph, matmul_node)
Insert three cast nodes for MatMul’s two inputs and output.
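A sketch of the three-cast insertion around a single MatMul with plain onnx; tensor naming is simplified and node ordering is left to a later topological sort, so this is an illustration rather than the library's routine:

```python
import onnx
from onnx import TensorProto, helper

def cast_matmul_io(graph: onnx.GraphProto, matmul: onnx.NodeProto) -> None:
    """Wrap a MatMul with FP16->FP32 input casts and an FP32->FP16 output cast."""
    new_nodes = []
    for i in range(2):  # cast both inputs up to FP32
        cast_out = f"{matmul.name}_in{i}_fp32"
        new_nodes.append(helper.make_node(
            "Cast", [matmul.input[i]], [cast_out],
            name=f"{matmul.name}_cast_in{i}", to=TensorProto.FLOAT))
        matmul.input[i] = cast_out
    # Cast the output back down, keeping the original tensor name
    # so downstream consumers stay wired to FP16.
    raw_out = f"{matmul.name}_out_fp32"
    new_nodes.append(helper.make_node(
        "Cast", [raw_out], [matmul.output[0]],
        name=f"{matmul.name}_cast_out", to=TensorProto.FLOAT16))
    matmul.output[0] = raw_out
    graph.node.extend(new_nodes)
```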
- is_const_input(tensor)
Returns whether the given tensor is an initializer or produced by const-foldable nodes.
- Parameters:
tensor (Tensor) –
- Return type:
bool
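The "const-foldable" test can be read recursively: a tensor is constant if it is an initializer, or if its producer is a foldable op whose inputs are all constant. The foldable-op set below is an assumption for illustration:

```python
# Ops whose outputs are constant whenever all of their inputs are (assumed set).
FOLDABLE_OPS = {"Constant", "Identity", "Reshape", "Squeeze", "Unsqueeze", "Transpose"}

def is_const_tensor(name: str, initializers: set, producers: dict) -> bool:
    """True if `name` is an initializer or produced only by foldable ops."""
    if name in initializers:
        return True
    node = producers.get(name)
    if node is None or node.op_type not in FOLDABLE_OPS:
        return False
    return all(is_const_tensor(i, initializers, producers) for i in node.input)
```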
- print_stat(graph, verbose)
Collect and print stats of the quantized model.
- Parameters:
graph (Graph) –
verbose (bool) –
- Return type:
None
- remove_partial_input_qdq(graph, no_quantize_inputs)
Modifies the ONNX model by removing QDQ nodes from the marked inputs, e.g. non-residual inputs.
- Parameters:
graph (Graph) – ONNX model graph.
no_quantize_inputs (List[Tuple[Node, Node, str]]) – List of non-quantizable input info as (src, dst, input_name).
- Return type:
None