graph_utils

Provides ONNX graph-related utils for QDQ placement.

Functions

add_fp16_fp32_cast

Adds cast_to_fp16 nodes to the inputs of a layer and cast_to_fp32 to the outputs.

build_non_residual_input_map

Builds a map of non-residual Add input name to the Add node name from the given graph.

classify_partition_nodes

Classifies partition nodes; nodes with inputs from outside the partition should be partially quantized.

convert_fp16_io

Convert graph I/O to FP16.

expand_node_names_from_patterns

Expand the node names from the given patterns.

filter_quantizable_kgen_heads

Returns the list of kgen heads that follow a CASK partition.

find_fp8_mha_partitions

Match FP8 MHA: Q -> DQ -> BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> Q -> DQ -> BMM2 -> Q -> DQ.

find_mha_partitions

Match MHA: BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> BMM2.

find_nodes_from_matmul_to_exclude

Find MatMul nodes that meet the GEMV condition to exclude.

find_nodes_from_mha_to_exclude

Find MatMul nodes in MHA pattern to exclude.

find_nodes_to_exclude

Find the node names from the ONNX graph that match the user's exclusion patterns.

get_fusible_backbone

Returns the linear backbone node for a given node if it matches the pattern.

get_tensor_consumer_nodes

Returns a dictionary mapping tensor names to their consumer node objects.

get_tensor_producer_nodes

Returns a dictionary mapping tensor names to their producer node objects.

has_const_input

Returns whether the given node has any constant input.

has_path_type

Checks if the given node is the start/end of a given forward/backward path type.

insert_fp8_mha_casts

Insert three cast ops.

insert_matmul_casts

Insert three cast nodes for MatMul's two inputs and output.

is_const_input

Returns whether the given tensor is an initializer or produced by const-foldable nodes.

print_stat

Collect and print stats of the quantized model.

remove_partial_input_qdq

Modifies the onnx model by removing QDQ nodes from the marked inputs, e.g. non-residual inputs.

add_fp16_fp32_cast(onnx_path, custom_ops_to_cast_to_fp16)

Adds cast_to_fp16 nodes to the inputs of a layer and cast_to_fp32 to the outputs.

build_non_residual_input_map(graph)

Builds a map of non-residual Add input name to the Add node name from the given graph.

This assumes that the Add layer only has 2 inputs.

We refer to a subgraph in which a Convolution node's single output is summed (element-wise) with another non-constant input tensor as a "residual-add" subgraph, because this occurs in modern convnets that use residual connections.

Parameters:

graph (Graph) – ONNX model graph.

Returns:

Dictionary mapping Add node names to their non-residual input names.

Return type:

Dict[str, str]
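
A minimal usage sketch, assuming this module is importable as modelopt.onnx.quantization.graph_utils, that graphs are onnx_graphsurgeon graphs (as the Graph type in the signatures suggests), and with an illustrative model path:

    import onnx
    import onnx_graphsurgeon as gs

    from modelopt.onnx.quantization.graph_utils import build_non_residual_input_map

    # Load a convnet with residual connections and map each Add node
    # to the name of its non-residual input tensor.
    graph = gs.import_onnx(onnx.load("resnet50.onnx"))  # illustrative path
    non_residual_inputs = build_non_residual_input_map(graph)
    for add_name, input_name in non_residual_inputs.items():
        print(f"{add_name}: non-residual input is '{input_name}'")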

classify_partition_nodes(partitions)

Classifies partition nodes; nodes with inputs from outside the partition should be partially quantized.

Parameters:

partitions (List[List[Node]]) – Partitions created by the modelopt PTQ algorithm.

Returns:

A tuple of three lists: non-quantizable nodes, quantizable nodes, and partially-quantizable inputs with non-quantizable input info as (src, dst, input_name) triples.

Return type:

Tuple[List[Node], List[Node], List[Tuple[Node, Node, str]]]
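
A hypothetical follow-up, assuming partitions were produced by the modelopt PTQ algorithm and graph is the corresponding ONNX graph:

    # Classify the partition nodes into the three groups described above.
    non_quantizable, quantizable, no_quantize_inputs = classify_partition_nodes(partitions)
    print(f"{len(quantizable)} quantizable / {len(non_quantizable)} non-quantizable nodes")
    # After Q/DQ insertion, the (src, dst, input_name) triples can be handed to
    # remove_partial_input_qdq(graph, no_quantize_inputs) to strip Q/DQ from
    # the inputs that must stay unquantized.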

convert_fp16_io(graph)

Convert graph I/O to FP16.
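
A minimal sketch of what this conversion can look like with onnx_graphsurgeon (illustrative, not necessarily the actual implementation):

    import numpy as np
    import onnx_graphsurgeon as gs

    def fp16_io_sketch(graph: gs.Graph) -> None:
        # Retype every FP32 graph input and output as FP16.
        for tensor in list(graph.inputs) + list(graph.outputs):
            if tensor.dtype == np.float32:
                tensor.dtype = np.float16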

expand_node_names_from_patterns(graph, name_patterns)

Expand the node names from the given patterns.

Parameters:
  • graph (GraphProto | Graph) –

  • name_patterns (List[str]) –

Return type:

List[str]
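
An illustrative sketch of name-pattern expansion, assuming shell-style wildcards (the library's matching scheme may differ):

    from fnmatch import fnmatch

    def expand_patterns(node_names, name_patterns):
        # Keep every node name that matches at least one pattern.
        return [n for n in node_names if any(fnmatch(n, p) for p in name_patterns)]

    names = ["/encoder/attn/MatMul", "/encoder/ffn/MatMul", "/head/Gemm"]
    print(expand_patterns(names, ["/encoder/*"]))  # both encoder MatMuls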

filter_quantizable_kgen_heads(cask_fusible_partitions, kgen_partitions, quantizable_op_types)

Returns the list of kgen heads that follow a CASK partition.

Parameters:
  • cask_fusible_partitions (List[List[Node]]) –

  • kgen_partitions (List[List[Node]]) –

  • quantizable_op_types (List[str]) –

Return type:

Tuple[List[Node], List[Tuple[Node, Node, str]]]

find_fp8_mha_partitions(graph)

Match FP8 MHA: Q -> DQ -> BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> Q -> DQ -> BMM2 -> Q -> DQ.

find_mha_partitions(graph)

Match MHA: BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> BMM2.

find_nodes_from_matmul_to_exclude(onnx_path, use_external_data_format=False, intermediate_generated_files=None, calibration_shapes=None, calibration_eps=['cuda:0', 'cpu', 'trt'], verbose=False)

Find MatMul nodes that meet the GEMV condition to exclude.

If either M or N of a MatMul is 1, the MatMul cannot utilize Tensor Cores, and the performance of adding Q/DQ layers around it is poor in TRT. Thus, in this case, Q/DQ layers are not added to the MatMul (see the sketch after this entry).

Parameters:
  • onnx_path (str) – Path to the onnx model.

  • use_external_data_format (bool) – If True, a separate file is used to store the weights of the quantized model.

  • intermediate_generated_files (List[str]) – List of intermediate generated files that will be deleted after quantization.

  • calibration_shapes (str) – Model input shapes for inference.

  • calibration_eps (List[str]) – Priority order for the execution providers (EP) to calibrate the model. Any subset of [‘cuda:x’, ‘cpu’, ‘trt’], where ‘x’ is the device id.

  • verbose (bool) – If True, print the matmul nodes to exclude.

Returns:

List of Nodes to exclude from quantization.

Return type:

List[str]
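
An illustrative restatement of the GEMV condition, as a minimal sketch (not the library's exact implementation):

    def is_gemv(m: int, n: int) -> bool:
        # A MatMul degenerates to a matrix-vector product when either
        # output dimension is 1; such MatMuls cannot use Tensor Cores.
        return m == 1 or n == 1

    assert is_gemv(1, 4096)        # excluded from quantization
    assert not is_gemv(128, 4096)  # regular GEMM, can be quantized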

find_nodes_from_mha_to_exclude(onnx_path, use_external_data_format=False, nodes_to_exclude=None, disable_mha_qdq=False, quantize_mode='int8', high_precision_dtype=None, intermediate_generated_files=None, calibration_shapes=None, calibration_eps=['cuda:0', 'cpu', 'trt'], verbose=False)

Find MatMul nodes in MHA pattern to exclude.

Whether Q/DQ layers are added to the MatMuls in an MHA pattern is decided as follows (see the sketch after this entry):

  • If disable_mha_qdq is set, don't add Q/DQ layers to MatMuls in the MHA pattern.

  • Else, if quantize_mode == "fp8" and high_precision_dtype == "fp16", don't add Q/DQ layers if head_size is not a multiple of 16 or there is a mask Add in the MHA pattern.

  • Else, if quantize_mode == "int8", don't add Q/DQ layers if seq_len > 512.

Parameters:
  • onnx_path (str) – Path to the onnx model.

  • use_external_data_format (bool) – If True, a separate file is used to store the weights of the quantized model.

  • nodes_to_exclude (List[str]) – List of Nodes to exclude from quantization.

  • disable_mha_qdq (bool) – If True, all MHAs' BMM1 and BMM2 are added to nodes_to_exclude. Otherwise, each MHA is checked against the conditions above to decide whether to enable QDQ.

  • quantize_mode (str) – Quantization mode. One of ‘int8’ (default), ‘int4’ and ‘fp8’.

  • high_precision_dtype (str) – High precision data type, one of ['fp32', 'fp16']. If high_precision_dtype == 'fp16', the model's weights and activations are converted to fp16.

  • intermediate_generated_files (List[str]) – List of intermediate generated files that will be deleted after quantization.

  • calibration_shapes (str) – Model input shapes for inference.

  • calibration_eps (List[str]) – Priority list of execution providers (EP) for calibration.

  • verbose (bool) – If True, print the matmul nodes to exclude.

Returns:

List of Nodes to exclude from quantization.

Return type:

List[str]
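
The branching above can be restated as a small illustrative sketch (not the library's code; head_size, seq_len, and the mask-Add flag would come from inspecting the matched MHA partition):

    from typing import Optional

    def should_exclude_mha(
        disable_mha_qdq: bool,
        quantize_mode: str,
        high_precision_dtype: Optional[str],
        head_size: int,
        seq_len: int,
        has_mask_add: bool,
    ) -> bool:
        # Mirror of the rules listed above, for illustration only.
        if disable_mha_qdq:
            return True
        if quantize_mode == "fp8" and high_precision_dtype == "fp16":
            return head_size % 16 != 0 or has_mask_add
        if quantize_mode == "int8":
            return seq_len > 512
        return False

    assert should_exclude_mha(False, "fp8", "fp16", 40, 256, False)    # head_size % 16 != 0
    assert not should_exclude_mha(False, "int8", None, 64, 256, False)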

find_nodes_to_exclude(graph, nodes_to_exclude, op_types_to_exclude)

Find the node names from the ONNX graph that match the user's exclusion patterns.

Parameters:
  • graph (Graph) –

  • nodes_to_exclude (List[str]) –

  • op_types_to_exclude (List[str]) –
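
A hypothetical call, assuming graph is already loaded and that name patterns support wildcards (the pattern syntax shown is illustrative):

    excluded = find_nodes_to_exclude(
        graph,
        nodes_to_exclude=["/lm_head/*"],
        op_types_to_exclude=["Gather", "Resize"],
    )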

get_fusible_backbone(node, graph)

Returns the linear backbone node for a given node if it matches the pattern.

TensorRT fuses convolution with BN, Relu, etc. when they appear in certain patterns. This rule tries to match some of those patterns. Note: BiasAdd and ConstMul are optional in the path types.

Parameters:
  • node (Node) – Start node of the pattern.

  • graph (Graph) – ONNX model graph.

Returns:

Backbone node of the given node, None if not found.

Return type:

Node | None
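
A hypothetical usage sketch, with relu_node and graph assumed to come from the surrounding context:

    # Walk from an activation node back to the Conv backbone that
    # TensorRT would fuse it into.
    backbone = get_fusible_backbone(relu_node, graph)
    if backbone is not None:
        print(f"{relu_node.name} fuses into backbone {backbone.name}")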

get_tensor_consumer_nodes(graph)

Returns a dictionary mapping tensor names to their consumer node objects.

Parameters:

graph (GraphProto) – ONNX model graph.

Returns:

Dictionary mapping tensor names to lists of their consumer node objects.

Return type:

Dict[str, List[NodeProto]]

get_tensor_producer_nodes(graph)

Returns a dictionary mapping tensor names to their producer node objects.

Note: a special Root-type node is created as the producer of external inputs, for ease of implementation.

Parameters:

graph (GraphProto) – ONNX model graph.

Returns:

Dictionary mapping tensor names to their producer node objects.

Return type:

Dict[str, NodeProto]
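
Both maps amount to a single pass over the graph. A minimal sketch over an onnx.GraphProto, omitting the special Root producer mentioned above:

    from collections import defaultdict

    import onnx

    def tensor_consumers_sketch(graph: onnx.GraphProto):
        # Map each tensor name to the list of nodes that consume it.
        consumers = defaultdict(list)
        for node in graph.node:
            for name in node.input:
                consumers[name].append(node)
        return dict(consumers)

    def tensor_producers_sketch(graph: onnx.GraphProto):
        # Map each tensor name to the node that produces it.
        return {name: node for node in graph.node for name in node.output}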

has_const_input(node)

Returns whether the given node has any constant input.

Parameters:

node (Node) –

Return type:

bool

has_path_type(node, graph, path_type, is_forward, wild_card_types=[], path_nodes=[])

Checks if the given node is the start/end of a given forward/backward path type.

Note: a path can be forward or backward with respect to a node, depending on the next-level nodes. Additionally, this method can work with optional nodes and collects the traversed path.

Parameters:
  • node (Node) – Start node of the path.

  • graph (Graph) – ONNX model graph.

  • path_type (List[str]) – Path types to match from the given node.

  • is_forward (bool) – Whether to match forward or backward path.

  • wild_card_types (List[str]) – Wildcard types; nodes of these types are skipped and not matched against path_type.

  • path_nodes (List[Node]) – Accumulated nodes in the matched path.

Returns:

Whether the given node is the start/end of the given forward/backward path type.

Return type:

bool
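
A hypothetical call, with node and graph assumed from context:

    # Does `node` start a forward Conv -> BatchNormalization -> Relu path,
    # skipping any Cast nodes along the way?
    path_nodes = []
    matched = has_path_type(
        node,
        graph,
        path_type=["Conv", "BatchNormalization", "Relu"],
        is_forward=True,
        wild_card_types=["Cast"],
        path_nodes=path_nodes,
    )
    if matched:
        print("Matched path:", [n.name for n in path_nodes])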

insert_fp8_mha_casts(onnx_model)

Insert three cast ops.

The first Cast is added before input0 of the MatMul to cast fp16 to fp32, the second before input1 of the MatMul to cast fp16 to fp32, and the third after the output of the MatMul to cast fp32 back to fp16. Inserting these Cast ops in the FP8 MHA part effectively prevents the MHAs from running with FP16 accumulation, because the compiler only has FP32-accumulation kernels for FP8 MHAs.

insert_matmul_casts(graph, matmul_node)

Insert three cast nodes for MatMul’s two inputs and output.
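
The resulting pattern, sketched with onnx.helper and illustrative tensor names (a minimal sketch, not the library's implementation):

    from onnx import TensorProto, helper

    # Upcast both FP16 inputs to FP32, run the MatMul in FP32, downcast back.
    cast_a = helper.make_node("Cast", ["A_fp16"], ["A_fp32"], to=TensorProto.FLOAT)
    cast_b = helper.make_node("Cast", ["B_fp16"], ["B_fp32"], to=TensorProto.FLOAT)
    matmul = helper.make_node("MatMul", ["A_fp32", "B_fp32"], ["C_fp32"])
    cast_c = helper.make_node("Cast", ["C_fp32"], ["C_fp16"], to=TensorProto.FLOAT16)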

is_const_input(tensor)

Returns whether the given tensor is an initializer or produced by const-foldable nodes.

Parameters:

tensor (Tensor) –

Return type:

bool
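
A rough sketch of this check in onnx_graphsurgeon terms (illustrative; the actual implementation likely restricts the recursion to specific const-foldable op types):

    import onnx_graphsurgeon as gs

    def is_const_sketch(tensor: gs.Tensor) -> bool:
        # Constant if the tensor is an initializer (gs.Constant), or if every
        # input of every producer node is itself constant.
        if isinstance(tensor, gs.Constant):
            return True
        producers = tensor.inputs  # nodes producing this Variable
        return bool(producers) and all(
            all(is_const_sketch(t) for t in node.inputs) for node in producers
        )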

print_stat(graph, verbose)

Collect and print stats of the quantized model.

Parameters:
  • graph (Graph) –

  • verbose (bool) –

Return type:

None

remove_partial_input_qdq(graph, no_quantize_inputs)

Modifies the onnx model by removing QDQ nodes from the marked inputs, e.g. non-residual inputs.

Parameters:
  • graph (Graph) – ONNX model graph.

  • no_quantize_inputs (List[Tuple[Node, Node, str]]) – List of non-quantizable input info as (src, dst, input_name) triples.

Return type:

None