graph_utils
Provides ONNX graph-related utils for QDQ placement.
Functions
| Function | Description |
|---|---|
| add_fp16_fp32_cast | Adds cast_to_fp16 nodes to a layer's inputs and cast_to_fp32 nodes to its outputs. |
| build_non_residual_input_map | Builds a map of non-residual Add input names to Add node names from the given graph. |
| classify_partition_nodes | Classifies partition nodes; nodes with inputs from outside the partition are marked for partial quantization. |
| convert_fp16_io | Converts graph I/O to FP16. |
| expand_node_names_from_patterns | Expands node names from the given patterns. |
| filter_quantizable_kgen_heads | Returns the list of kgen head names that follow a CASK partition. |
| find_fp8_mha_partitions | Matches FP8 MHA: Q -> DQ -> BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> Q -> DQ -> BMM2 -> Q -> DQ. |
| find_mha_partitions | Matches MHA: BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> BMM2. |
| find_nodes_from_matmul_to_exclude | Finds MatMul nodes that meet the GEMV condition, to exclude them from quantization. |
| find_nodes_from_mha_to_exclude | Finds MatMul nodes in MHA patterns to exclude. |
| find_nodes_to_exclude | Finds the node names in the ONNX graph that match the user's exclusion patterns. |
| get_fusible_backbone | Returns the linear backbone node for a given node if it matches the pattern. |
| get_tensor_consumer_nodes | Returns a dictionary mapping tensor names to their consumer node objects. |
| get_tensor_producer_nodes | Returns a dictionary mapping tensor names to their producer node objects. |
| has_const_input | Returns whether the given node has any constant input. |
| has_path_type | Checks if the given node is the start/end of a given forward/backward path type. |
| insert_fp8_mha_casts | Inserts three cast ops around each MatMul in the FP8 MHA partitions. |
| insert_matmul_casts | Inserts three cast nodes for a MatMul's two inputs and output. |
| is_const_input | Returns whether the given tensor is an initializer or produced by const-foldable nodes. |
| print_stat | Collects and prints stats of the quantized model. |
| remove_partial_input_qdq | Modifies the ONNX model by removing QDQ nodes from marked inputs, e.g. non-residual inputs. |
- add_fp16_fp32_cast(onnx_path, custom_ops_to_cast_to_fp16)
Adds cast_to_fp16 nodes to the inputs of a layer and cast_to_fp32 to the outputs.
- build_non_residual_input_map(graph)
Builds a map of non-residual Add input name to the Add node name from the given graph.
This assumes that the Add layer only has 2 inputs.
We refer to a subgraph in which a Convolution node's single output is summed (element-wise) with another non-constant input tensor as a “residual-add” subgraph, because this shape occurs in modern convnets that use residual connections (see the sketch after this entry).
- Parameters:
graph (Graph) – ONNX model graph.
- Returns:
Dictionary of Add node names vs their non-residual input name.
- Return type:
Dict[str, str]
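To make the residual-add rule concrete, here is a minimal sketch of the detection idea using plain onnx protos. The function name, the depth-limited backbone walk, and the Conv-only test are illustrative assumptions, not the library's implementation.

```python
import onnx

def non_residual_inputs(model: onnx.ModelProto) -> dict[str, str]:
    """Map each residual-add node name to its non-residual input name."""
    graph = model.graph
    initializers = {init.name for init in graph.initializer}
    producers = {out: node for node in graph.node for out in node.output}

    def traces_to_conv(tensor: str, depth: int = 3) -> bool:
        # Walk producers backwards a few hops looking for a Conv node.
        node = producers.get(tensor)
        while node is not None and depth > 0:
            if node.op_type == "Conv":
                return True
            node = producers.get(node.input[0]) if node.input else None
            depth -= 1
        return False

    result = {}
    for node in graph.node:
        if node.op_type != "Add" or len(node.input) != 2:
            continue
        a, b = node.input
        if a in initializers or b in initializers:
            continue  # Add with a constant operand is not a residual add
        # The input backed by a Conv chain is the residual branch;
        # the other operand is the non-residual input.
        if traces_to_conv(a) and not traces_to_conv(b):
            result[node.name] = b
        elif traces_to_conv(b) and not traces_to_conv(a):
            result[node.name] = a
    return result
```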
- classify_partition_nodes(partitions)
Partition nodes whose inputs come from outside the partition should be partially quantized; this function classifies the partition nodes accordingly (a sketch of the edge rule follows this entry).
- Parameters:
partitions (List[List[Node]]) – Partitions created by modelopt ptq algo.
- Returns:
A tuple of: the list of non-quantizable nodes, the list of quantizable nodes, and the list of partially-quantizable inputs with non-quantizable input info as (src, dst, input_name).
- Return type:
Tuple[List[Node], List[Node], List[Tuple[Node, Node, str]]]
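The third return value can be read as: any edge whose producer lies outside the node's partition is a partially-quantizable input. The helper below is an illustration with plain onnx protos, not the modelopt implementation, and it omits the quantizable/non-quantizable node split.

```python
import onnx

def partial_inputs(graph: onnx.GraphProto, partitions):
    """Collect (src, dst, input_name) edges crossing into each partition."""
    producers = {out: node for node in graph.node for out in node.output}
    no_quantize_inputs = []
    for partition in partitions:
        members = {id(node) for node in partition}
        for node in partition:
            for tensor in node.input:
                src = producers.get(tensor)
                if src is not None and id(src) not in members:
                    no_quantize_inputs.append((src, node, tensor))
    return no_quantize_inputs
```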
- convert_fp16_io(graph)
Convert graph I/O to FP16.
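A minimal sketch of the conversion with plain onnx, assuming only the FP32 graph inputs and outputs need retyping; the real helper may also adjust value_info and initializers.

```python
import onnx

def convert_io_to_fp16(model: onnx.ModelProto) -> None:
    """Retype every FP32 graph input/output as FP16, in place."""
    for value in list(model.graph.input) + list(model.graph.output):
        tensor_type = value.type.tensor_type
        if tensor_type.elem_type == onnx.TensorProto.FLOAT:
            tensor_type.elem_type = onnx.TensorProto.FLOAT16
```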
- expand_node_names_from_patterns(graph, name_patterns)
Expand the node names from the given patterns.
- Parameters:
graph (GraphProto | Graph) –
name_patterns (List[str]) –
- Return type:
List[str]
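The matching rule is not spelled out here; a plausible sketch treats each pattern as a glob and expands it with fnmatch:

```python
from fnmatch import fnmatch

import onnx

def expand_patterns(graph: onnx.GraphProto, patterns: list[str]) -> list[str]:
    """Return the names of all nodes matching any glob-style pattern."""
    return [
        node.name
        for node in graph.node
        if any(fnmatch(node.name, pattern) for pattern in patterns)
    ]
```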
- filter_quantizable_kgen_heads(cask_fusible_partitions, kgen_partitions, quantizable_op_types)
Returns the list of kgen head names if it follows a CASK partition.
- Parameters:
cask_fusible_partitions (List[List[Node]]) –
kgen_partitions (List[List[Node]]) –
quantizable_op_types (List[str]) –
- Return type:
Tuple[List[Node], List[Tuple[Node, Node, str]]]
- find_fp8_mha_partitions(graph)
Match FP8 MHA: Q -> DQ -> BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> Q -> DQ -> BMM2 -> Q -> DQ.
- find_mha_partitions(graph)
Match MHA: BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> BMM2.
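To illustrate the pattern above, here is a rough matcher for the BMM1 head of the chain, where the parenthesized ops are optional. It assumes a consumer map like the one returned by get_tensor_consumer_nodes, and it is a simplification: the real partition finder also collects the matched nodes.

```python
import onnx

def looks_like_mha_bmm1(node: onnx.NodeProto, consumers) -> bool:
    """True if `node` heads BMM1 -> (Mul/Div) -> (Add) -> Softmax -> (Cast) -> BMM2."""
    if node.op_type != "MatMul":
        return False
    optional = {"Mul", "Div", "Add", "Cast"}  # the parenthesized, skippable ops
    required = ["Softmax", "MatMul"]          # must appear in this order
    current, idx = node, 0
    while idx < len(required):
        nexts = consumers.get(current.output[0], [])
        if len(nexts) != 1:  # the chain must not branch
            return False
        current = nexts[0]
        if current.op_type == required[idx]:
            idx += 1
        elif current.op_type not in optional:
            return False
    return True
```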
- find_nodes_from_matmul_to_exclude(onnx_path, use_external_data_format=False, intermediate_generated_files=None, calibration_shapes=None, calibration_eps=['cuda:0', 'cpu', 'trt'], verbose=False)
Find MatMul nodes that meet the GEMV condition, to exclude them from quantization.
If either m or n of a MatMul is 1, the MatMul cannot utilize Tensor Cores, and adding Q/DQ layers to it performs poorly in TRT. Thus, in this case, Q/DQ layers are not added to the MatMul (a sketch of the check follows this entry).
- Parameters:
onnx_path (str) – Path to the onnx model.
use_external_data_format (bool) – If True, the model's weights are stored in a separate external data file.
intermediate_generated_files (List[str]) – List of intermediate generated files that will be deleted after quantization.
calibration_shapes (str) – Model input shapes for inference.
calibration_eps (List[str]) – Priority order for the execution providers (EP) to calibrate the model. Any subset of [‘cuda:x’, ‘cpu’, ‘trt’], where ‘x’ is the device id.
verbose (bool) – If True, print the matmul nodes to exclude.
- Returns:
List of node names to exclude from quantization.
- Return type:
List[str]
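A sketch of the GEMV test described above, with shapes obtained via onnx shape inference (an assumption about how the real function derives m and n):

```python
import onnx

def gemv_matmuls(model: onnx.ModelProto) -> list[str]:
    """Names of MatMul nodes where m or n is statically 1."""
    inferred = onnx.shape_inference.infer_shapes(model)
    graph = inferred.graph
    shapes = {}
    for vi in list(graph.value_info) + list(graph.input) + list(graph.output):
        shapes[vi.name] = [d.dim_value for d in vi.type.tensor_type.shape.dim]

    excluded = []
    for node in graph.node:
        if node.op_type != "MatMul":
            continue
        a = shapes.get(node.input[0])
        b = shapes.get(node.input[1])
        if not a or not b or len(a) < 2 or len(b) < 2:
            continue  # shape unknown: cannot decide statically
        m, n = a[-2], b[-1]  # MatMul computes (..., m, k) x (..., k, n)
        if m == 1 or n == 1:
            excluded.append(node.name)
    return excluded
```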
- find_nodes_from_mha_to_exclude(onnx_path, use_external_data_format=False, nodes_to_exclude=None, disable_mha_qdq=False, quantize_mode='int8', high_precision_dtype=None, intermediate_generated_files=None, calibration_shapes=None, calibration_eps=['cuda:0', 'cpu', 'trt'], verbose=False)
Find MatMul nodes in MHA pattern to exclude.
If disable_mha_qdq is set, Q/DQ layers are not added to the MatMuls in the MHA pattern. Otherwise, when quantize_mode == “fp8” and high_precision_dtype == “fp16”, Q/DQ layers are not added if head_size is not a multiple of 16 or the MHA pattern contains a mask Add. Otherwise, when quantize_mode == “int8”, Q/DQ layers are not added if seq_len > 512. (See the predicate sketch after this entry.)
- Parameters:
onnx_path (str) – Path to the onnx model.
use_external_data_format (bool) – If True, the model's weights are stored in a separate external data file.
nodes_to_exclude (List[str]) – List of Nodes to exclude from quantization.
disable_mha_qdq (bool) – If True, every MHA's BMM1 and BMM2 will be added to nodes_to_exclude. Otherwise, each MHA is checked to decide whether to enable QDQ (when is_fp8fp16 is True).
quantize_mode (str) – Quantization mode. One of ‘int8’ (default), ‘int4’ and ‘fp8’.
high_precision_dtype (str) – High precision data type, one of [‘fp32’, ‘fp16’]. If high_precision_dtype == ‘fp16’, model’s weight and activation will be converted to fp16.
intermediate_generated_files (List[str]) – List of intermediate generated files that will be deleted after quantization.
calibration_shapes (str) – Model input shapes for inference.
calibration_eps (List[str]) – Priority list of execution providers (EP) for calibration.
verbose (bool) – If True, print the matmul nodes to exclude.
- Returns:
List of node names to exclude from quantization.
- Return type:
List[str]
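The rules above condense into a small predicate. In this sketch, head_size, seq_len, and has_mask_add are stand-ins for values the real code derives from the matched MHA pattern:

```python
def should_exclude_mha(
    disable_mha_qdq: bool,
    quantize_mode: str,
    high_precision_dtype: str,
    head_size: int,
    seq_len: int,
    has_mask_add: bool,
) -> bool:
    """Decide whether an MHA's MatMuls should be excluded from Q/DQ insertion."""
    if disable_mha_qdq:
        return True
    if quantize_mode == "fp8" and high_precision_dtype == "fp16":
        return head_size % 16 != 0 or has_mask_add
    if quantize_mode == "int8":
        return seq_len > 512
    return False
```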
- find_nodes_to_exclude(graph, nodes_to_exclude, op_types_to_exclude)
Find the node names from the ONNX graph that match the user's exclusion patterns.
- Parameters:
graph (Graph) –
nodes_to_exclude (List[str]) –
op_types_to_exclude (List[str]) –
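A hypothetical call, combining explicit names, wildcard patterns, and whole op types (the names shown are made up):

```python
excluded = find_nodes_to_exclude(
    graph,
    nodes_to_exclude=["final_matmul", "/encoder/layer.0/attention/*"],
    op_types_to_exclude=["Gemm"],
)
```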
- get_fusible_backbone(node, graph)
Returns the linear backbone node for a given node if it matches the pattern.
TensorRT fuses convolution with BN, Relu, etc. when they appear in certain patterns. This rule tries to match some of those patterns. Note: BiasAdd and ConstMul are optional in the path types.
- Parameters:
node (Node) – Start node of the pattern.
graph (Graph) – ONNX model graph.
- Returns:
Backbone node of the given node, None if not found.
- Return type:
Node | None
- get_tensor_consumer_nodes(graph)
Returns a dictionary of tensor name and their consumer node object mapping.
- Parameters:
graph (GraphProto) – ONNX model graph.
- Returns:
Dictionary mapping each tensor name to the list of its consumer nodes.
- Return type:
Dict[str, List[NodeProto]]
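The consumer map is a one-pass scan over node inputs; a minimal sketch:

```python
from collections import defaultdict

import onnx

def tensor_consumers(graph: onnx.GraphProto) -> dict[str, list[onnx.NodeProto]]:
    """Map each tensor name to the nodes that consume it."""
    consumers: dict[str, list[onnx.NodeProto]] = defaultdict(list)
    for node in graph.node:
        for tensor in node.input:
            consumers[tensor].append(node)
    return dict(consumers)
```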
- get_tensor_producer_nodes(graph)
Returns a dictionary of tensor name and their producer node object mapping.
Note: a special Root-type node is created as the producer of external inputs, for ease of implementation.
- Parameters:
graph (GraphProto) – ONNX model graph.
- Returns:
Dictionary mapping each tensor name to its producer node.
- Return type:
Dict[str, NodeProto]
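A matching sketch for the producer map; following the note above, external graph inputs get a placeholder producer (the exact placeholder node is an assumption):

```python
import onnx

def tensor_producers(graph: onnx.GraphProto) -> dict[str, onnx.NodeProto]:
    """Map each tensor name to the node that produces it."""
    root = onnx.helper.make_node("Root", inputs=[], outputs=[], name="root_0")
    producers = {graph_input.name: root for graph_input in graph.input}
    for node in graph.node:
        for tensor in node.output:
            producers[tensor] = node
    return producers
```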
- has_const_input(node)
Returns whether the given node has any constant input.
- Parameters:
node (Node) –
- Return type:
bool
- has_path_type(node, graph, path_type, is_forward, wild_card_types=[], path_nodes=[])
Checks if the given node is start/end of a given forward/backward path type.
Note: a path can be forward or backward with respect to a node, depending on the next-level nodes. Additionally, this method can handle optional nodes and collect the traversed path.
- Parameters:
node (Node) – Start node of the path.
graph (Graph) – ONNX model graph.
path_type (List[str]) – Path types to match from the given node.
is_forward (bool) – Whether to match forward or backward path.
wild_card_types (List[str]) – Wildcard types; nodes of these types are skipped and not matched against the path_type.
path_nodes (List[Node]) – Accumulated nodes in the matched path.
- Returns:
Whether the given node is the start/end of the given forward/backward path type.
- Return type:
bool
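A hypothetical usage sketch: test whether conv_node heads a Conv -> BatchNormalization -> Relu chain, skipping Cast nodes and collecting the matched nodes (conv_node and graph are assumed to exist):

```python
path_nodes = []
matches = has_path_type(
    conv_node,
    graph,
    path_type=["Conv", "BatchNormalization", "Relu"],
    is_forward=True,
    wild_card_types=["Cast"],
    path_nodes=path_nodes,
)
# On a match, path_nodes holds the traversed Conv/BN/Relu nodes.
```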
- insert_fp8_mha_casts(onnx_model)
Insert three cast ops around each MatMul in the FP8 MHA partitions.
The first cast is added before the MatMul's input0 (FP16 to FP32), the second before input1 (FP16 to FP32), and the third after the output (FP32 back to FP16). Inserting Cast ops in the FP8 MHA part effectively forbids the MHAs from running with FP16 accumulation, because the compiler only has FP32-accumulation kernels for FP8 MHAs.
- insert_matmul_casts(graph, matmul_node)
Insert three cast nodes for MatMul’s two inputs and output.
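A sketch of the three-cast insertion around a single MatMul with plain onnx; tensor naming is simplified and node ordering is left to a later topological sort, so this is an illustration rather than the library's routine:

```python
import onnx
from onnx import TensorProto, helper

def cast_matmul_io(graph: onnx.GraphProto, matmul: onnx.NodeProto) -> None:
    """Wrap a MatMul with FP16->FP32 input casts and an FP32->FP16 output cast."""
    new_nodes = []
    for i in range(2):  # cast both inputs up to FP32
        cast_out = f"{matmul.name}_in{i}_fp32"
        new_nodes.append(helper.make_node(
            "Cast", [matmul.input[i]], [cast_out],
            name=f"{matmul.name}_cast_in{i}", to=TensorProto.FLOAT))
        matmul.input[i] = cast_out
    # Cast the output back down, keeping the original tensor name
    # so downstream consumers stay wired to FP16.
    raw_out = f"{matmul.name}_out_fp32"
    new_nodes.append(helper.make_node(
        "Cast", [raw_out], [matmul.output[0]],
        name=f"{matmul.name}_cast_out", to=TensorProto.FLOAT16))
    matmul.output[0] = raw_out
    graph.node.extend(new_nodes)
```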
- is_const_input(tensor)
Returns whether the given tensor is an initializer or produced by const-foldable nodes.
- Parameters:
tensor (Tensor) –
- Return type:
bool
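The "const-foldable" test can be read recursively: a tensor is constant if it is an initializer, or if its producer is a foldable op whose inputs are all constant. The foldable-op set below is an assumption for illustration:

```python
# Ops whose outputs are constant whenever all of their inputs are (assumed set).
FOLDABLE_OPS = {"Constant", "Identity", "Reshape", "Squeeze", "Unsqueeze", "Transpose"}

def is_const_tensor(name: str, initializers: set, producers: dict) -> bool:
    """True if `name` is an initializer or produced only by foldable ops."""
    if name in initializers:
        return True
    node = producers.get(name)
    if node is None or node.op_type not in FOLDABLE_OPS:
        return False
    return all(is_const_tensor(i, initializers, producers) for i in node.input)
```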
- print_stat(graph, verbose)
Collect and print stats of the quantized model.
- Parameters:
graph (Graph) –
verbose (bool) –
- Return type:
None
- remove_partial_input_qdq(graph, no_quantize_inputs)
Modifies the ONNX model by removing QDQ nodes from the marked inputs, e.g. non-residual inputs.
- Parameters:
graph (Graph) – ONNX model graph.
no_quantize_inputs (List[Tuple[Node, Node, str]]) – List of non-quantizable input info as (src, dst, input_name).
- Return type:
None