autotune

Pattern-Based Q/DQ Autotuning for ONNX Models.

This package provides automated optimization of Quantize/Dequantize (Q/DQ) node placement in ONNX computation graphs to minimize TensorRT inference latency. It uses pattern-based region analysis to efficiently explore and optimize Q/DQ insertion strategies.

Classes

ChildRegionInputInsertionPoint

Pattern-relative Q/DQ insertion point at a child region's input boundary (frozen/hashable).

ChildRegionOutputInsertionPoint

Pattern-relative Q/DQ insertion point at a child region or node output (frozen/hashable).

CombinedRegionSearch

Two-phase region search combining bottom-up partitioning with top-down refinement.

Config

Configuration parameters for QDQ autotuning.

InsertionScheme

Complete Q/DQ insertion specification for a region pattern.

NodeInputInsertionPoint

Pattern-relative Q/DQ insertion point at a node's input (frozen/hashable).

PatternCache

Pattern cache containing best-performing schemes for patterns with automatic eviction.

PatternSchemes

Collection of Q/DQ insertion schemes for a single pattern.

QDQAutotuner

Q/DQ autotuner with automatic region discovery around compute-intensive ops.

Region

A subgraph region in an ONNX graph, used as the unit for Q/DQ insertion.

RegionPattern

Represents a structural pattern of a region.

RegionType

Region type enumeration for hierarchical graph structure.

ResolvedInsertionPoint

Resolved Q/DQ insertion point with actual tensor name and optional node context.

TensorRTPyBenchmark

TensorRT benchmark using Python API with plugin support.

TrtExecBenchmark

TensorRT benchmark using trtexec command-line tool.

exception AutotunerError

Bases: Exception

Base exception for autotuner-related errors.

exception AutotunerNotInitializedError

Bases: AutotunerError

Exception raised when autotuner is used without initialization.

class ChildRegionInputInsertionPoint

Bases: InsertionPoint

Pattern-relative Q/DQ insertion point at a child region’s input boundary (frozen/hashable).

Specifies where to insert Q/DQ pairs at the input boundaries of child regions within COMPOSITE regions. This allows parent regions to control quantization at child boundaries, potentially overriding or complementing child region optimizations.

Only applies to COMPOSITE regions; LEAF regions have no children.

This class is immutable (frozen) to allow safe use in sets and as dict keys.
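The frozen/hashable property matters when insertion points are deduplicated or used as keys. The following is a minimal sketch, assuming a frozen dataclass; `ChildInputPointSketch` is a hypothetical stand-in that only mirrors the two documented fields and is not the library class.

```python
from dataclasses import dataclass, FrozenInstanceError


# Hypothetical stand-in (not the library class): mirrors the two documented
# fields to illustrate frozen/hashable behavior.
@dataclass(frozen=True)
class ChildInputPointSketch:
    region_index: int
    input_index: int


a = ChildInputPointSketch(region_index=0, input_index=1)
b = ChildInputPointSketch(region_index=0, input_index=1)

# Equal field values compare equal and hash equal, so duplicates collapse.
unique_points = {a, b}

# Instances can serve as dict keys, e.g. to attach notes to a point.
notes = {a: "input 1 of child region 0"}

# Frozen instances reject attribute assignment.
try:
    a.input_index = 2
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because equality is field-based, looking up `notes[b]` finds the entry stored under `a`.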

__init__(region_index, input_index)
Parameters:
  • region_index (int)

  • input_index (int)

Return type:

None

static collect_from_region(region, graph)

Collect all valid child region input insertion points from a region.

Parameters:
  • region (Region)

  • graph (Graph)

Return type:

list[ChildRegionInputInsertionPoint]

classmethod from_dict(data)

Create from dictionary.

Parameters:

data (dict[str, Any])

Return type:

ChildRegionInputInsertionPoint

input_index: int
region_index: int
resolve(region, graph)

Resolve a child region input insertion point to actual tensor names.

Parameters:
  • region (Region)

  • graph (Graph)

Return type:

set[ResolvedInsertionPoint]

to_dict()

Convert to dictionary for serialization.

Return type:

dict[str, Any]

class ChildRegionOutputInsertionPoint

Bases: InsertionPoint

Pattern-relative Q/DQ insertion point at a child region or node output (frozen/hashable).

Specifies where to insert Q/DQ pairs at output boundaries. This can be either:

  1. Output from a child region (in COMPOSITE regions)

  2. Output from a node within the region

This class is immutable (frozen) to allow safe use in sets and as dict keys.

__init__(region_index, node_index, output_index)
Parameters:
  • region_index (int | None)

  • node_index (int | None)

  • output_index (int)

Return type:

None

static collect_from_region(region, graph)

Collect all valid region output insertion points from a region.

Parameters:
  • region (Region)

  • graph (Graph)

Return type:

list[ChildRegionOutputInsertionPoint]

classmethod from_dict(data)

Create from dictionary.

Parameters:

data (dict[str, Any])

Return type:

ChildRegionOutputInsertionPoint

node_index: int | None
output_index: int
region_index: int | None
resolve(region, graph)

Resolve a region output insertion point to actual tensor names.

Parameters:
  • region (Region)

  • graph (Graph)

Return type:

set[ResolvedInsertionPoint]

to_dict()

Convert to dictionary for serialization.

Return type:

dict[str, Any]

class CombinedRegionSearch

Bases: RegionSearchBase

Two-phase region search combining bottom-up partitioning with top-down refinement.

This class implements a region discovery algorithm that combines two complementary strategies, bottom-up partitioning followed by top-down refinement, to create well-formed, hierarchical regions from an ONNX computation graph.

__init__(graph, maximum_sequence_region_size=10, minimum_topdown_search_size=10)

Initialize CombinedRegionSearch for a given ONNX graph.

Parameters:
  • graph (Graph)

  • maximum_sequence_region_size (int)

  • minimum_topdown_search_size (int)

search_regions()

Execute two-phase region search to partition the graph into hierarchical regions.

  1. Bottom-up partitioning

  2. Top-down refinement

Parameters:

None

Returns:

List of hierarchical regions created from the graph

Return type:

list[Region]

class Config

Bases: object

Configuration parameters for QDQ autotuning.

Controls the autotuning process including performance requirements, quantization parameters, region building, scheme generation, and finetuning behavior.

Key fields and their defaults:

  • verbose: Enable detailed logging of autotuning progress (default: False).

  • performance_threshold: Minimum speedup ratio to accept a scheme; 1.0 = no improvement required, 1.02 = 2% improvement (default: 1.02).

  • default_q_scale: Default scale for Q/DQ nodes; typical range 0.01-0.1 (default: 0.1).

  • default_q_zero_point: Zero-point for Q/DQ; 0 for int8, 128 for uint8 (default: 0).

  • default_quant_type: Quantization type; 'int8' (default) or 'fp8'.

  • default_dq_dtype: Dtype for DequantizeLinear output; 'float32' (default) or 'float16'.

  • maximum_sequence_region_size: Max nodes in a sequence region (default: 10).

  • minimum_topdown_search_size: Min nodes to trigger top-down search (default: 10).

  • top_percent_to_mutate: Top fraction of schemes used as mutation seeds (default: 0.1).

  • minimum_schemes_to_mutate: Min schemes to keep as mutation seeds (default: 10).

  • maximum_mutations: Max mutations per scheme during generation (default: 3).

  • maximum_generation_attempts: Max attempts to generate a unique scheme (default: 100).

  • pattern_cache_minimum_distance: Min edit distance between cached schemes (default: 4).

  • pattern_cache_max_entries_per_pattern: Max schemes per pattern in cache (default: 32).
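To make two of these fields concrete, here is a hedged sketch of how `performance_threshold` and the mutation-seed settings could be interpreted. The helper names `accepts_scheme` and `seed_count` are hypothetical, and the exact formulas (speedup as baseline latency divided by candidate latency, seed count as the larger of the fraction and the minimum) are assumptions, not the autotuner's confirmed logic.

```python
import math


def accepts_scheme(baseline_ms, candidate_ms, performance_threshold=1.02):
    """Assumed rule: speedup = baseline / candidate must reach the threshold."""
    if not math.isfinite(candidate_ms) or candidate_ms <= 0:
        return False  # unmeasured or failed candidates are never accepted
    return baseline_ms / candidate_ms >= performance_threshold


def seed_count(num_schemes, top_percent_to_mutate=0.1, minimum_schemes_to_mutate=10):
    """Assumed rule: take the top fraction, but at least the configured minimum."""
    wanted = max(minimum_schemes_to_mutate, math.ceil(num_schemes * top_percent_to_mutate))
    return min(num_schemes, wanted)  # cannot pick more seeds than schemes exist


fast_enough = accepts_scheme(baseline_ms=10.0, candidate_ms=9.5)   # about 5% faster
too_marginal = accepts_scheme(baseline_ms=10.0, candidate_ms=9.9)  # about 1% faster
```

Under these assumptions, a threshold of 1.02 rejects the 1% improvement as likely measurement noise while accepting the 5% one.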

__init__(verbose=False, performance_threshold=1.02, default_q_scale=0.1, default_q_zero_point=0, default_quant_type='int8', default_dq_dtype='float32', maximum_sequence_region_size=10, minimum_topdown_search_size=10, top_percent_to_mutate=0.1, minimum_schemes_to_mutate=10, maximum_mutations=3, maximum_generation_attempts=100, pattern_cache_minimum_distance=4, pattern_cache_max_entries_per_pattern=32)
Parameters:
  • verbose (bool)

  • performance_threshold (float)

  • default_q_scale (float)

  • default_q_zero_point (int)

  • default_quant_type (str)

  • default_dq_dtype (str)

  • maximum_sequence_region_size (int)

  • minimum_topdown_search_size (int)

  • top_percent_to_mutate (float)

  • minimum_schemes_to_mutate (int)

  • maximum_mutations (int)

  • maximum_generation_attempts (int)

  • pattern_cache_minimum_distance (int)

  • pattern_cache_max_entries_per_pattern (int)

Return type:

None

default_dq_dtype: str = 'float32'
default_q_scale: float = 0.1
default_q_zero_point: int = 0
default_quant_type: str = 'int8'
maximum_generation_attempts: int = 100
maximum_mutations: int = 3
maximum_sequence_region_size: int = 10
minimum_schemes_to_mutate: int = 10
minimum_topdown_search_size: int = 10
pattern_cache_max_entries_per_pattern: int = 32
pattern_cache_minimum_distance: int = 4
performance_threshold: float = 1.02
top_percent_to_mutate: float = 0.1
verbose: bool = False
class InsertionScheme

Bases: object

Complete Q/DQ insertion specification for a region pattern.

An InsertionScheme defines a complete Q/DQ configuration for a pattern, combining both node-level and region-level insertion points. The scheme is applied to all regions matching the pattern.

__init__(node_inputs=<factory>, child_region_inputs=<factory>, region_outputs=<factory>, latency_ms=inf, error=False, profile_timestamp=None)
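Parameters:
  • node_inputs (list[NodeInputInsertionPoint])

  • child_region_inputs (list[ChildRegionInputInsertionPoint])

  • region_outputs (list[ChildRegionOutputInsertionPoint])

  • latency_ms (float)

  • error (bool)

  • profile_timestamp (str | None)

Return type: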

None

child_region_inputs: list[ChildRegionInputInsertionPoint]
distance(other)

Compute edit distance between this scheme and another scheme.

The edit distance is the minimum number of add/remove operations needed to transform this scheme into the other scheme. This is computed as the symmetric difference between the insertion point sets.

Parameters:

other (InsertionScheme) – InsertionScheme to compare against

Returns:

Total edit distance (number of add + remove operations)

Return type:

int
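The symmetric-difference definition of the edit distance can be illustrated with plain sets. This is a sketch only: insertion points are modeled as hypothetical tuples rather than the library's insertion point classes, and `scheme_distance` is an illustrative helper name.

```python
def scheme_distance(points_a, points_b):
    """Number of add + remove operations needed to turn points_a into points_b."""
    return len(points_a ^ points_b)  # ^ is the set symmetric difference


# Hypothetical insertion points encoded as (kind, index, input_or_output_index).
scheme_a = {("node_input", 0, 0), ("node_input", 1, 0), ("region_output", 0, 0)}
scheme_b = {("node_input", 0, 0), ("node_input", 2, 1)}

# Remove two points from scheme_a and add one: three operations in total.
d = scheme_distance(scheme_a, scheme_b)
```

The measure is symmetric, and two schemes with identical insertion points are at distance 0.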

error: bool = False
classmethod from_dict(data)

Create InsertionScheme from serialized dictionary.

Parameters:

data (dict[str, Any])

Return type:

InsertionScheme

property hash: str

Compute deterministic hash for scheme identity.

The hash uniquely identifies this scheme configuration based on its insertion points. Two schemes with identical insertion points produce the same hash, regardless of their measured latencies.
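A sketch of a hash with these properties is shown below. The hashing algorithm and canonical form are assumptions chosen for illustration (the source does not state them), and insertion points are modeled as hypothetical tuples.

```python
import hashlib


def scheme_hash(node_inputs, child_region_inputs, region_outputs):
    """Order-independent hash of the insertion points; latency is not an input."""
    canonical = repr(
        (
            sorted(node_inputs, key=repr),
            sorted(child_region_inputs, key=repr),
            sorted(region_outputs, key=repr),
        )
    )
    # SHA-256 is an assumption; any stable digest of the canonical form works.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


h1 = scheme_hash([(0, 0), (1, 0)], [], [(None, 0, 0)])
h2 = scheme_hash([(1, 0), (0, 0)], [], [(None, 0, 0)])  # same points, other order
h3 = scheme_hash([(0, 0)], [], [(None, 0, 0)])          # one point fewer
```

Sorting before hashing is what makes the result independent of list order, and leaving measurements out of the canonical form is what makes it independent of latency.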

property is_empty: bool

Check if this is a baseline scheme with no Q/DQ insertions.

property is_profiled: bool

Check if this scheme has been profiled (measured).

A scheme is considered profiled if it has been measured (has non-infinite latency) or has encountered an error during measurement.

latency_ms: float = inf
node_inputs: list[NodeInputInsertionPoint]
profile_timestamp: str | None = None
region_outputs: list[ChildRegionOutputInsertionPoint]
to_dict()

Convert to dictionary for serialization.

Return type:

dict[str, Any]

exception InvalidSchemeError

Bases: AutotunerError

Exception raised when an invalid scheme is referenced.

class NodeInputInsertionPoint

Bases: InsertionPoint

Pattern-relative Q/DQ insertion point at a node’s input (frozen/hashable).

Specifies where to insert a Q/DQ pair within a region pattern using pattern-relative indices rather than absolute node IDs. This enables insertion scheme reuse across all regions matching the same pattern.

This class is immutable (frozen) to allow safe use in sets and as dict keys.
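The value of pattern-relative indices is that one insertion point resolves to different tensors in different regions that share a pattern. The sketch below uses hypothetical node IDs, tensor names, and a helper `resolve_node_input`; it does not use the library's Region or Graph types.

```python
# Hypothetical stand-ins: two regions sharing one pattern, each given as
# absolute node IDs in pattern order, plus a map of node ID -> input tensors.
region_a_nodes = [17, 21, 30]
region_b_nodes = [40, 44, 52]
node_inputs = {
    17: ["x"], 21: ["conv1_out", "w2"], 30: ["relu1_out"],
    40: ["y"], 44: ["conv7_out", "w8"], 52: ["relu7_out"],
}


def resolve_node_input(node_index, input_index, region_nodes):
    """Map a pattern-relative (node_index, input_index) to a tensor name."""
    absolute_node = region_nodes[node_index]  # position in pattern -> graph node ID
    return node_inputs[absolute_node][input_index]


# The same pattern-relative point (node 1, input 0) applied to both regions.
tensor_a = resolve_node_input(1, 0, region_a_nodes)
tensor_b = resolve_node_input(1, 0, region_b_nodes)
```

Here the single point resolves to `conv1_out` in the first region and `conv7_out` in the second, which is why one scheme can be reused across all matching regions.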

__init__(node_index, input_index)
Parameters:
  • node_index (int)

  • input_index (int)

Return type:

None

static collect_from_region(region, graph)

Collect all valid node input insertion points from a region.

Parameters:
  • region (Region)

  • graph (Graph)

Return type:

list[NodeInputInsertionPoint]

classmethod from_dict(data)

Create from dictionary.

Parameters:

data (dict[str, Any])

Return type:

NodeInputInsertionPoint

input_index: int
node_index: int
resolve(region, graph)

Resolve a node input insertion point to actual tensor names for a matching region.

Parameters:
  • region (Region)

  • graph (Graph)

Return type:

set[ResolvedInsertionPoint]

to_dict()

Convert to dictionary for serialization.

Return type:

dict[str, Any]

class PatternCache

Bases: object

Pattern cache containing best-performing schemes for patterns with automatic eviction.

Stores a collection of PatternSchemes that can be used as seeds for autotuning. Each PatternSchemes contains high-performing insertion schemes for a specific pattern signature. The cache automatically evicts non-performant schemes based on:

  • Error status (schemes with errors are evicted)

  • Duplicate schemes (only the better-performing duplicate is kept)

  • Similarity (of similar schemes, only the better-performing one is kept)

  • Count limit (only the top N best schemes are kept per pattern)
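The eviction criteria can be sketched as a single greedy pass. This is an illustration under assumptions, not the cache's actual implementation: schemes are modeled as plain dicts, "similar" is taken to mean an edit distance below `minimum_distance`, and `evict` is a hypothetical helper name.

```python
def evict(schemes, minimum_distance=4, max_entries_per_pattern=32):
    """Keep the fastest error-free schemes that are mutually far enough apart."""
    kept = []
    candidates = sorted(
        (s for s in schemes if not s["error"]),  # drop schemes with errors
        key=lambda s: s["latency_ms"],           # best (lowest latency) first
    )
    for scheme in candidates:
        # A duplicate has distance 0, so this check covers duplicates as well
        # as near-duplicates; the faster scheme was already kept earlier.
        if all(len(scheme["points"] ^ k["points"]) >= minimum_distance for k in kept):
            kept.append(scheme)
        if len(kept) == max_entries_per_pattern:  # count limit
            break
    return kept


schemes = [
    {"points": frozenset({1, 2, 3, 4}), "latency_ms": 5.0, "error": False},
    {"points": frozenset({1, 2, 3, 4}), "latency_ms": 6.0, "error": False},  # duplicate
    {"points": frozenset({1, 2, 3, 5}), "latency_ms": 5.5, "error": False},  # distance 2
    {"points": frozenset({6, 7, 8, 9}), "latency_ms": 7.0, "error": False},  # distance 8
    {"points": frozenset({10, 11, 12, 13}), "latency_ms": 1.0, "error": True},
]
kept_latencies = [s["latency_ms"] for s in evict(schemes)]
```

Processing candidates from fastest to slowest means that whenever two schemes conflict, the better-performing one is the one already in the kept list.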

__init__(pattern_schemes=<factory>, minimum_distance=4, max_entries_per_pattern=32)
Parameters:
  • pattern_schemes (list[PatternSchemes])

  • minimum_distance (int)

  • max_entries_per_pattern (int)

Return type:

None

add_pattern_from_region(region, graph, quantized_tensors)

Build and add a pattern cache entry from a region in a quantized model.

Analyzes a region from an already-quantized model to extract its Q/DQ insertion scheme. This allows capturing known-good quantization strategies from existing models and using them as seeds for autotuning.

Parameters:
  • region (Region) – Region from the quantized model to analyze

  • graph (Graph) – ONNX graph containing the region

  • quantized_tensors (set[str]) – Set of tensor names that have Q/DQ nodes

Return type:

None

Example

>>> cache = PatternCache()
>>> for region in all_regions:
...     cache.add_pattern_from_region(region, graph, quantized_tensors)
>>> cache.save("learned_patterns.yaml")
add_pattern_schemes(pattern_schemes)

Add PatternSchemes to pattern cache with automatic eviction of non-performant entries.

Merges new schemes with existing schemes for the same pattern, automatically evicting schemes that are non-performant based on multiple criteria.

Parameters:

pattern_schemes (PatternSchemes) – PatternSchemes to add to the cache

Return type:

None

classmethod from_dict(data)

Create PatternCache from serialized dictionary.

Note: RegionPattern objects are not restored (they’re runtime objects). Only pattern signatures and scheme data are loaded.

Parameters:

data (dict[str, Any]) – Dictionary containing pattern cache data

Returns:

Reconstructed PatternCache instance

Return type:

PatternCache

get_pattern_schemes(pattern_signature)

Get PatternSchemes for a specific pattern signature.

Parameters:

pattern_signature (str) – Pattern signature to lookup

Returns:

PatternSchemes if found, None otherwise

Return type:

PatternSchemes | None

has_pattern(pattern_signature)

Check if pattern cache contains a specific pattern.

Parameters:

pattern_signature (str) – Pattern signature to check

Returns:

True if pattern exists in pattern cache

Return type:

bool

classmethod load(input_path)

Load pattern cache from a YAML file.

Reads a previously saved pattern cache file and reconstructs all pattern schemes. The loaded pattern cache can be used to seed autotuning with known-good insertion schemes.

Parameters:

input_path (str) – File path to the YAML pattern cache file to load

Returns:

PatternCache instance with all pattern schemes loaded

Raises:

FileNotFoundError – If the input_path doesn’t exist

Return type:

PatternCache

max_entries_per_pattern: int = 32
merge(other, prefer_existing=True)

Merge another PatternCache into this one.

Parameters:
  • other (PatternCache) – PatternCache to merge

  • prefer_existing (bool) – If True, keep existing patterns when there’s a conflict. If False, overwrite with other’s patterns.

Return type:

None

minimum_distance: int = 4
property num_patterns: int

Get number of patterns in pattern cache.

pattern_schemes: list[PatternSchemes]
save(output_path)

Save pattern cache to a YAML file.

Serializes all pattern schemes and their insertion points to a YAML file that can be loaded later for seeded autotuning. The format matches the autotuner state file format for consistency.

Parameters:

output_path (str) – File path where the YAML pattern cache file will be written

Return type:

None

to_dict()

Convert to dictionary for serialization.

Returns:

Dictionary with ‘minimum_distance’, ‘max_entries_per_pattern’, and ‘pattern_schemes’ keys

Return type:

dict[str, Any]

property total_schemes: int

Get total number of schemes across all patterns.

class PatternSchemes

Bases: object

Collection of Q/DQ insertion schemes for a single pattern.

Manages multiple InsertionScheme candidates for a region pattern, tracking their performance and identifying the best-performing configuration. This enables pattern-based optimization where all regions with the same structure use the same Q/DQ insertion strategy.

Workflow:

  1. Pattern is identified from region structure

  2. Multiple schemes are generated and tested

  3. Each scheme is measured (latency_ms)

  4. Best scheme is selected (lowest latency)

  5. Best scheme is applied to all matching regions

Best Scheme Selection:

  • Automatically identifies scheme with lowest latency

  • Excludes schemes with errors (error=True)

  • Schemes with latency_ms = inf are considered unmeasured

  • best_scheme property provides easy access to optimal configuration

Attributes:

  • pattern: RegionPattern defining the structural signature

  • schemes: List of InsertionScheme candidates with measurements

__init__(pattern=None, schemes=<factory>)
Parameters:
  • pattern (RegionPattern | None)

  • schemes (list[InsertionScheme])

Return type:

None

property best_scheme: InsertionScheme | None

Get the best performing scheme (lowest latency).

Scans all schemes to find the one with minimum latency_ms, excluding schemes with errors.

Returns:

InsertionScheme with lowest latency (excluding error schemes), or None if no valid schemes exist
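The selection rule can be sketched as follows. `SchemeSketch` and `best_scheme` are hypothetical stand-ins that carry only the fields needed here. How the real property treats a collection containing only unmeasured schemes is not stated in the source, so this sketch simply returns whichever error-free scheme has the smallest latency.

```python
import math
from dataclasses import dataclass


# Hypothetical stand-in with only the fields the selection rule needs.
@dataclass
class SchemeSketch:
    name: str
    latency_ms: float = math.inf  # inf marks an unmeasured scheme
    error: bool = False


def best_scheme(schemes):
    """Lowest-latency scheme among those without errors, or None if none exist."""
    valid = [s for s in schemes if not s.error]
    return min(valid, key=lambda s: s.latency_ms, default=None)


candidates = [
    SchemeSketch("a", latency_ms=3.2),
    SchemeSketch("b", latency_ms=2.9, error=True),  # fastest, but excluded
    SchemeSketch("c", latency_ms=3.0),
    SchemeSketch("d"),                              # unmeasured
]
winner = best_scheme(candidates)
```

The scheme with the error is skipped even though it reports the lowest latency, so `winner` is scheme `c`.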

classmethod from_dict(data, pattern=None)

Create PatternSchemes from serialized dictionary.

Reconstructs the pattern schemes collection from saved data. The RegionPattern object must be provided separately since it’s not serialized (it’s a runtime object computed from the graph).

If no pattern is provided, creates a minimal RegionPattern from the saved signature and size for signature matching purposes.

Parameters:
  • data (dict[str, Any]) – Dictionary containing ‘pattern_signature’, ‘pattern_size’, and ‘schemes’ keys

  • pattern (RegionPattern | None) – RegionPattern object to associate (must match signature). If None, creates minimal pattern from saved data.

Returns:

Reconstructed PatternSchemes instance

Return type:

PatternSchemes

property num_schemes: int

Get total number of schemes.

pattern: RegionPattern | None = None
property pattern_signature: str

Get the pattern signature string.

property pattern_size: int

Get the pattern size (total node count).

schemes: list[InsertionScheme]
to_dict()

Convert to dictionary for serialization.

Note: Excludes runtime objects like pattern (RegionPattern). Only serializes metadata and schemes.

Return type:

dict[str, Any]

class QDQAutotuner

Bases: QDQAutotunerBase

Q/DQ autotuner with automatic region discovery around compute-intensive ops.

initialize(config=None, pattern_cache=None)

Initialize autotuner and discover optimization regions automatically.

Extends base class initialization by automatically searching for regions after configuration is set up. Regions are discovered using pattern-based search around compute-intensive operations.

Parameters:
  • config (Config | None)

  • pattern_cache (PatternCache | None)

Return type:

None

class Region

Bases: object

A subgraph region in an ONNX graph, used as the unit for Q/DQ insertion.

Regions form a hierarchy: ROOT contains the entire graph, COMPOSITE regions contain child regions, and LEAF regions contain only nodes. Each region tracks its direct nodes, input/output tensors, and a pattern signature for matching regions with identical structure.
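The difference between direct nodes and nodes reached through descendants can be sketched with a small tree. `RegionSketch` is a hypothetical stand-in, and the use of a visited set of region IDs to guard against revisiting a region is an assumption about what the `_visited` parameters on the recursive methods are for.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for a region: direct node indices plus child regions.
@dataclass
class RegionSketch:
    region_id: int
    nodes: set = field(default_factory=set)
    children: list = field(default_factory=list)

    def all_nodes(self, _visited=None):
        """Collect node indices from this region and all of its descendants."""
        visited = set() if _visited is None else _visited
        if self.region_id in visited:  # already traversed: contribute nothing
            return set()
        visited.add(self.region_id)
        result = set(self.nodes)
        for child in self.children:
            result |= child.all_nodes(visited)
        return result


leaf_1 = RegionSketch(region_id=1, nodes={0, 1})
leaf_2 = RegionSketch(region_id=2, nodes={2})
composite = RegionSketch(region_id=3, nodes={3}, children=[leaf_1, leaf_2])
root = RegionSketch(region_id=4, children=[composite])

direct_only = composite.nodes       # nodes owned by the composite itself
recursive = composite.all_nodes()   # plus everything under its children
```

In this tree the composite owns only node 3 directly, while the recursive view also includes nodes 0, 1, and 2 from its leaf children.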

__init__(region_id, level, region_type)

Initialize a new region.

Parameters:
  • region_id (int) – Unique identifier within the region hierarchy

  • level (int) – Hierarchical level (0 = leaf, higher = more composite)

  • region_type (RegionType) – Type classification (LEAF, COMPOSITE, or ROOT)

add_child(child)

Add a child sub-region.

Parameters:

child (Region)

Return type:

None

contains_node(node_index)

Check if region contains a specific node (direct only).

Parameters:

node_index (int)

Return type:

bool

contains_node_within_region_and_descendants(node_index)

Check if region contains a node recursively.

Parameters:

node_index (int)

Return type:

bool

get_children(*, sort=False)

Get all child regions. If sort is True, sort the children by level and size.

Parameters:

sort (bool) – Whether to sort the children by level and size

Returns:

List of child regions

Return type:

list[Region]

get_nodes(*, sort=False)

Get direct node indices in this region only.

Parameters:

sort (bool)

Return type:

list[int]

get_region_nodes_and_descendants(_visited=None)

Get all node indices recursively, including descendants.

Parameters:

_visited (set[int] | None)

Return type:

set[int]

get_size_of_region_and_descendants(_visited=None)

Get total node count recursively including all descendants.

Parameters:

_visited (set[int] | None)

Return type:

int

is_descendant_of(potential_ancestor)

Check if this region is a descendant of potential_ancestor.

Parameters:

potential_ancestor (Region)

Return type:

bool

merge(other)

Merge another region into this one.

Parameters:

other (Region)

Return type:

None

remove_child(child)

Remove a child region from this region’s children list.

Parameters:

child (Region)

Return type:

bool

class RegionPattern

Bases: object

Represents a structural pattern of a region.

The pattern captures the topology and operation types in a region, enabling pattern matching and region comparison. Patterns are hashable and can be used as dictionary keys for efficient grouping and lookup.

__init__(signature, size)

Initialize a region pattern.

Parameters:
  • signature (str) – The structural signature of the region.

  • size (int) – The number of nodes in the region.

format_tree(region, graph, indent=0)

Format this pattern and region as a human-readable tree.

Useful for debugging and visualization.

Parameters:
  • region (Region) – The region associated with this pattern

  • graph (Graph) – The ONNX graph

  • indent (int) – Indentation level

Returns:

Formatted string representation

Return type:

str

classmethod from_region(region, graph)

Compute a structural pattern for a region.

The pattern captures:

  • Direct node operations in the region

  • Structure of sub-regions (recursively)

It also handles symmetric operations consistently and sorts sub-regions by size for determinism.

Parameters:
  • region (Region) – The region to compute pattern for

  • graph (Graph) – The ONNX graph containing the nodes

Returns:

RegionPattern object containing the signature and metadata

Return type:

RegionPattern

get_full_insertion_scheme(region, graph)

Collect all possible insertion points for quantization in a region.

This method gathers all locations where Q/DQ nodes could be inserted within a region’s computational graph. These insertion points are organized into three categories:

  • node_inputs: Inputs to individual nodes within the region

  • child_region_inputs: Inputs to child regions within composite regions

  • region_outputs: Outputs from the region or its child regions

Parameters:
  • region (Region) – The region to collect insertion points for

  • graph (Graph) – The ONNX graph containing the nodes

Returns:

InsertionScheme object containing the insertion points

Return type:

InsertionScheme

get_hash()

Get a 128-bit cryptographic hash of the pattern signature.

Return type:

str
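For a sense of what a 128-bit hash of a signature looks like, here is a sketch. The choice of BLAKE2b and the example signature string are assumptions for illustration; the source states only that the hash is 128-bit and cryptographic.

```python
import hashlib


def pattern_hash(signature):
    """Hex digest of a 128-bit hash of the signature string."""
    # digest_size=16 bytes gives 128 bits, i.e. 32 hexadecimal characters.
    return hashlib.blake2b(signature.encode("utf-8"), digest_size=16).hexdigest()


# Hypothetical signature text; the real signature format is not shown here.
digest = pattern_hash("Conv->Relu->Add")
```

A fixed-length digest like this is convenient as a compact key when full signatures are long.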

get_short_signature(max_length=80)

Get a truncated version of the signature for display purposes.

Parameters:

max_length (int)

Return type:

str

property is_composite: bool

Check if the pattern represents a composite region.

property is_empty: bool

Check if the pattern represents an empty region.

property is_leaf: bool

Check if the pattern represents a leaf region (no composite structure).

matches(other: RegionPattern) -> bool
matches(other: Region, graph: Graph, scheme: None = None) -> list[int] | None
matches(other: Region, graph: Graph, scheme: InsertionScheme) -> set[ResolvedInsertionPoint]

Check if this pattern matches another pattern or region.

This method provides three distinct behaviors depending on the arguments:

  1. Pattern-to-pattern comparison (other is RegionPattern, scheme is None): Returns bool indicating structural equivalence.

  2. Pattern-to-region matching (other is Region, scheme is None): Returns list of node IDs in pattern order if match succeeds, None otherwise.

  3. Pattern-to-region with insertion scheme (other is Region, scheme provided): Returns set of resolved insertion points where Q/DQ should be inserted, considering:

    • NodeInputInsertionPoints from the scheme (node-level Q/DQ)

    • ChildRegionInputInsertionPoints from the scheme (child region input Q/DQ)

    • ChildRegionOutputInsertionPoints from the scheme (region output Q/DQ)

     Returns empty set if pattern doesn’t match.

Parameters:
  • other – Either a RegionPattern or Region to compare with

  • graph – Required when other is a Region (for computing its pattern)

  • scheme – Optional InsertionScheme containing node_inputs, child_region_inputs, and region_outputs to resolve to tensor names

Returns:

  • True if other is RegionPattern and patterns match

  • List of node IDs in pattern order if other is Region and scheme is None, None if no match

  • Set of resolved insertion points for Q/DQ insertion if other is Region and scheme is provided

Raises:
  • ValueError – If other is Region but graph is not provided, or if scheme is provided but other is not a Region

  • TypeError – If other is neither RegionPattern nor Region

class RegionType

Bases: Enum

Region type enumeration for hierarchical graph structure.

  • LEAF: Atomic region containing direct nodes with no child regions

  • COMPOSITE: Hierarchical region containing child regions (and optionally direct nodes)

  • ROOT: Top-level region encompassing the entire computation graph

COMPOSITE = 'COMPOSITE'
LEAF = 'LEAF'
ROOT = 'ROOT'
class ResolvedInsertionPoint

Bases: object

Resolved Q/DQ insertion point with actual tensor name and optional node context.

After resolving pattern-relative insertion points, this class represents the actual location where Q/DQ pairs should be inserted in the graph. It contains the tensor name and the node index (if applicable) and input index (if applicable).

This class is immutable (frozen) to allow safe use in sets and as dict keys.

__init__(tensor_name, node_index=None, input_index=None)
Parameters:
  • tensor_name (str)

  • node_index (int | None)

  • input_index (int | None)

Return type:

None

classmethod from_dict(data)

Create from dictionary.

Parameters:

data (dict[str, Any])

Return type:

ResolvedInsertionPoint

input_index: int | None = None
node_index: int | None = None
tensor_name: str
to_dict()

Convert to dictionary for serialization.

Return type:

dict[str, Any]

class TensorRTPyBenchmark

Bases: Benchmark

TensorRT benchmark using Python API with plugin support.

This implementation directly uses the TensorRT Python API to build engines and measure inference latency. It provides more control than trtexec and can be faster for certain workflows as it avoids subprocess overhead.

__init__(timing_cache_file=None, warmup_runs=5, timing_runs=20, plugin_libraries=None)

Initialize the TensorRT Python API benchmark.

Creates persistent TensorRT objects (Logger, Builder, Runtime) and loads the timing cache from disk if available. Optionally loads custom TensorRT plugin libraries for models with custom operations.

Parameters:
  • timing_cache_file (str | None) – Path to TensorRT timing cache file. If None, defaults to ‘/tmp/trtexec_timing.cache’.

  • warmup_runs (int) – Number of warmup iterations before timing measurements.

  • timing_runs (int) – Number of iterations for latency measurement.

  • plugin_libraries (list[str] | None) – List of paths to TensorRT plugin shared libraries (.so files). These plugins will be loaded and registered for use during engine building. If None, no custom plugins are loaded.

Raises:
  • ImportError – If tensorrt or cuda-python (cudart) packages are not available.

  • FileNotFoundError – If a specified plugin library file does not exist.

  • RuntimeError – If plugin library loading fails.

run(path_or_bytes, log_file=None, flush_timing_cache=False)

Run benchmark using TensorRT Python API.

Parameters:
  • path_or_bytes (str | bytes) – Path to the ONNX model (str) or raw model data (bytes)

  • log_file (str | None) – Optional path to save benchmark logs

  • flush_timing_cache (bool) – If True, save the timing cache to disk after engine build.

Returns:

Measured median latency in milliseconds, or float("inf") on any error (e.g. build failure, deserialization failure, buffer/stream allocation failure).

Return type:

float
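The measurement contract (warmup runs, then timed runs, median reported, infinity on failure) can be sketched without TensorRT. `measure_median_ms` and the `run_once` callable are hypothetical; the real benchmark times GPU inference, whereas this sketch times an arbitrary Python callable with a wall-clock timer.

```python
import math
import statistics
import time


def measure_median_ms(run_once, warmup_runs=5, timing_runs=20):
    """Median wall-clock latency of run_once in ms, or inf if anything fails."""
    try:
        for _ in range(warmup_runs):  # untimed iterations to reach steady state
            run_once()
        samples = []
        for _ in range(timing_runs):
            start = time.perf_counter()
            run_once()
            samples.append((time.perf_counter() - start) * 1000.0)
        return statistics.median(samples)
    except Exception:
        return math.inf  # mirror the documented "inf on any error" contract


calls = []
latency = measure_median_ms(lambda: calls.append(1))


def failing():
    raise RuntimeError("simulated build failure")


failed_latency = measure_median_ms(failing)
```

Returning infinity instead of raising lets a caller compare failed and successful candidates with the same numeric ordering.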

set_shapes(input_name, min_shape, opt_shape, max_shape)

Set custom min/opt/max shapes for a dynamic input.

This method allows you to specify custom shape ranges for dynamic inputs (inputs with -1 dimensions). If not specified, the benchmark will use default shapes (all -1 dimensions become 1).

Parameters:
  • input_name (str) – Name of the input tensor to configure.

  • min_shape (list) – Minimum shape for this input. List of integers.

  • opt_shape (list) – Optimal/default shape for this input. List of integers.

  • max_shape (list) – Maximum shape for this input. List of integers.
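The default-shape rule and a sanity check for a custom range can be sketched as below. The helper names are hypothetical. The elementwise min <= opt <= max condition reflects how TensorRT optimization profiles are generally constrained, and whether `set_shapes` itself validates this is not stated in the source.

```python
def default_shape(shape):
    """Replace every dynamic (-1) dimension with 1, as the default rule describes."""
    return [1 if dim == -1 else dim for dim in shape]


def is_valid_range(min_shape, opt_shape, max_shape):
    """Check equal rank and min <= opt <= max for every dimension."""
    same_rank = len(min_shape) == len(opt_shape) == len(max_shape)
    return same_rank and all(
        lo <= opt <= hi for lo, opt, hi in zip(min_shape, opt_shape, max_shape)
    )


fallback = default_shape([-1, 3, 224, 224])
ok = is_valid_range([1, 3, 224, 224], [8, 3, 224, 224], [32, 3, 224, 224])
bad = is_valid_range([8, 3, 224, 224], [1, 3, 224, 224], [32, 3, 224, 224])
```

A range that passes this check could then be supplied through set_shapes(input_name, min_shape, opt_shape, max_shape) for the corresponding dynamic input.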

class TrtExecBenchmark

Bases: Benchmark

TensorRT benchmark using trtexec command-line tool.

This implementation uses the trtexec binary to build engines and measure inference latency. It is the most straightforward method and closely mirrors standard TensorRT workflows.

__init__(timing_cache_file=None, warmup_runs=5, timing_runs=10, plugin_libraries=None, trtexec_path='trtexec', trtexec_args=None)

Initialize the trtexec benchmark.

Parameters:
  • timing_cache_file (str | None) – See Benchmark.__init__().

  • warmup_runs (int) – See Benchmark.__init__().

  • timing_runs (int) – See Benchmark.__init__().

  • plugin_libraries (list[str] | None) – See Benchmark.__init__().

  • trtexec_path (str) – Path to trtexec binary. Defaults to ‘trtexec’ which looks for the binary in PATH.

  • trtexec_args (list[str] | None) – Additional command-line arguments to pass to trtexec. These are appended after the standard arguments. Example: ['--fp16', '--workspace=4096', '--verbose']
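The "appended after the standard arguments" behavior can be sketched as simple list construction. `build_command` is a hypothetical helper, and the set of standard arguments the benchmark actually passes is not documented here, so only the model flag is shown as a stand-in for them.

```python
def build_command(model_path, trtexec_path="trtexec", trtexec_args=None):
    """Assemble an argument list with user-supplied extras placed last."""
    # Stand-in for the standard arguments; the real list is an assumption.
    command = [trtexec_path, f"--onnx={model_path}"]
    command.extend(trtexec_args or [])  # None means no extra arguments
    return command


command = build_command("model.onnx", trtexec_args=["--fp16", "--verbose"])
```

Passing the command as a list rather than a single string avoids shell quoting problems when it is handed to a subprocess call.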

run(path_or_bytes, log_file=None, flush_timing_cache=False)

Run benchmark using trtexec.

Parameters:
  • path_or_bytes (str | bytes) – Path to the ONNX model (str) or raw model data (bytes)

  • log_file (str | None) – Optional path to save trtexec logs

  • flush_timing_cache (bool)

Returns:

Measured median latency in milliseconds

Return type:

float