autotune
Pattern-Based Q/DQ Autotuning for ONNX Models.
This package provides automated optimization of Quantize/Dequantize (Q/DQ) node placement in ONNX computation graphs to minimize TensorRT inference latency. It uses pattern-based region analysis to efficiently explore and optimize Q/DQ insertion strategies.
Classes
ChildRegionInputInsertionPoint: Pattern-relative Q/DQ insertion point at a child region's input boundary (frozen/hashable).
ChildRegionOutputInsertionPoint: Pattern-relative Q/DQ insertion point at a child region or node output (frozen/hashable).
CombinedRegionSearch: Two-phase region search combining bottom-up partitioning with top-down refinement.
Config: Configuration parameters for QDQ autotuning.
InsertionScheme: Complete Q/DQ insertion specification for a region pattern.
NodeInputInsertionPoint: Pattern-relative Q/DQ insertion point at a node's input (frozen/hashable).
PatternCache: Pattern cache containing best-performing schemes for patterns with automatic eviction.
PatternSchemes: Collection of Q/DQ insertion schemes for a single pattern.
QDQAutotuner: Q/DQ autotuner with automatic region discovery around compute-intensive ops.
Region: A subgraph region in an ONNX graph, used as the unit for Q/DQ insertion.
RegionPattern: Represents a structural pattern of a region.
RegionType: Region type enumeration for hierarchical graph structure.
ResolvedInsertionPoint: Resolved Q/DQ insertion point with actual tensor name and optional node context.
TensorRTPyBenchmark: TensorRT benchmark using Python API with plugin support.
TrtExecBenchmark: TensorRT benchmark using trtexec command-line tool.
- exception AutotunerError
Bases: Exception
Base exception for autotuner-related errors.
- exception AutotunerNotInitializedError
Bases: AutotunerError
Exception raised when autotuner is used without initialization.
- class ChildRegionInputInsertionPoint
Bases: InsertionPoint
Pattern-relative Q/DQ insertion point at a child region’s input boundary (frozen/hashable).
Specifies where to insert Q/DQ pairs at the input boundaries of child regions within COMPOSITE regions. This allows parent regions to control quantization at child boundaries, potentially overriding or complementing child region optimizations.
Only applies to COMPOSITE regions; LEAF regions have no children.
This class is immutable (frozen) to allow safe use in sets and as dict keys.
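The following is a minimal sketch of what the frozen/hashable property buys, assuming the class behaves like a frozen dataclass. `SketchChildInput` is a hypothetical stand-in that models only the two documented fields; it is not the library class.

```python
from dataclasses import dataclass, FrozenInstanceError


# Hypothetical stand-in for ChildRegionInputInsertionPoint.
@dataclass(frozen=True)
class SketchChildInput:
    region_index: int
    input_index: int


a = SketchChildInput(region_index=0, input_index=1)
b = SketchChildInput(region_index=0, input_index=1)

# Equal field values give equal hashes, so duplicates collapse in a set.
unique_points = {a, b}

# Frozen instances reject attribute assignment.
try:
    a.input_index = 2
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Because equality and hashing depend only on the indices, two points built independently for the same pattern position compare equal, which is what makes set-based scheme comparison possible.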
- __init__(region_index, input_index)
- Parameters:
region_index (int)
input_index (int)
- Return type:
None
- static collect_from_region(region, graph)
Collect all valid child region input insertion points from a region.
- Parameters:
region (Region)
graph (Graph)
- Return type:
- classmethod from_dict(data)
Create from dictionary.
- Parameters:
data (dict[str, Any])
- Return type:
ChildRegionInputInsertionPoint
- input_index: int
- region_index: int
- resolve(region, graph)
Resolve a child region input insertion point to actual tensor names.
- Parameters:
region (Region)
graph (Graph)
- Return type:
- to_dict()
Convert to dictionary for serialization.
- Return type:
dict[str, Any]
- class ChildRegionOutputInsertionPoint
Bases: InsertionPoint
Pattern-relative Q/DQ insertion point at a child region or node output (frozen/hashable).
Specifies where to insert Q/DQ pairs at output boundaries. This can be either:
1. Output from a child region (in COMPOSITE regions)
2. Output from a node within the region
This class is immutable (frozen) to allow safe use in sets and as dict keys.
- __init__(region_index, node_index, output_index)
- Parameters:
region_index (int | None)
node_index (int | None)
output_index (int)
- Return type:
None
- static collect_from_region(region, graph)
Collect all valid region output insertion points from a region.
- Parameters:
region (Region)
graph (Graph)
- Return type:
- classmethod from_dict(data)
Create from dictionary.
- Parameters:
data (dict[str, Any])
- Return type:
ChildRegionOutputInsertionPoint
- node_index: int | None
- output_index: int
- region_index: int | None
- resolve(region, graph)
Resolve a region output insertion point to actual tensor names.
- Parameters:
region (Region)
graph (Graph)
- Return type:
- to_dict()
Convert to dictionary for serialization.
- Return type:
dict[str, Any]
- class CombinedRegionSearch
Bases: RegionSearchBase
Two-phase region search combining bottom-up partitioning with top-down refinement.
This class implements a region discovery algorithm that combines two complementary strategies, bottom-up partitioning and top-down refinement, to create well-formed, hierarchical regions from an ONNX computation graph.
- __init__(graph, maximum_sequence_region_size=10, minimum_topdown_search_size=10)
Initialize CombinedRegionSearch for a given ONNX graph.
- Parameters:
graph (Graph)
maximum_sequence_region_size (int)
minimum_topdown_search_size (int)
- class Config
Bases: object
Configuration parameters for QDQ autotuning.
Controls the autotuning process including performance requirements, quantization parameters, region building, scheme generation, and finetuning behavior.
Key fields and their defaults:
verbose: Enable detailed logging of autotuning progress (default: False).
performance_threshold: Minimum speedup ratio to accept a scheme; 1.0 = no improvement required, 1.02 = 2% improvement (default: 1.02).
default_q_scale: Default scale for Q/DQ nodes; typical range 0.01-0.1 (default: 0.1).
default_q_zero_point: Zero-point for Q/DQ; 0 for int8, 128 for uint8 (default: 0).
default_quant_type: Quantization type; "int8" (default) or "fp8".
default_dq_dtype: Dtype for DequantizeLinear output; "float32" (default) or "float16".
maximum_sequence_region_size: Max nodes in a sequence region (default: 10).
minimum_topdown_search_size: Min nodes to trigger top-down search (default: 10).
top_percent_to_mutate: Top fraction of schemes used as mutation seeds (default: 0.1).
minimum_schemes_to_mutate: Min schemes to keep as mutation seeds (default: 10).
maximum_mutations: Max mutations per scheme during generation (default: 3).
maximum_generation_attempts: Max attempts to generate a unique scheme (default: 100).
pattern_cache_minimum_distance: Min edit distance between cached schemes (default: 4).
pattern_cache_max_entries_per_pattern: Max schemes per pattern in cache (default: 32).
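As an illustration of `performance_threshold`, the sketch below assumes the speedup ratio is computed as baseline latency divided by scheme latency; the text above does not spell out the formula, so treat that definition and the helper name `accepts_scheme` as assumptions.

```python
def accepts_scheme(baseline_ms: float, scheme_ms: float, threshold: float = 1.02) -> bool:
    """Return True when the assumed speedup ratio meets the threshold."""
    # Unmeasured (inf) or non-positive latencies can never be accepted.
    if scheme_ms <= 0 or scheme_ms == float("inf"):
        return False
    speedup = baseline_ms / scheme_ms  # assumed definition of "speedup ratio"
    return speedup >= threshold


# 10.0 / 9.5 is about 1.053, above the default 1.02 threshold.
fast = accepts_scheme(baseline_ms=10.0, scheme_ms=9.5)
# 10.0 / 9.9 is about 1.010, below 1.02 but above 1.0.
marginal_default = accepts_scheme(baseline_ms=10.0, scheme_ms=9.9)
marginal_relaxed = accepts_scheme(baseline_ms=10.0, scheme_ms=9.9, threshold=1.0)
```

Under this reading, the default of 1.02 filters out schemes whose gain is within roughly 2% of the baseline, which may help avoid accepting changes that are indistinguishable from measurement noise.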
- __init__(verbose=False, performance_threshold=1.02, default_q_scale=0.1, default_q_zero_point=0, default_quant_type='int8', default_dq_dtype='float32', maximum_sequence_region_size=10, minimum_topdown_search_size=10, top_percent_to_mutate=0.1, minimum_schemes_to_mutate=10, maximum_mutations=3, maximum_generation_attempts=100, pattern_cache_minimum_distance=4, pattern_cache_max_entries_per_pattern=32)
- Parameters:
verbose (bool)
performance_threshold (float)
default_q_scale (float)
default_q_zero_point (int)
default_quant_type (str)
default_dq_dtype (str)
maximum_sequence_region_size (int)
minimum_topdown_search_size (int)
top_percent_to_mutate (float)
minimum_schemes_to_mutate (int)
maximum_mutations (int)
maximum_generation_attempts (int)
pattern_cache_minimum_distance (int)
pattern_cache_max_entries_per_pattern (int)
- Return type:
None
- default_dq_dtype: str = 'float32'
- default_q_scale: float = 0.1
- default_q_zero_point: int = 0
- default_quant_type: str = 'int8'
- maximum_generation_attempts: int = 100
- maximum_mutations: int = 3
- maximum_sequence_region_size: int = 10
- minimum_schemes_to_mutate: int = 10
- minimum_topdown_search_size: int = 10
- pattern_cache_max_entries_per_pattern: int = 32
- pattern_cache_minimum_distance: int = 4
- performance_threshold: float = 1.02
- top_percent_to_mutate: float = 0.1
- verbose: bool = False
- class InsertionScheme
Bases: object
Complete Q/DQ insertion specification for a region pattern.
An InsertionScheme defines a complete Q/DQ configuration for a pattern, combining both node-level and region-level insertion points. The scheme is applied to all regions matching the pattern.
- __init__(node_inputs=<factory>, child_region_inputs=<factory>, region_outputs=<factory>, latency_ms=inf, error=False, profile_timestamp=None)
- Parameters:
node_inputs (list[NodeInputInsertionPoint])
child_region_inputs (list[ChildRegionInputInsertionPoint])
region_outputs (list[ChildRegionOutputInsertionPoint])
latency_ms (float)
error (bool)
profile_timestamp (str | None)
- Return type:
None
- child_region_inputs: list[ChildRegionInputInsertionPoint]
- distance(other)
Compute edit distance between this scheme and another scheme.
The edit distance is the minimum number of add/remove operations needed to transform this scheme into the other scheme. This is computed as the symmetric difference between the insertion point sets.
- Parameters:
other (InsertionScheme) – InsertionScheme to compare against
- Returns:
Total edit distance (number of add + remove operations)
- Return type:
int
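The symmetric-difference definition above can be sketched as follows. `SketchScheme` is a hypothetical stand-in that models each insertion point as a plain tuple of indices; the real class stores lists of insertion-point objects.

```python
from dataclasses import dataclass, field


@dataclass
class SketchScheme:
    # Each insertion point is modeled as a hashable tuple of indices.
    node_inputs: set = field(default_factory=set)
    child_region_inputs: set = field(default_factory=set)
    region_outputs: set = field(default_factory=set)

    def distance(self, other: "SketchScheme") -> int:
        # The symmetric difference holds points present in exactly one
        # scheme, i.e. the points that must be added or removed.
        return (
            len(self.node_inputs ^ other.node_inputs)
            + len(self.child_region_inputs ^ other.child_region_inputs)
            + len(self.region_outputs ^ other.region_outputs)
        )


s1 = SketchScheme(node_inputs={(0, 0), (1, 0)}, region_outputs={(None, 2, 0)})
s2 = SketchScheme(node_inputs={(0, 0), (2, 1)})
d = s1.distance(s2)  # 2 differing node inputs + 1 differing region output
```

Here `d` is 3: remove `(1, 0)`, add `(2, 1)`, and remove the single region output.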
- error: bool = False
- classmethod from_dict(data)
Create InsertionScheme from serialized dictionary.
- Parameters:
data (dict[str, Any])
- Return type:
InsertionScheme
- property hash: str
Compute deterministic hash for scheme identity.
The hash uniquely identifies this scheme configuration based on its insertion points. Two schemes with identical insertion points produce the same hash, regardless of their measured latencies.
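One way such a hash could be built is sketched below. The choice of SHA-256 over a sorted JSON encoding is an assumption made for illustration; the actual algorithm is not stated here, and `scheme_hash` is a hypothetical helper.

```python
import hashlib
import json


def scheme_hash(node_inputs, child_region_inputs, region_outputs) -> str:
    # Sort each group so the result does not depend on list order.
    canonical = {
        "node_inputs": sorted(node_inputs),
        "child_region_inputs": sorted(child_region_inputs),
        "region_outputs": sorted(region_outputs),
    }
    # Latency and error fields are deliberately left out of the payload.
    payload = json.dumps(canonical, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


h1 = scheme_hash([(1, 0), (0, 0)], [], [(0, 0, 0)])
h2 = scheme_hash([(0, 0), (1, 0)], [], [(0, 0, 0)])  # same points, reordered
h3 = scheme_hash([(0, 0)], [], [(0, 0, 0)])          # one point fewer
```

`h1` and `h2` are equal while `h3` differs, mirroring the property that identity depends only on the insertion points.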
- property is_empty: bool
Check if this is a baseline scheme with no Q/DQ insertions.
- property is_profiled: bool
Check if this scheme has been profiled (measured).
A scheme is considered profiled if it has been measured (has non-infinite latency) or has encountered an error during measurement.
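The rule above reduces to a short predicate. This is a sketch over the two documented fields, written as a free function rather than the library's property.

```python
import math


def is_profiled(latency_ms: float = math.inf, error: bool = False) -> bool:
    # Profiled means either a finite measurement exists or the
    # measurement attempt ended in an error.
    return error or not math.isinf(latency_ms)


fresh = is_profiled()                  # default state: not yet measured
measured = is_profiled(latency_ms=3.2)
failed = is_profiled(error=True)       # latency stays inf, but it was tried
```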
- latency_ms: float = inf
- node_inputs: list[NodeInputInsertionPoint]
- profile_timestamp: str | None = None
- region_outputs: list[ChildRegionOutputInsertionPoint]
- to_dict()
Convert to dictionary for serialization.
- Return type:
dict[str, Any]
- exception InvalidSchemeError
Bases: AutotunerError
Exception raised when an invalid scheme is referenced.
- class NodeInputInsertionPoint
Bases: InsertionPoint
Pattern-relative Q/DQ insertion point at a node’s input (frozen/hashable).
Specifies where to insert a Q/DQ pair within a region pattern using pattern-relative indices rather than absolute node IDs. This enables insertion scheme reuse across all regions matching the same pattern.
This class is immutable (frozen) to allow safe use in sets and as dict keys.
- __init__(node_index, input_index)
- Parameters:
node_index (int)
input_index (int)
- Return type:
None
- static collect_from_region(region, graph)
Collect all valid node input insertion points from a region.
- Parameters:
region (Region)
graph (Graph)
- Return type:
list[NodeInputInsertionPoint]
- classmethod from_dict(data)
Create from dictionary.
- Parameters:
data (dict[str, Any])
- Return type:
NodeInputInsertionPoint
- input_index: int
- node_index: int
- resolve(region, graph)
Resolve a node input insertion point to actual tensor names for a matching region.
- Parameters:
region (Region)
graph (Graph)
- Return type:
- to_dict()
Convert to dictionary for serialization.
- Return type:
dict[str, Any]
- class PatternCache
Bases: object
Pattern cache containing best-performing schemes for patterns with automatic eviction.
Stores a collection of PatternSchemes that can be used as seeds for autotuning. Each PatternSchemes contains high-performing insertion schemes for a specific pattern signature. The cache automatically evicts non-performant schemes based on:
- Error status (schemes with errors are evicted)
- Duplicate schemes (only the better-performing duplicate is kept)
- Similarity (among similar schemes, only the better-performing one is kept)
- Count limit (only the top N best schemes are kept per pattern)
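The four eviction criteria can be sketched as a single pass over latency-sorted candidates. This is an illustration, not the library's implementation: schemes are modeled as `(points, latency_ms, error)` tuples, and "similar" is assumed to mean an edit distance below `minimum_distance`.

```python
import math


def evict(schemes, minimum_distance=4, max_entries=32):
    """schemes: list of (points: frozenset, latency_ms: float, error: bool)."""
    # 1. Drop schemes that failed during measurement.
    valid = [s for s in schemes if not s[2]]
    # 2. Visit best-first so every survivor outperforms later candidates.
    valid.sort(key=lambda s: s[1])
    kept = []
    for points, latency, error in valid:
        # 3. Skip duplicates (distance 0) and near-duplicates of a better scheme.
        if any(len(points ^ k[0]) < minimum_distance for k in kept):
            continue
        kept.append((points, latency, error))
    # 4. Keep only the top N per pattern.
    return kept[:max_entries]


candidates = [
    (frozenset({1, 2, 3}), 5.0, False),
    (frozenset({1, 2, 3}), 4.0, False),             # duplicate, better latency
    (frozenset({1, 2, 4}), 4.5, False),             # distance 2 from the best
    (frozenset({7, 8, 9, 10}), 6.0, False),         # distance 7 from the best
    (frozenset({20, 21, 22, 23}), math.inf, True),  # errored
]
survivors = evict(candidates)
```

Two schemes survive: the 4.0 ms variant of `{1, 2, 3}` and the structurally distant `{7, 8, 9, 10}`. Keeping distant schemes rather than only the fastest ones preserves diversity among the seeds.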
- __init__(pattern_schemes=<factory>, minimum_distance=4, max_entries_per_pattern=32)
- Parameters:
pattern_schemes (list[PatternSchemes])
minimum_distance (int)
max_entries_per_pattern (int)
- Return type:
None
- add_pattern_from_region(region, graph, quantized_tensors)
Build and add a pattern cache entry from a region in a quantized model.
Analyzes a region from an already-quantized model to extract its Q/DQ insertion scheme. This allows capturing known-good quantization strategies from existing models and using them as seeds for autotuning.
- Parameters:
region (Region) – Region from the quantized model to analyze
graph (Graph) – ONNX graph containing the region
quantized_tensors (set[str]) – Set of tensor names that have Q/DQ nodes
- Return type:
None
Example
>>> cache = PatternCache()
>>> for region in all_regions:
...     cache.add_pattern_from_region(region, graph, quantized_tensors)
>>> cache.save("learned_patterns.yaml")
- add_pattern_schemes(pattern_schemes)
Add PatternSchemes to pattern cache with automatic eviction of non-performant entries.
Merges new schemes with existing schemes for the same pattern, automatically evicting schemes that are non-performant based on multiple criteria.
- Parameters:
pattern_schemes (PatternSchemes) – PatternSchemes to add to the cache
- Return type:
None
- classmethod from_dict(data)
Create PatternCache from serialized dictionary.
Note: RegionPattern objects are not restored (they’re runtime objects). Only pattern signatures and scheme data are loaded.
- Parameters:
data (dict[str, Any]) – Dictionary containing pattern cache data
- Returns:
Reconstructed PatternCache instance
- Return type:
PatternCache
- get_pattern_schemes(pattern_signature)
Get PatternSchemes for a specific pattern signature.
- Parameters:
pattern_signature (str) – Pattern signature to lookup
- Returns:
PatternSchemes if found, None otherwise
- Return type:
PatternSchemes | None
- has_pattern(pattern_signature)
Check if pattern cache contains a specific pattern.
- Parameters:
pattern_signature (str) – Pattern signature to check
- Returns:
True if pattern exists in pattern cache
- Return type:
bool
- classmethod load(input_path)
Load pattern cache from a YAML file.
Reads a previously saved pattern cache file and reconstructs all pattern schemes. The loaded pattern cache can be used to seed autotuning with known-good insertion schemes.
- Parameters:
input_path (str) – File path to the YAML pattern cache file to load
- Returns:
PatternCache instance with all pattern schemes loaded
- Raises:
FileNotFoundError – If the input_path doesn’t exist
- Return type:
PatternCache
- max_entries_per_pattern: int = 32
- merge(other, prefer_existing=True)
Merge another PatternCache into this one.
- Parameters:
other (PatternCache) – PatternCache to merge
prefer_existing (bool) – If True, keep existing patterns when there’s a conflict. If False, overwrite with other’s patterns.
- Return type:
None
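The effect of `prefer_existing` can be sketched with plain dictionaries keyed by pattern signature. `merge_caches` is a hypothetical helper; unlike the documented method, which returns None and so presumably updates the cache in place, it returns a new dict to keep the example easy to inspect.

```python
def merge_caches(existing: dict, other: dict, prefer_existing: bool = True) -> dict:
    merged = dict(existing)
    for signature, schemes in other.items():
        if signature in merged and prefer_existing:
            continue  # conflict: keep what is already cached
        merged[signature] = schemes
    return merged


mine = {"Conv->Relu": ["scheme_a"]}
theirs = {"Conv->Relu": ["scheme_b"], "MatMul->Add": ["scheme_c"]}

kept = merge_caches(mine, theirs, prefer_existing=True)
overwritten = merge_caches(mine, theirs, prefer_existing=False)
```

In both cases the non-conflicting `MatMul->Add` entry is added; only the handling of `Conv->Relu` differs.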
- minimum_distance: int = 4
- property num_patterns: int
Get number of patterns in pattern cache.
- pattern_schemes: list[PatternSchemes]
- save(output_path)
Save pattern cache to a YAML file.
Serializes all pattern schemes and their insertion points to a YAML file that can be loaded later for seeded autotuning. The format matches the autotuner state file format for consistency.
- Parameters:
output_path (str) – File path where the YAML pattern cache file will be written
- Return type:
None
- to_dict()
Convert to dictionary for serialization.
- Returns:
Dictionary with ‘minimum_distance’, ‘max_entries_per_pattern’, and ‘pattern_schemes’ keys
- Return type:
dict[str, Any]
- property total_schemes: int
Get total number of schemes across all patterns.
- class PatternSchemes
Bases: object
Collection of Q/DQ insertion schemes for a single pattern.
Manages multiple InsertionScheme candidates for a region pattern, tracking their performance and identifying the best-performing configuration. This enables pattern-based optimization where all regions with the same structure use the same Q/DQ insertion strategy.
Workflow:
1. Pattern is identified from region structure
2. Multiple schemes are generated and tested
3. Each scheme is measured (latency_ms)
4. Best scheme is selected (lowest latency)
5. Best scheme is applied to all matching regions
Best Scheme Selection:
- Automatically identifies scheme with lowest latency
- Excludes schemes with errors (error=True)
- Schemes with latency_ms = inf are considered unmeasured
- best_scheme property provides easy access to optimal configuration
- Attributes:
pattern: RegionPattern defining the structural signature
schemes: List of InsertionScheme candidates with measurements
- __init__(pattern=None, schemes=<factory>)
- Parameters:
pattern (RegionPattern | None)
schemes (list[InsertionScheme])
- Return type:
None
- property best_scheme: InsertionScheme | None
Get the best performing scheme (lowest latency).
Scans all schemes to find the one with minimum latency_ms, excluding schemes with errors.
- Returns:
InsertionScheme with lowest latency (excluding error schemes), or None if no valid schemes exist
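The selection rule can be sketched as a filtered minimum. Schemes are modeled here as plain dicts with the documented `latency_ms` and `error` fields; how the library treats a pattern whose only error-free schemes are still unmeasured is not stated, so this sketch simply returns whichever has the lowest `latency_ms`.

```python
import math


def best_scheme(schemes):
    """schemes: list of dicts with 'latency_ms' and 'error' keys."""
    # Errored schemes are never eligible, whatever their latency.
    valid = [s for s in schemes if not s["error"]]
    if not valid:
        return None
    return min(valid, key=lambda s: s["latency_ms"])


schemes = [
    {"name": "baseline", "latency_ms": 5.0, "error": False},
    {"name": "broken", "latency_ms": 1.0, "error": True},
    {"name": "quantized", "latency_ms": 3.8, "error": False},
    {"name": "pending", "latency_ms": math.inf, "error": False},
]
winner = best_scheme(schemes)
```

`winner` is the `quantized` entry: `broken` reports a lower latency but is excluded because of its error flag.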
- classmethod from_dict(data, pattern=None)
Create PatternSchemes from serialized dictionary.
Reconstructs the pattern schemes collection from saved data. The RegionPattern object must be provided separately since it’s not serialized (it’s a runtime object computed from the graph).
If no pattern is provided, creates a minimal RegionPattern from the saved signature and size for signature matching purposes.
- Parameters:
data (dict[str, Any]) – Dictionary containing ‘pattern_signature’, ‘pattern_size’, and ‘schemes’ keys
pattern (RegionPattern | None) – RegionPattern object to associate (must match signature). If None, creates minimal pattern from saved data.
- Returns:
Reconstructed PatternSchemes instance
- Return type:
PatternSchemes
- property num_schemes: int
Get total number of schemes.
- pattern: RegionPattern | None = None
- property pattern_signature: str
Get the pattern signature string.
- property pattern_size: int
Get the pattern size (total node count).
- schemes: list[InsertionScheme]
- to_dict()
Convert to dictionary for serialization.
Note: Excludes runtime objects like pattern (RegionPattern). Only serializes metadata and schemes.
- Return type:
dict[str, Any]
- class QDQAutotuner
Bases: QDQAutotunerBase
Q/DQ autotuner with automatic region discovery around compute-intensive ops.
- initialize(config=None, pattern_cache=None)
Initialize autotuner and discover optimization regions automatically.
Extends base class initialization by automatically searching for regions after configuration is set up. Regions are discovered using pattern-based search around compute-intensive operations.
- Parameters:
config (Config | None)
pattern_cache (PatternCache | None)
- Return type:
None
- class Region
Bases: object
A subgraph region in an ONNX graph, used as the unit for Q/DQ insertion.
Regions form a hierarchy: ROOT contains the entire graph, COMPOSITE regions contain child regions, and LEAF regions contain only nodes. Each region tracks its direct nodes, input/output tensors, and a pattern signature for matching regions with identical structure.
- __init__(region_id, level, region_type)
Initialize a new region.
- Parameters:
region_id (int) – Unique identifier within the region hierarchy
level (int) – Hierarchical level (0 = leaf, higher = more composite)
region_type (RegionType) – Type classification (LEAF, COMPOSITE, or ROOT)
- contains_node(node_index)
Check if region contains a specific node (direct only).
- Parameters:
node_index (int)
- Return type:
bool
- contains_node_within_region_and_descendants(node_index)
Check if region contains a node recursively.
- Parameters:
node_index (int)
- Return type:
bool
- get_children(*, sort=False)
Get all child regions. If sort is True, sort the children by level and size.
- Parameters:
sort (bool) – Whether to sort the children by level and size
- Returns:
List of child regions
- Return type:
list[Region]
- get_nodes(*, sort=False)
Get direct node indices in this region only.
- Parameters:
sort (bool)
- Return type:
list[int]
- get_region_nodes_and_descendants(_visited=None)
Get all node indices recursively, including descendants.
- Parameters:
_visited (set[int] | None)
- Return type:
set[int]
- get_size_of_region_and_descendants(_visited=None)
Get total node count recursively including all descendants.
- Parameters:
_visited (set[int] | None)
- Return type:
int
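The difference between direct and recursive node queries can be sketched with a small stand-in class. `SketchRegion` is hypothetical, and the sketch assumes `_visited` holds the ids of regions already expanded; the reference above gives only its type.

```python
class SketchRegion:
    def __init__(self, region_id, nodes=(), children=()):
        self.region_id = region_id
        self.nodes = set(nodes)          # direct node indices only
        self.children = list(children)   # child regions

    def nodes_and_descendants(self, _visited=None):
        # _visited guards against expanding the same region twice.
        if _visited is None:
            _visited = set()
        if self.region_id in _visited:
            return set()
        _visited.add(self.region_id)
        result = set(self.nodes)
        for child in self.children:
            result |= child.nodes_and_descendants(_visited)
        return result


leaf_a = SketchRegion(1, nodes=[0, 1])
leaf_b = SketchRegion(2, nodes=[2])
composite = SketchRegion(3, nodes=[3], children=[leaf_a, leaf_b])

direct = composite.nodes                           # {3}
everything = composite.nodes_and_descendants()     # {0, 1, 2, 3}
total_size = len(everything)
```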
- class RegionPattern
Bases: object
Represents a structural pattern of a region.
The pattern captures the topology and operation types in a region, enabling pattern matching and region comparison. Patterns are hashable and can be used as dictionary keys for efficient grouping and lookup.
- __init__(signature, size)
Initialize a region pattern.
- Parameters:
signature (str) – The structural signature of the region.
size (int) – The number of nodes in the region.
- format_tree(region, graph, indent=0)
Format this pattern and region as a human-readable tree.
Useful for debugging and visualization.
- Parameters:
region (Region) – The region associated with this pattern
graph (Graph) – The ONNX graph
indent (int) – Indentation level
- Returns:
Formatted string representation
- Return type:
str
- classmethod from_region(region, graph)
Compute a structural pattern for a region.
The pattern captures:
- Direct node operations in the region
- Structure of sub-regions (recursively)
- Handles symmetric operations consistently
- Sorts sub-regions by size for determinism
- Parameters:
region (Region) – The region to compute pattern for
graph (Graph) – The ONNX graph containing the nodes
- Returns:
RegionPattern object containing the signature and metadata
- Return type:
RegionPattern
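To illustrate why sorting sub-regions makes signatures deterministic, here is a toy signature builder. The signature format, the tie-breaking rule, and the dict-based region model are all invented for this sketch; the library's actual signature syntax is not shown in this reference.

```python
def toy_size(region: dict) -> int:
    # Total op count of a region including all nested children.
    return len(region["ops"]) + sum(toy_size(c) for c in region.get("children", []))


def toy_signature(region: dict) -> str:
    """region: {'ops': [op type names], 'children': [region dicts]}"""
    ops = "->".join(region["ops"])
    children = region.get("children", [])
    if not children:
        return ops
    # Order children by size (largest first), then by signature, so the
    # result does not depend on the order in which children were added.
    ordered = sorted(
        ((toy_size(c), toy_signature(c)) for c in children),
        key=lambda item: (-item[0], item[1]),
    )
    inner = ",".join(sig for _, sig in ordered)
    return f"{ops}[{inner}]"


conv_block = {"ops": ["Conv", "Relu"]}
matmul_block = {"ops": ["MatMul"]}
r1 = {"ops": ["Add"], "children": [conv_block, matmul_block]}
r2 = {"ops": ["Add"], "children": [matmul_block, conv_block]}

sig1 = toy_signature(r1)
sig2 = toy_signature(r2)
```

Both regions yield `Add[Conv->Relu,MatMul]`, so they would be grouped under one pattern and share one insertion scheme.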
- get_full_insertion_scheme(region, graph)
Collect all possible insertion points for quantization in a region.
This method gathers all locations where Q/DQ nodes could be inserted within a region’s computational graph. These insertion points are organized into three categories:
- node_inputs: Inputs to individual nodes within the region
- child_region_inputs: Inputs to child regions within composite regions
- region_outputs: Outputs from the region or its child regions
- Parameters:
region (Region) – The region to collect insertion points for
graph (Graph) – The ONNX graph containing the nodes
- Returns:
InsertionScheme object containing the insertion points
- Return type:
InsertionScheme
- get_hash()
Get a 128-bit cryptographic hash of the pattern signature.
- Return type:
str
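A 128-bit digest of a signature string can be produced as sketched below. BLAKE2b with a 16-byte digest is an assumption chosen for illustration; this reference does not say which hash function `get_hash` uses.

```python
import hashlib


def pattern_hash(signature: str) -> str:
    # digest_size=16 bytes yields a 128-bit digest (32 hex characters).
    return hashlib.blake2b(signature.encode("utf-8"), digest_size=16).hexdigest()


digest = pattern_hash("Conv->Relu")
same = pattern_hash("Conv->Relu")
other = pattern_hash("Conv->Add")
```

A fixed-length digest is convenient as a compact key in logs or file names when full signatures grow long.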
- get_short_signature(max_length=80)
Get a truncated version of the signature for display purposes.
- Parameters:
max_length (int)
- Return type:
str
- property is_composite: bool
Check if the pattern represents a composite region.
- property is_empty: bool
Check if the pattern represents an empty region.
- property is_leaf: bool
Check if the pattern represents a leaf region (no composite structure).
- matches(other: RegionPattern) -> bool
- matches(other: Region, graph: Graph, scheme: None = None) -> list[int] | None
- matches(other: Region, graph: Graph, scheme: InsertionScheme) -> set[ResolvedInsertionPoint]
Check if this pattern matches another pattern or region.
This method provides three distinct behaviors depending on the arguments:
Pattern-to-pattern comparison (other is RegionPattern, scheme is None): Returns bool indicating structural equivalence.
Pattern-to-region matching (other is Region, scheme is None): Returns list of node IDs in pattern order if match succeeds, None otherwise.
Pattern-to-region with insertion scheme (other is Region, scheme provided): Returns set of resolved insertion points where Q/DQ should be inserted, considering:
- NodeInputInsertionPoints from the scheme (node-level Q/DQ)
- ChildRegionInputInsertionPoints from the scheme (child region input Q/DQ)
- ChildRegionOutputInsertionPoints from the scheme (region output Q/DQ)
Returns empty set if pattern doesn’t match.
- Parameters:
other – Either a RegionPattern or Region to compare with
graph – Required when other is a Region (for computing its pattern)
scheme – Optional InsertionScheme containing node_inputs, child_region_inputs, and region_outputs to resolve to tensor names
- Returns:
True if other is RegionPattern and patterns match
List of node IDs in pattern order if other is Region and scheme is None, None if no match
Set of resolved insertion points for Q/DQ insertion if other is Region and scheme is provided
- Raises:
ValueError – If other is Region but graph is not provided, or if scheme is provided but other is not a Region
TypeError – If other is neither RegionPattern nor Region
- class RegionType
Bases: Enum
Region type enumeration for hierarchical graph structure.
LEAF: Atomic region containing direct nodes with no child regions
COMPOSITE: Hierarchical region containing child regions (and optionally direct nodes)
ROOT: Top-level region encompassing the entire computation graph
- COMPOSITE = 'COMPOSITE'
- LEAF = 'LEAF'
- ROOT = 'ROOT'
- class ResolvedInsertionPoint
Bases: object
Resolved Q/DQ insertion point with actual tensor name and optional node context.
After resolving pattern-relative insertion points, this class represents the actual location where Q/DQ pairs should be inserted in the graph. It contains the tensor name and the node index (if applicable) and input index (if applicable).
This class is immutable (frozen) to allow safe use in sets and as dict keys.
- __init__(tensor_name, node_index=None, input_index=None)
- Parameters:
tensor_name (str)
node_index (int | None)
input_index (int | None)
- Return type:
None
- classmethod from_dict(data)
Create from dictionary.
- Parameters:
data (dict[str, Any])
- Return type:
ResolvedInsertionPoint
- input_index: int | None = None
- node_index: int | None = None
- tensor_name: str
- to_dict()
Convert to dictionary for serialization.
- Return type:
dict[str, Any]
- class TensorRTPyBenchmark
Bases: Benchmark
TensorRT benchmark using Python API with plugin support.
This implementation directly uses the TensorRT Python API to build engines and measure inference latency. It provides more control than trtexec and can be faster for certain workflows as it avoids subprocess overhead.
- __init__(timing_cache_file=None, warmup_runs=5, timing_runs=20, plugin_libraries=None)
Initialize the TensorRT Python API benchmark.
Creates persistent TensorRT objects (Logger, Builder, Runtime) and loads the timing cache from disk if available. Optionally loads custom TensorRT plugin libraries for models with custom operations.
- Parameters:
timing_cache_file (str | None) – Path to TensorRT timing cache file. If None, defaults to ‘/tmp/trtexec_timing.cache’.
warmup_runs (int) – Number of warmup iterations before timing measurements.
timing_runs (int) – Number of iterations for latency measurement.
plugin_libraries (list[str] | None) – List of paths to TensorRT plugin shared libraries (.so files). These plugins will be loaded and registered for use during engine building. If None, no custom plugins are loaded.
- Raises:
ImportError – If tensorrt or cuda-python (cudart) packages are not available.
FileNotFoundError – If a specified plugin library file does not exist.
RuntimeError – If plugin library loading fails.
- run(path_or_bytes, log_file=None, flush_timing_cache=False)
Run benchmark using TensorRT Python API.
- Parameters:
path_or_bytes (str | bytes) – Path to the ONNX model (str) or raw model data (bytes)
log_file (str | None) – Optional path to save benchmark logs
flush_timing_cache (bool) – If True, save the timing cache to disk after engine build.
- Returns:
Measured median latency in milliseconds, or float("inf") on any error (e.g. build failure, deserialization failure, buffer/stream allocation failure).
- Return type:
float
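The measurement contract (warmup, timed runs, median, inf on failure) can be sketched with host-side timers. This is an illustration only: `measure_median_ms` is a hypothetical helper that times an arbitrary Python callable with `time.perf_counter`, whereas a TensorRT benchmark would time GPU inference.

```python
import statistics
import time


def measure_median_ms(run_once, warmup_runs=5, timing_runs=20) -> float:
    try:
        # Warmup iterations are executed but not recorded.
        for _ in range(warmup_runs):
            run_once()
        samples = []
        for _ in range(timing_runs):
            start = time.perf_counter()
            run_once()
            samples.append((time.perf_counter() - start) * 1000.0)
        return statistics.median(samples)
    except Exception:
        # Mirror the documented contract: any failure maps to inf.
        return float("inf")


def failing_workload():
    raise RuntimeError("engine build failed")


latency = measure_median_ms(lambda: sum(range(1000)))
failed = measure_median_ms(failing_workload)
```

Returning inf instead of raising lets the caller treat a failed scheme as simply the worst candidate, and the median makes the result less sensitive to occasional slow iterations than a mean would be.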
- set_shapes(input_name, min_shape, opt_shape, max_shape)
Set custom min/opt/max shapes for a dynamic input.
This method allows you to specify custom shape ranges for dynamic inputs (inputs with -1 dimensions). If not specified, the benchmark will use default shapes (all -1 dimensions become 1).
- Parameters:
input_name (str) – Name of the input tensor to configure.
min_shape (list) – Minimum shape for this input. List of integers.
opt_shape (list) – Optimal/default shape for this input. List of integers.
max_shape (list) – Maximum shape for this input. List of integers.
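The default-shape rule can be sketched as follows. `resolve_shapes` is a hypothetical helper, and using the same shape for min, opt, and max in the default case is an assumption; the text above only says that -1 dimensions become 1.

```python
def resolve_shapes(dims, custom=None):
    """Return min/opt/max shapes for an input whose declared dims may contain -1."""
    if custom is not None:
        # Shapes configured by the user take precedence.
        return custom
    # Default: every dynamic (-1) dimension collapses to 1.
    default = [1 if d == -1 else d for d in dims]
    return {"min": list(default), "opt": list(default), "max": list(default)}


declared = [-1, 3, 224, 224]  # dynamic batch dimension
defaults = resolve_shapes(declared)
tuned = resolve_shapes(
    declared,
    custom={"min": [1, 3, 224, 224], "opt": [8, 3, 224, 224], "max": [32, 3, 224, 224]},
)
```

Benchmarking at batch size 1 can be unrepresentative for models deployed with larger batches, which is the main reason to supply explicit shapes.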
- class TrtExecBenchmark
Bases: Benchmark
TensorRT benchmark using trtexec command-line tool.
This implementation uses the trtexec binary to build engines and measure inference latency. It is the most straightforward method and closely mirrors standard TensorRT workflows.
- __init__(timing_cache_file=None, warmup_runs=5, timing_runs=10, plugin_libraries=None, trtexec_path='trtexec', trtexec_args=None)
Initialize the trtexec benchmark.
- Parameters:
timing_cache_file (str | None) – See Benchmark.__init__().
warmup_runs (int) – See Benchmark.__init__().
timing_runs (int) – See Benchmark.__init__().
plugin_libraries (list[str] | None) – See Benchmark.__init__().
trtexec_path (str) – Path to trtexec binary. Defaults to 'trtexec', which looks for the binary in PATH.
trtexec_args (list[str] | None) – Additional command-line arguments to pass to trtexec. These are appended after the standard arguments. Example: ['--fp16', '--workspace=4096', '--verbose']
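How extra arguments end up after the standard ones can be sketched as plain list concatenation. `build_command` is a hypothetical helper; the set of standard arguments the class actually passes is not listed in this reference, so only `--onnx` is shown as a placeholder for them.

```python
def build_command(model_path, trtexec_path="trtexec", trtexec_args=None):
    # Standard arguments come first (only --onnx is modeled here).
    cmd = [trtexec_path, f"--onnx={model_path}"]
    # User-supplied arguments are appended last.
    cmd.extend(trtexec_args or [])
    return cmd


cmd = build_command("model.onnx", trtexec_args=["--fp16", "--verbose"])
```

Keeping the command as a list of separate arguments, rather than one shell string, avoids quoting problems when it is later handed to a subprocess call.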
- run(path_or_bytes, log_file=None, flush_timing_cache=False)
Run benchmark using trtexec.
- Parameters:
path_or_bytes (str | bytes) – Path to the ONNX model (str) or raw model data (bytes)
log_file (str | None) – Optional path to save trtexec logs
flush_timing_cache (bool)
- Returns:
Measured median latency in milliseconds
- Return type:
float