AutoCast (ONNX)
AutoCast is a tool for converting FP32 ONNX models to mixed precision FP32-FP16 or FP32-BF16 models. When casting FP32 to FP16/BF16, some nodes are more sensitive to the reduced precision and can degrade model accuracy. AutoCast intelligently selects nodes to keep in FP32 so that accuracy is maintained while the rest of the nodes benefit from reduced precision, and it automatically injects cast operations around the selected nodes.
Basic Commandline Usage
usage: python -m modelopt.onnx.autocast [-h] --onnx_path ONNX_PATH
[--output_path OUTPUT_PATH]
[--low_precision_type {fp16,bf16}]
[--calibration_data CALIBRATION_DATA]
[--nodes_to_exclude [NODES_TO_EXCLUDE ...]]
[--op_types_to_exclude [OP_TYPES_TO_EXCLUDE ...]]
[--data_max DATA_MAX]
[--init_max INIT_MAX]
[--init_conversion_max_bytes INIT_CONVERSION_MAX_BYTES]
[--keep_io_types]
[--log_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Named Arguments
- --onnx_path
Path to the ONNX model
- --output_path
Output filename to save the converted ONNX model. If None, the converted model is saved in the same directory as the original ONNX model with an appropriate suffix.
- --low_precision_type, -t
Possible choices: fp16, bf16
Precision to reduce to. Default: 'fp16'
- --calibration_data, -d
File path to inputs for the reference runner, either an NPZ or Polygraphy JSON file. If not provided, random inputs will be used (see the sketch after this list).
- --nodes_to_exclude, -n
List of regex patterns to match node names that should remain in FP32. Default: []
- --op_types_to_exclude, -op
List of op types that should remain in FP32. Default: []
- --data_max
Maximum absolute value for node outputs; nodes with outputs greater than this value will remain in FP32. Default: 512
- --init_max
Maximum absolute value for initializers; nodes with initializers greater than this value will remain in FP32. Default: 65500.0
- --init_conversion_max_bytes
Maximum size in bytes for initializer conversion. Larger initializers will be cast at runtime. Default: 1048576
- --keep_io_types
Keep the input and output types of the model; otherwise they will be converted to the selected low precision type. Default: False
- --log_level
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL
Log level. Default: 'INFO'
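For example, an NPZ calibration file can be built with NumPy. This is a minimal sketch, assuming a hypothetical model with a single input named "input" of shape (1, 3, 224, 224); the keys and array shapes/dtypes should match your model's actual input tensors:
import numpy as np
# Hypothetical input name and shape -- adjust to your model's real inputs.
calibration_inputs = {
    "input": np.random.rand(1, 3, 224, 224).astype(np.float32),
}
np.savez("calibration_data.npz", **calibration_inputs)
The resulting file can then be passed to AutoCast via --calibration_data calibration_data.npz.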
Python API Usage
AutoCast can also be used programmatically through its Python API:
import onnx
from modelopt.onnx.autocast import convert
# Convert model to mixed precision
converted_model = convert(
    onnx_path="model.onnx",
    low_precision_type="fp16",  # or "bf16"
    nodes_to_exclude=None,  # optional list of node name patterns to keep in FP32
    op_types_to_exclude=None,  # optional list of op types to keep in FP32
    data_max=512,  # threshold for node outputs
    init_max=65504,  # threshold for initializers
    keep_io_types=False,  # whether to preserve input/output types
    calibration_data=None,  # optional path to input data file
    init_conversion_max_bytes=1073741824,  # maximum size in bytes for initializer conversion, 1 << 30
)
# Save the converted model
onnx.save(converted_model, "converted_model.onnx")
How It Works
AutoCast follows these steps to convert a model:
1. Model Loading and Sanitization:
   - Loads the ONNX model
   - Performs graph sanitization and optimizations
   - Ensures minimum opset version requirements are met (13 for FP16, 22 for BF16)
2. Node Classification:
   - Analyzes each node in the graph
   - Determines which nodes should remain in FP32 based on input and output tensor magnitudes, operation types, and node name patterns
   - If a calibration dataset is provided, it is used to generate intermediate tensor magnitudes for more accurate node classification; otherwise random inputs are used
3. Precision Conversion:
   - Converts eligible nodes to lower precision
   - Automatically inserts the necessary cast operations
   - Automatically replaces initializers with lower precision values
4. Validation and Export:
   - Verifies that the model is a valid ONNX model (using onnx.checker)
   - Checks that no output tensors are disconnected
   - Verifies that the original and converted network input/output names match
   - Ensures that the input and output types are handled according to keep_io_types
   - Saves the converted model (a quick post-conversion check is sketched below)
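As a quick check of your own, separate from AutoCast's built-in validation, the saved model can be re-verified with onnx.checker and inspected for the injected Cast nodes. A minimal sketch, assuming the converted model was saved as converted_model.onnx:
from collections import Counter
import onnx
model = onnx.load("converted_model.onnx")
onnx.checker.check_model(model)  # raises if the graph is structurally invalid
# Count op types to see how many Cast operations were injected
op_counts = Counter(node.op_type for node in model.graph.node)
print("Cast nodes:", op_counts.get("Cast", 0))
print("Inputs:", [i.name for i in model.graph.input])
print("Outputs:", [o.name for o in model.graph.output])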
Best Practices
Start with Default Settings: Begin with default thresholds and gradually adjust based on accuracy requirements.
Monitor Node Conversion: Use INFO level logging to see what percentage of nodes were converted to lower precision. Use DEBUG level logging to see more detailed information about the node classification process.
Preserve Critical Operations: Use op_types_to_exclude for operations known to be sensitive to precision reduction.
Validate with Real Data: Provide representative input data using the calibration_data option for more accurate node classification (a comparison sketch follows this list).
BF16 Conversion:
- BF16 conversion is not supported for all operations.
- AutoCast will automatically convert the model to opset 22 to enable more BF16 operations.
- Use --op_types_to_exclude to exclude operations that are not supported in BF16.
- BF16 accuracy may require additional tuning of the data_max and init_max thresholds.
- TensorRT might not be able to support all BF16 converted models.
Large Initializers:
- Attempting to convert large initializers might cause host memory issues.
- Use --init_conversion_max_bytes to limit the size of initializers that will be converted at compile time.
- Initializers larger than --init_conversion_max_bytes will be converted at runtime (using a cast operation).
- Increasing this value may result in smaller models and faster inference, but could also cause AutoCast to crash during conversion.
- For best results, use the highest --init_conversion_max_bytes value that host memory can handle.
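As an example of validating with real data, the original and converted models can be run side by side with ONNX Runtime and their outputs compared. This is a minimal sketch, assuming the model was converted with --keep_io_types (so both models accept the same FP32 inputs) and a hypothetical single input named "input"; substitute representative data for the random feed:
import numpy as np
import onnxruntime as ort
original = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
converted = ort.InferenceSession("converted_model.onnx", providers=["CPUExecutionProvider"])
# Replace with representative inputs; random data is only a smoke test.
feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
ref_outputs = original.run(None, feed)
new_outputs = converted.run(None, feed)
for ref, new in zip(ref_outputs, new_outputs):
    print("max abs diff:", np.max(np.abs(ref - new.astype(ref.dtype))))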
Limitations and Restrictions
AutoCast does not yet support models with custom operators / plugins.
AutoCast does not yet support quantized models.
BF16 conversion is not supported for all operations.
Large models (e.g. over 2GB) might cause memory issues.
Example Usage
Basic conversion to FP16:
python -m modelopt.onnx.autocast --onnx_path model.onnx
Basic conversion with verbose logging and custom output path:
python -m modelopt.onnx.autocast --onnx_path model.onnx --output_path custom_path.onnx --log_level DEBUG
Convert to BF16 with custom data magnitude threshold and custom disabled op types:
python -m modelopt.onnx.autocast --onnx_path model.onnx \
--low_precision_type bf16 \
--data_max 256 \
--op_types_to_exclude Resize
Bypass data magnitude check and keep specific node names in FP32:
python -m modelopt.onnx.autocast --onnx_path model.onnx --data_max inf --nodes_to_exclude ".*attn.*"
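The last command can also be expressed through the Python API shown earlier. A sketch, assuming that passing float("inf") for data_max behaves like --data_max inf on the command line:
import onnx
from modelopt.onnx.autocast import convert
converted_model = convert(
    onnx_path="model.onnx",
    data_max=float("inf"),  # assumption: bypasses the data magnitude check like --data_max inf
    nodes_to_exclude=[".*attn.*"],  # keep attention nodes in FP32
)
onnx.save(converted_model, "converted_model.onnx")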