DirectML Deployment
Once an ONNX FP16 model is quantized using TensorRT Model Optimizer on Windows, the resulting quantized ONNX model can be deployed on the DirectML backend via ONNX Runtime GenAI or ONNX Runtime.
ONNX Runtime GenAI
ONNX Runtime GenAI offers a streamlined solution for deploying generative AI models with optimized performance and functionality.
Key Features:
Enhanced Optimizations: Supports optimizations specific to generative AI, including efficient KV cache management and logits processing.
Flexible Sampling Methods: Offers various sampling techniques, such as greedy search, beam search, and top-p/top-k sampling, to suit different deployment needs.
Control Options: Use the high-level generate() method for rapid deployment, or execute each iteration of the model in a loop for fine-grained control.
Multi-Language API Support: Provides APIs for Python, C#, and C/C++, allowing seamless integration across a range of applications.
Getting Started:
Refer to the ONNX Runtime GenAI documentation for an in-depth guide on installation, setup, and usage.
Examples:
Explore inference scripts in the ORT GenAI example repository for generating output sequences using a single function call.
Follow the ORT GenAI tutorials for a step-by-step walkthrough of inference with DirectML using the ORT GenAI package (e.g., refer to the Phi3 tutorial).
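The following is a minimal sketch of the single-call generation flow described above, using the onnxruntime-genai Python package. The model path and prompt are placeholders, and the exact API surface (for example, how input tokens are passed to the generator) can differ between onnxruntime-genai releases, so treat this as illustrative rather than authoritative.

```python
# Minimal sketch: high-level generation with ONNX Runtime GenAI.
# The model directory is a placeholder; it should contain the quantized ONNX
# model plus its genai_config.json produced for deployment.
import onnxruntime_genai as og

model = og.Model("path/to/quantized-onnx-model")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256, top_k=50, top_p=0.9)
params.input_ids = tokenizer.encode("What is quantization?")

# A single call runs the full generation loop; KV cache management and
# logits processing are handled internally by ONNX Runtime GenAI.
output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))
```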
ONNX Runtime
Alternatively, the quantized model can be deployed using ONNX Runtime. This method requires manual management of model inputs, including KV cache inputs and attention masks, for each iteration within the generation loop.
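As a rough illustration of that manual loop, the sketch below runs a quantized ONNX model on the DirectML execution provider with plain ONNX Runtime and greedy sampling. The input and output names are assumptions that depend on how the model was exported, and for brevity the sketch re-feeds the full sequence each step instead of threading KV-cache tensors; a model exported with past_key_values inputs would additionally require allocating and updating those tensors every iteration.

```python
# Minimal sketch: manual generation loop with ONNX Runtime on DirectML.
# Input names ("input_ids", "attention_mask") and the prompt tokens below are
# placeholders; real values depend on the exported model and its tokenizer.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "path/to/quantized_model.onnx",
    providers=["DmlExecutionProvider"],  # DirectML backend
)

generated = np.array([[1, 2023, 338]], dtype=np.int64)  # tokenized prompt (placeholder)

for _ in range(32):  # generate a fixed number of new tokens
    attention_mask = np.ones_like(generated)
    logits = session.run(
        None,
        {"input_ids": generated, "attention_mask": attention_mask},
    )[0]
    next_token = int(np.argmax(logits[0, -1]))  # greedy sampling on the last position
    generated = np.concatenate(
        [generated, np.array([[next_token]], dtype=np.int64)], axis=1
    )
```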
Examples and Documentation
For further details and examples, please refer to the ONNX Runtime documentation.
Collection of optimized ONNX models
The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available in the NVIDIA collections on Hugging Face. These models can be deployed using the DirectML backend. Follow the instructions provided along with the published models for deployment.
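One convenient way to fetch such a published model locally is the Hugging Face Hub client, as sketched below. The repository id is a placeholder, not a specific published model; substitute the id listed in the collection you want to deploy.

```python
# Hypothetical example: download a published optimized ONNX model from the
# Hugging Face Hub. The repo_id below is a placeholder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="nvidia/<model-name>-onnx")
print("Model downloaded to:", local_dir)
```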