.. _DirectML_Deployment:

===================
DirectML
===================

Once an ONNX FP16 model is quantized using TensorRT Model Optimizer on Windows, the resulting quantized ONNX model can be deployed on the DirectML backend via `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.

ONNX Runtime GenAI
==================

ONNX Runtime GenAI offers a streamlined solution for deploying generative AI models with optimized performance and functionality.

**Key Features**:

- **Enhanced Optimizations**: Supports optimizations specific to generative AI, including efficient KV cache management and logits processing.
- **Flexible Sampling Methods**: Offers various sampling techniques, such as greedy search, beam search, and top-p/top-k sampling, to suit different deployment needs.
- **Control Options**: Use the high-level ``generate()`` method for rapid deployment, or run each model iteration yourself in a loop for fine-grained control (see the sketch after the examples below).
- **Multi-Language API Support**: Provides APIs for Python, C#, and C/C++, allowing seamless integration across a range of applications.

**Getting Started**:

Refer to the `ONNX Runtime GenAI documentation <https://onnxruntime.ai/docs/genai/>`_ for an in-depth guide on installation, setup, and usage.

**Examples**:

- Explore the `inference scripts <https://github.com/microsoft/onnxruntime-genai/tree/main/examples>`_ in the ORT GenAI example repository for generating output sequences with a single function call.
- Follow the `ORT GenAI tutorials <https://onnxruntime.ai/docs/genai/tutorials/>`_ for a step-by-step walkthrough of inference with DirectML using the ORT GenAI package (e.g., the Phi-3 tutorial).
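As an orientation aid, here is a minimal sketch of the token-by-token generation loop using the ``onnxruntime-genai-directml`` Python package. The model path and prompt are placeholders, and the exact generator API (for example, ``append_tokens`` versus assigning input IDs to the parameters object) has varied between releases, so treat this as a sketch rather than canonical usage:

.. code-block:: python

    # Minimal sketch: greedy generation with ONNX Runtime GenAI on DirectML.
    # Assumes `pip install onnxruntime-genai-directml`; the model directory
    # (containing the ONNX model and genai_config.json) is a placeholder,
    # and API details vary by release.
    import onnxruntime_genai as og

    model = og.Model("path/to/quantized-model-dir")
    tokenizer = og.Tokenizer(model)
    stream = tokenizer.create_stream()  # incremental detokenizer

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=256)  # greedy search by default

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode("What does quantization do?"))

    # Each iteration runs one model forward pass; the KV cache is managed
    # internally by ORT GenAI.
    while not generator.is_done():
        generator.generate_next_token()
        print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)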
ONNX Runtime
============

Alternatively, the quantized model can be deployed using `ONNX Runtime <https://onnxruntime.ai/>`_ directly. This method requires manual management of the model inputs, including KV cache inputs and attention masks, for each iteration of the generation loop.

**Examples and Documentation**

For further details and examples, please refer to the `ONNX Runtime documentation <https://onnxruntime.ai/docs/>`_.
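To make "manual management" concrete, the sketch below runs a greedy decoding loop on the DirectML execution provider (the ``onnxruntime-directml`` package). All tensor names, layer counts, and cache shapes here are assumptions that depend on how the model was exported, and a production loop would pre-allocate cache buffers with I/O binding rather than re-feeding arrays each step:

.. code-block:: python

    # Hedged sketch of a manual greedy-decoding loop with ONNX Runtime on
    # DirectML. Tensor names and KV-cache shapes are assumptions; they
    # depend on the exported model.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("model.onnx", providers=["DmlExecutionProvider"])

    input_ids = np.array([[101, 2023, 2003]], dtype=np.int64)  # placeholder tokens
    attention_mask = np.ones_like(input_ids)

    # Hypothetical cache layout: one key/value tensor per layer, empty at first.
    num_layers, num_heads, head_dim = 32, 32, 128  # model-dependent
    past = {
        f"past_key_values.{i}.{kv}": np.zeros((1, num_heads, 0, head_dim), np.float16)
        for i in range(num_layers) for kv in ("key", "value")
    }

    for _ in range(64):  # generate up to 64 new tokens
        outputs = session.run(None, {"input_ids": input_ids,
                                     "attention_mask": attention_mask, **past})
        next_token = int(np.argmax(outputs[0][0, -1]))  # greedy pick from logits

        # Feed only the new token next time; carry the grown cache and mask.
        input_ids = np.array([[next_token]], dtype=np.int64)
        attention_mask = np.concatenate(
            [attention_mask, np.ones((1, 1), np.int64)], axis=1)
        past = dict(zip(past.keys(), outputs[1:]))  # assumes present.* follow logits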
Collection of optimized ONNX models
===================================

Ready-to-deploy optimized ONNX models from ModelOpt-Windows are published in the NVIDIA collections on `Hugging Face <https://huggingface.co/nvidia>`_. These models can be deployed using the DirectML backend; follow the instructions provided alongside each published model.