.. _DirectML_Deployment:
===================
DirectML
===================
Once an ONNX FP16 model is quantized using TensorRT Model Optimizer on Windows, the resulting quantized ONNX model can be deployed on the DirectML backend via `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.

ONNX Runtime GenAI
==================
ONNX Runtime GenAI offers a streamlined solution for deploying generative AI models, layering generation-specific functionality on top of ONNX Runtime's optimized execution.

**Key Features**:

- **Enhanced Optimizations**: Supports optimizations specific to generative AI, including efficient KV cache management and logits processing.
- **Flexible Sampling Methods**: Offers various sampling techniques, such as greedy search, beam search, and top-p/top-k sampling, to suit different deployment needs.
- **Control Options**: Use the high-level ``generate()`` method for rapid deployment, or run each model iteration in your own loop for fine-grained control (see the sketch after this list).
- **Multi-Language API Support**: Provides APIs for Python, C#, and C/C++, allowing seamless integration across a range of applications.
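
The following is a minimal sketch of the fine-grained loop, assuming the ``Generator``-based Python API of the ``onnxruntime-genai-directml`` package; the package name and exact method names (such as ``append_tokens``) vary across releases, so treat the ONNX Runtime GenAI documentation as authoritative:

.. code-block:: python

    # Sketch only: package name and API details are assumptions that differ by release.
    # Assumed install: pip install onnxruntime-genai-directml
    import onnxruntime_genai as og

    model = og.Model("path/to/quantized/model/dir")  # folder with the ONNX model and genai_config.json
    tokenizer = og.Tokenizer(model)

    params = og.GeneratorParams(model)
    # Sampling knobs corresponding to the "Flexible Sampling Methods" above.
    params.set_search_options(max_length=256, do_sample=True, top_p=0.9, top_k=50)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode("What is DirectML?"))

    # Fine-grained control: one model iteration per loop step.
    while not generator.is_done():
        generator.generate_next_token()

    print(tokenizer.decode(generator.get_sequence(0)))
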
**Getting Started**:

Refer to the `ONNX Runtime GenAI documentation <https://onnxruntime.ai/docs/genai/>`_ for an in-depth guide to installation, setup, and usage.

**Examples**:

- Explore the `inference scripts <https://github.com/microsoft/onnxruntime-genai/tree/main/examples>`_ in the ONNX Runtime GenAI repository for generating output sequences using a single function call.
- Follow the `ONNX Runtime GenAI tutorials <https://onnxruntime.ai/docs/genai/tutorials/>`_ for a step-by-step walkthrough of inference with DirectML using the ORT GenAI package (for example, the Phi-3 tutorial); a streaming variant of the loop above is sketched after this list.
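
For chat-style applications such as the Phi-3 tutorial, tokens are typically decoded and printed as they are produced. A short sketch, again assuming the ``onnxruntime-genai`` Python API (``create_stream`` and ``get_next_tokens`` are assumptions that may differ by release):

.. code-block:: python

    import onnxruntime_genai as og

    model = og.Model("path/to/quantized/model/dir")
    tokenizer = og.Tokenizer(model)
    stream = tokenizer.create_stream()  # incremental detokenizer (assumed API)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=256)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode("Tell me about DirectML."))

    # Decode and print each token as soon as it is generated.
    while not generator.is_done():
        generator.generate_next_token()
        print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
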
ONNX Runtime
============
Alternatively, the quantized model can be deployed directly with `ONNX Runtime <https://onnxruntime.ai/>`_. This approach requires manually managing the model inputs on each iteration of the generation loop, including the KV-cache inputs and the attention mask, as sketched below.
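
The sketch below shows the shape of such a loop, assuming the ``onnxruntime-directml`` package and a decoder model whose I/O follows the common ``input_ids`` / ``attention_mask`` / ``past_key_values.*`` / ``present.*`` naming. The layer count, head count, dtypes, and tensor names are all model-specific assumptions; inspect ``session.get_inputs()`` and ``session.get_outputs()`` for the real ones.

.. code-block:: python

    import numpy as np
    import onnxruntime as ort

    # Assumed install: pip install onnxruntime-directml
    session = ort.InferenceSession("model.onnx", providers=["DmlExecutionProvider"])

    NUM_LAYERS = 32                   # assumption: model-specific
    NUM_KV_HEADS, HEAD_DIM = 8, 128   # assumption: model-specific

    input_ids = np.array([[1, 2, 3]], dtype=np.int64)  # a pre-tokenized prompt
    attention_mask = np.ones_like(input_ids)

    # Empty KV cache (sequence length 0) for the first iteration.
    past = {
        f"past_key_values.{i}.{kv}": np.zeros((1, NUM_KV_HEADS, 0, HEAD_DIM), dtype=np.float16)
        for i in range(NUM_LAYERS) for kv in ("key", "value")
    }

    output_names = [o.name for o in session.get_outputs()]  # assumed order: logits, then present.*
    generated = input_ids[0].tolist()

    for _ in range(64):  # max new tokens
        outputs = session.run(None, {"input_ids": input_ids,
                                     "attention_mask": attention_mask, **past})
        next_token = int(np.argmax(outputs[0][0, -1]))  # greedy pick from (batch, seq, vocab) logits
        generated.append(next_token)

        # Feed the updated cache back in: present.* outputs become past_key_values.* inputs.
        for name, value in zip(output_names[1:], outputs[1:]):
            past[name.replace("present", "past_key_values")] = value
        input_ids = np.array([[next_token]], dtype=np.int64)
        attention_mask = np.concatenate(
            [attention_mask, np.ones((1, 1), dtype=np.int64)], axis=1)

    # `generated` now holds prompt + new token ids; detokenize with your tokenizer.

ONNX Runtime GenAI performs this bookkeeping internally, which is why it is the simpler path for most deployments.
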
**Examples and Documentation**:

For further details and examples, refer to the `ONNX Runtime documentation <https://onnxruntime.ai/docs/>`_.

Collection of optimized ONNX models
===================================
Ready-to-deploy optimized ONNX models from ModelOpt-Windows are published in the `NVIDIA collections on Hugging Face <https://huggingface.co/nvidia>`_. These models can be deployed using the DirectML backend; follow the deployment instructions provided with each published model.