# PyTorch Backend

```{note}
This feature is currently experimental, and the related API is subject to change in future versions.
```

To enhance the usability of the system and improve developer efficiency, TensorRT-LLM provides a new experimental backend based on PyTorch.

The PyTorch backend of TensorRT-LLM is available in version 0.17 and later. You can try it by importing `tensorrt_llm._torch`.

## Quick Start

Here is a simple example showing how to use the `tensorrt_llm._torch.LLM` API with a Llama model.

```{literalinclude} ../../examples/pytorch/quickstart.py
:language: python
:linenos:
```

## Quantization

The PyTorch backend supports FP8 and NVFP4 quantization. You can pass quantized models hosted on the HF model hub, which are generated by the [TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer).

```python
from tensorrt_llm._torch import LLM

llm = LLM(model='nvidia/Llama-3.1-8B-Instruct-FP8')
llm.generate("Hello, my name is")
```

Or you can quantize a model yourself with the following commands:

```bash
git clone https://github.com/NVIDIA/TensorRT-Model-Optimizer.git
cd TensorRT-Model-Optimizer/examples/llm_ptq
scripts/huggingface_example.sh --model <model_path> --quant fp8 --export_fmt hf
```

## Developer Guide

- [Architecture Overview](./torch/arch_overview.md)
- [Adding a New Model](./torch/adding_new_model.md)

## Key Components

- [Attention](./torch/attention.md)
- [KV Cache Manager](./torch/kv_cache_manager.md)
- [Scheduler](./torch/scheduler.md)

## Known Issues

- The PyTorch workflow on SBSA is incompatible with bare metal environments like Ubuntu 24.04. Please use the [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for optimal support on SBSA platforms; a minimal container sketch follows below.
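
For example, a minimal sketch of running inside the NGC PyTorch container on an SBSA machine is shown below. The container tag is illustrative only; pick a current tag from the NGC catalog, and consult the TensorRT-LLM installation guide for the exact install command for your version.

```bash
# Start an interactive NGC PyTorch container (tag is illustrative; choose a current one)
docker run --rm -it --gpus all --ipc=host nvcr.io/nvidia/pytorch:25.01-py3

# Inside the container, install TensorRT-LLM
# (the exact install command may vary by version; see the installation guide)
pip install tensorrt_llm
```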