TensorRT Edge-LLM

Table of Contents

Getting Started

  • Overview
  • Supported Models
  • Installation
  • Quick Start Guide
  • Limitations and Known Issues

Examples

  • VLM (Vision-Language Model) Inference
  • Speculative Decoding
  • Phi-4-Multimodal
  • ASR (Automatic Speech Recognition)
  • MoE (Mixture of Experts)
  • TTS (Text-to-Speech)

Features

  • LoRA (Low-Rank Adaptation)
  • Vocabulary Reduction
  • FP8 KV Cache
  • System Prompt Cache

Input & Chat Format

  • Input JSON Format
  • Chat Template Format

Software Design

  • Python Export Pipeline
  • Engine Builder
  • C++ Runtime Overview
  • LLM Inference Runtime
  • LLM Inference SpecDecode Runtime

Customization

  • Customization Guide
  • TensorRT Plugins Guide

Testing

  • Code Coverage with SonarQube

APIs

  • Python API Reference
  • C++ API Reference
    • Builder Module
      • Audio Builder
      • Builder Utils
      • LLM Builder
      • Visual Builder
    • Common Module
      • Binding Names
      • Check Macros
      • CUDA Macros
      • CUDA Utils
      • File Utils
      • Hash Utils
      • Input Limits
      • Logger
      • Math Utils
      • MMAP Reader
      • Safetensors Utils
      • String Utils
      • Tensor
      • TRT Utils
      • Version
    • Kernels Module
      • Apply Rope Write KV
      • Batch Evict Kernels
      • Causal Conv1d
      • Common
      • Context FMHA Runner
      • Conversion
      • Cute Dsl FMHA Runner
      • Decoder XQA Runner
      • Dequant
      • Dequantize
      • EAGLE Accept Kernels
      • EAGLE Util Kernels
      • Embedding Kernels
      • FMHA Params V2
      • Image Util Kernels
      • Initialize Cos Sin Cache
      • Int4 Groupwise GEMM
      • Kernel
      • Kernel Selector
      • KV Cache Utils Kernels
      • Marlin
      • Marlin Dtypes
      • Marlin Mma
      • Marlin Template
      • Moe Activation Kernels
      • Moe Align Sum Kernels
      • Moe Marlin
      • Moe Marlin Indices Kernels
      • Moe Topk Softmax Kernels
      • Selective State Update
      • Talker Mlp Kernels
      • Util Kernels
      • Vectorized Types
    • Multimodal Module
      • Audio Runner
      • Audio Utils
      • Code2 Wav Runner
      • Image Utils
      • Intern ViT Runner
      • Model Types
      • Multimodal Runner
      • Phi4mm ViT Runner
      • Qwen ViT Runner
    • Plugins Module
      • Attention Plugin
      • Causal Conv1d Plugin
      • Int4 Groupwise GEMM Plugin
      • Int4 Moe Plugin
      • Mamba Plugin
      • Plugin Utils
      • VIT Attention Plugin
    • Profiling Module
      • Layer Profiler
      • Metrics
      • Nvtx Wrapper
      • Timer
    • Runtime Module
      • Audio Utils
      • EAGLE Draft Engine Runner
      • Image Utils
      • Linear KV Cache
      • LLM Engine Runner
      • LLM Inference Runtime
      • LLM Inference Spec Decode Runtime
      • LLM Runtime Utils
      • Qwen3 Omni Tts Runtime
    • Sampler Module
      • Sampling
    • Tokenizer Module
      • Pre Tokenizer
      • Token Encoder
      • Tokenizer
      • Tokenizer Utils
      • Unicode Data

Quick Links

  • Releases
  • GitHub
  • Roadmap

Runtime Module

API documentation for the C++ runtime module.

  • Audio Utils
  • EAGLE Draft Engine Runner
  • Image Utils
  • Linear KV Cache
  • LLM Engine Runner
  • LLM Inference Runtime
  • LLM Inference Spec Decode Runtime
  • LLM Runtime Utils
  • Qwen3 Omni Tts Runtime



Copyright © 2025, NVIDIA.

Last updated on March 14, 2026.

This page was generated from TensorRT-Edge-LLM commit d71c009.