# Performance Benchmarks

Platform: NVIDIA Jetson AGX Thor Developer Kit (Blackwell, SM110)
SDK Version: TensorRT Edge-LLM 0.4.0
## Definitions

| Term | Description |
|---|---|
| Prefill time | Average wall-clock time (ms) to process the input prompt |
| Prefill throughput | Prompt tokens processed per second during prefill (tok/s) |
| Generation throughput | Tokens generated per second during decoding (tok/s) |
| Batch size | Number of concurrent sequences (BS=1 = single-user latency, BS=8 = multi-user throughput) |
| Acceptance rate | Average tokens accepted per EAGLE verify step (higher is better) |
| Speedup | EAGLE generation throughput / vanilla generation throughput (same model, precision, batch size) |
| ViT time | Total visual encoder processing time per inference run (ms) |
| ViT throughput | Image tokens processed per second by the visual encoder (tok/s) |
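The derived columns in the tables below follow directly from raw timings and token counts. The following is a minimal sketch of those formulas; the helper names are illustrative only and are not part of the TensorRT Edge-LLM API:

```python
# Illustrative only: how the derived metrics on this page relate to raw
# timings. Helper names are assumptions, not TensorRT Edge-LLM APIs.

def prefill_throughput(prompt_tokens: int, prefill_ms: float) -> float:
    """Prompt tokens processed per second during prefill."""
    return prompt_tokens / (prefill_ms / 1000.0)

def generation_throughput(generated_tokens: int, decode_ms: float) -> float:
    """Tokens generated per second during decoding."""
    return generated_tokens / (decode_ms / 1000.0)

def speedup(eagle_tok_s: float, vanilla_tok_s: float) -> float:
    """EAGLE generation throughput relative to vanilla decoding."""
    return eagle_tok_s / vanilla_tok_s

# Example: Llama-3.1-8B-Instruct, NVFP4, BS=1 (values from the tables below).
print(f"{prefill_throughput(383, 31.0):,.0f} tok/s")  # ≈ 12,355 tok/s prefill
print(f"{speedup(189.2, 54.9):.2f}x")                 # ≈ 3.45x EAGLE speedup
```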
## Precision Key

| Precision | Description | Platform Requirement |
|---|---|---|
| FP16 | Half-precision float | All platforms |
| FP8 | 8-bit float | SM89+ (Ada Lovelace and newer) |
| INT4 AWQ | 4-bit integer (AWQ quantization) | All platforms |
| NVFP4 | NVIDIA 4-bit float | SM100+ (Blackwell and newer) |
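The platform requirements above can be checked programmatically from the GPU's compute capability. A minimal sketch, assuming PyTorch is available on the device; the returned strings are labels for this page, not SDK flags:

```python
# Minimal sketch: map the GPU's compute capability to the precisions listed
# in the table above. Uses PyTorch only to query the device.
import torch

def supported_precisions() -> list[str]:
    major, minor = torch.cuda.get_device_capability(0)
    sm = major * 10 + minor            # e.g. (11, 0) -> 110 on Jetson AGX Thor
    precisions = ["FP16", "INT4 AWQ"]  # available on all platforms
    if sm >= 89:                       # Ada Lovelace and newer
        precisions.append("FP8")
    if sm >= 100:                      # Blackwell and newer
        precisions.append("NVFP4")
    return precisions

print(supported_precisions())          # on SM110: ['FP16', 'INT4 AWQ', 'FP8', 'NVFP4']
```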
## LLM — Vanilla Decoding

| Model | Precision | Batch | Prefill (ms) | Prefill Tokens | Prefill (tok/s) | Generation (tok/s) |
|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | INT4 AWQ | 1 | 215.5 | 383 | 1,777 | 50.8 |
| Llama-3.1-8B-Instruct | INT4 AWQ | 8 | 2737.4 | 3064 | 1,119 | 135.3 |
| Llama-3.1-8B-Instruct | NVFP4 | 1 | 31.0 | 383 | 12,355 | 54.9 |
| Llama-3.1-8B-Instruct | NVFP4 | 8 | 387.6 | 3064 | 7,905 | 308.7 |
| Qwen3-0.6B | INT4 AWQ | 1 | 21.0 | 366 | 17,429 | 270.2 |
| Qwen3-0.6B | INT4 AWQ | 8 | 241.8 | 2927 | 12,104 | 828.0 |
| Qwen3-0.6B | NVFP4 | 1 | 8.8 | 366 | 41,591 | 318.6 |
| Qwen3-0.6B | NVFP4 | 8 | 95.4 | 2927 | 30,681 | 1562.4 |
| Qwen3-4B-Instruct-2507 | INT4 AWQ | 1 | 116.2 | 364 | 3,133 | 76.4 |
| Qwen3-4B-Instruct-2507 | INT4 AWQ | 8 | 1502.3 | 2911 | 1,938 | 240.3 |
| Qwen3-4B-Instruct-2507 | NVFP4 | 1 | 22.9 | 364 | 15,895 | 90.2 |
| Qwen3-4B-Instruct-2507 | NVFP4 | 8 | 301.9 | 2911 | 9,642 | 507.4 |
| Qwen3-8B | INT4 AWQ | 1 | 212.0 | 366 | 1,726 | 47.7 |
| Qwen3-8B | INT4 AWQ | 8 | 2719.1 | 2927 | 1,076 | 162.3 |
| Qwen3-8B | NVFP4 | 1 | 32.8 | 366 | 11,159 | 53.7 |
| Qwen3-8B | NVFP4 | 8 | 425.8 | 2927 | 6,874 | 372.2 |
## Vision Language Model — Vanilla Decoding

| Model | LLM Prec | ViT Prec | Prefill (ms) | Prefill Tokens | Prefill (tok/s) | ViT Time (ms) | ViT Tok/Run | ViT (tok/s) | Generation (tok/s) |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | INT4 AWQ | FP16 | 195.1 | 376 | 1,927 | 51.1 | 344 | 6,732 | 53.1 |
| Qwen2.5-VL-7B-Instruct | INT4 AWQ | FP8 | 195.1 | 376 | 1,927 | 42.7 | 344 | 8,056 | 53.1 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | FP16 | 25.7 | 376 | 14,631 | 51.0 | 344 | 6,745 | 57.7 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | FP8 | 25.7 | 376 | 14,631 | 42.6 | 344 | 8,075 | 57.6 |
| Qwen3-VL-2B-Instruct | INT4 AWQ | FP16 | 39.4 | 283 | 7,183 | 19.0 | 262 | 13,789 | 144.4 |
| Qwen3-VL-2B-Instruct | INT4 AWQ | FP8 | 39.4 | 283 | 7,183 | 15.4 | 262 | 17,013 | 144.7 |
| Qwen3-VL-2B-Instruct | NVFP4 | FP16 | 10.1 | 283 | 28,020 | 19.0 | 262 | 13,789 | 180.8 |
| Qwen3-VL-2B-Instruct | NVFP4 | FP8 | 10.1 | 283 | 28,020 | 15.5 | 262 | 16,903 | 181.0 |
Note: ViT time = per-token ViT latency x image tokens per run. FP8 ViT reduces visual encoder time by ~17% compared to FP16 with negligible impact on generation throughput.
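For example, in the Qwen2.5-VL-7B-Instruct rows with an FP16 ViT, 344 image tokens at 6,732 tok/s gives 344 / 6,732 ≈ 0.0511 s ≈ 51.1 ms, matching the reported ViT time.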
## LLM — EAGLE Speculative Decoding

### Draft Models

| Base Model | Draft Model | Source |
|---|---|---|
| Llama-3.1-8B-Instruct | EAGLE3-LLaMA3.1-Instruct-8B | |
| Qwen3-8B | qwen3_8b_eagle3 | |
Note: Both base and draft models are quantized to the same precision (INT4 AWQ or NVFP4) as listed in the table below.
| Model | Base Prec | Draft Prec | Batch | Prefill (ms) | Prefill Tokens | Generation (tok/s) | Accept Rate | Speedup |
|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | INT4 AWQ | INT4 AWQ | 1 | 215.2 | 382 | 81.0 | 5.25 | 1.59x |
| Llama-3.1-8B-Instruct | INT4 AWQ | INT4 AWQ | 8 | 2735.5 | 3056 | 118.0 | 5.21 | 0.87x |
| Llama-3.1-8B-Instruct | NVFP4 | NVFP4 | 1 | 30.8 | 382 | 189.2 | 5.21 | 3.45x |
| Llama-3.1-8B-Instruct | NVFP4 | NVFP4 | 8 | 413.1 | 3056 | 484.7 | 5.15 | 1.57x |
| Qwen3-8B | INT4 AWQ | INT4 AWQ | 1 | 212.2 | 366 | 66.1 | 4.36 | 1.39x |
| Qwen3-8B | INT4 AWQ | INT4 AWQ | 8 | 2719.1 | 2927 | 99.1 | 4.31 | 0.61x |
| Qwen3-8B | NVFP4 | NVFP4 | 1 | 33.1 | 366 | 151.7 | 4.26 | 2.82x |
| Qwen3-8B | NVFP4 | NVFP4 | 8 | 429.1 | 2927 | 457.7 | 4.25 | 1.23x |
Note: EAGLE speculative decoding provides the greatest speedup at BS=1 (latency-bound). At BS=8, base model compute is already well-utilized, limiting speculative acceleration. See Speculative Decoding for setup instructions.
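As a cross-check against the vanilla decoding table: Llama-3.1-8B-Instruct INT4 AWQ at BS=8 generates 118.0 tok/s with EAGLE versus 135.3 tok/s vanilla, i.e. 118.0 / 135.3 ≈ 0.87x (a net slowdown), while the same model in NVFP4 at BS=1 reaches 189.2 / 54.9 ≈ 3.45x.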
## Vision Language Model — EAGLE Speculative Decoding

### Draft Models

| Base Model | Draft Model | Source |
|---|---|---|
| Qwen2.5-VL-7B-Instruct | qwen2.5-vl-7b-eagle3-sgl | |
Note: Both base and draft models are quantized to the same precision as listed in the table below.
| Model | Base Prec | Draft Prec | ViT Prec | Prefill (ms) | Prefill Tokens | Generation (tok/s) | Accept Rate | Speedup |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | INT4 AWQ | INT4 AWQ | FP16 | 195.1 | 376 | 57.3 | 3.66 | 1.08x |
| Qwen2.5-VL-7B-Instruct | NVFP4 | NVFP4 | FP16 | 25.8 | 376 | 149.6 | 3.82 | 2.59x |
| Qwen2.5-VL-7B-Instruct | NVFP4 | NVFP4 | FP8 | 32.8 | 376 | 117.3 | 3.76 | 2.04x |
## Key Observations

- NVFP4 delivers the highest throughput: it achieves 1.1–2.3x higher generation throughput than INT4 AWQ, with substantially faster prefill (e.g., 31 ms vs 216 ms for Llama-3.1-8B at BS=1).
- EAGLE at BS=1 provides meaningful speedup: 1.4–3.5x for LLMs, best for Llama-3.1-8B NVFP4 (3.45x). The draft model acceptance rate is high for Llama (~5.2 tokens/step) and moderate for Qwen3-8B (~4.3 tokens/step).
- EAGLE at BS=8 has limited benefit: at high batch sizes, base model compute is already well utilized. Speedup drops below 1x for INT4 AWQ and to 1.2–1.6x for NVFP4.
- Qwen3-0.6B achieves the highest absolute throughput: 1,562 tok/s at BS=8 with NVFP4, making this lightweight model well suited to latency-sensitive edge applications.
All benchmarks use default TensorRT Edge-LLM inference settings. Production performance may vary with system-level tuning (power mode, memory configuration, thermal management).