Performance Benchmarks

Platforms: NVIDIA Jetson AGX Thor Developer Kit, Jetson AGX Orin 64GB, Jetson Orin NX 16GB, and Jetson Orin Nano 8GB. Older release sections may include a narrower platform set.

Definitions

| Term | Description |
| --- | --- |
| Prefill time | Average wall-clock time (ms) to process the input prompt |
| Prefill throughput | Prompt tokens processed per second during prefill (tok/s) |
| Generation throughput | Tokens generated per second during decoding (tok/s) |
| Batch size | Number of concurrent sequences (BS=1 measures single-user latency; BS=8 measures multi-user throughput) |
| Acceptance rate | Average tokens accepted per speculative decoding verify step (higher is better) |
| Speedup | Speculative decoding generation throughput / vanilla generation throughput (same model, precision, and batch size) |
| ViT time | Total visual encoder processing time per inference run (ms) |
| ViT throughput | Image tokens processed per second by the visual encoder (tok/s) |
| GPU memory | Peak GPU memory usage during inference (MB) |
| MTP | Multi-token prediction speculative decoding |
| DFlash | z-lab paired-draft speculative decoding with a dedicated external draft checkpoint |
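The throughput and speedup definitions above are simple ratios. The following is a minimal sketch of that arithmetic; the function names are hypothetical helpers, not part of TensorRT Edge-LLM, and they assume awk is available.

```shell
# Hypothetical helpers illustrating the metric definitions (not part of TensorRT Edge-LLM).

# tok_per_sec TOKENS TIME_MS -> tokens per second, one decimal place.
# Applies to prefill (prompt tokens / prefill time) and ViT (image tokens / ViT time).
tok_per_sec() {
  awk -v t="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", t * 1000 / ms }'
}

# speedup SPEC_GEN_TOK_S VANILLA_GEN_TOK_S -> ratio with two decimals and an "x" suffix.
# Both inputs must come from the same model, precision, and batch size.
speedup() {
  awk -v s="$1" -v v="$2" 'BEGIN { printf "%.2fx\n", s / v }'
}
```

For example, speedup 339.0 173.2 prints 1.96x, matching the Speedup column of the v0.7.1 Qwen3-1.7B EAGLE BS=1 row.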

Precision Key

| Precision | Description | Platform Requirement |
| --- | --- | --- |
| FP16 | Half-precision float | All platforms |
| FP8 | 8-bit float | SM89+ (Ada Lovelace and newer) |
| INT4 AWQ | 4-bit integer (AWQ quantization) | All platforms |
| INT4 GPTQ | 4-bit integer (GPTQ quantization) | All platforms |
| NVFP4 | NVIDIA 4-bit float | SM100+ (Blackwell and newer) |
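The Platform Requirement column can be read as a threshold on the GPU's SM version (compute capability × 10; Jetson Orin-class GPUs, for example, are compute capability 8.7, i.e. SM87). The function below is a hypothetical sketch that takes the SM number as an argument instead of querying the device and returns the precisions from this key that meet the requirement.

```shell
# supported_precisions SM -> space-separated precision keys usable on that SM version,
# following the Platform Requirement column (FP8: SM89+, NVFP4: SM100+).
# Hypothetical helper; it does not inspect the actual GPU.
supported_precisions() {
  local sm="$1"
  local out="FP16 INT4_AWQ INT4_GPTQ"      # available on all platforms
  if [ "$sm" -ge 89 ]; then out="$out FP8"; fi
  if [ "$sm" -ge 100 ]; then out="$out NVFP4"; fi
  echo "$out"
}
```

supported_precisions 87 prints FP16 INT4_AWQ INT4_GPTQ, which is consistent with the Orin-family dashboards listing only INT4 AWQ and INT4 GPTQ LLM precisions, while NVFP4 rows appear only for Jetson AGX Thor.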


v0.8.0 Results

SDK Version: TensorRT Edge-LLM 0.8.0  |  JetPack: 7.2  |  Source: v0.8.0 release benchmark outputs  |  Devices: Jetson AGX Thor, Jetson AGX Orin 64GB, Jetson Orin NX 16GB, Jetson Orin Nano 8GB

Limitation: v0.8.0 has a performance regression that affects all benchmarked release devices uniformly. The regression is fixed in v0.9.0, so use v0.9.0 or later for current performance expectations.

llm_bench Commands

The v0.8.0 release benchmarks use llm_bench for synthetic component timing. The release runs used --warmup 2, --iterations 10, and --profile. Replace the paths and lengths below with the engine directory and shape listed in the corresponding table.

# Prefill throughput
./build/examples/llm/llm_bench \
    --engineDir <llm_engine_dir> \
    --mode prefill \
    --batchSize <batch_size> \
    --inputLen <input_len> \
    --warmup 2 \
    --iterations 10 \
    --profile

# Decode throughput
./build/examples/llm/llm_bench \
    --engineDir <llm_engine_dir> \
    --mode decode \
    --batchSize <batch_size> \
    --pastKVLen <past_kv_len> \
    --warmup 2 \
    --iterations 10 \
    --profile

# Speculative decoding component timing
./build/examples/llm/llm_bench --engineDir <engine_dir> \
    --mode spec_draft_prefill --batchSize <batch_size> --inputLen <input_len> \
    --warmup 2 --iterations 10 --profile

./build/examples/llm/llm_bench --engineDir <engine_dir> \
    --mode spec_draft_proposal --batchSize <batch_size> \
    --draftTreeSize <draft_tree_size> --pastKVLen <past_kv_len> \
    --warmup 2 --iterations 10 --profile

./build/examples/llm/llm_bench --engineDir <engine_dir> \
    --mode spec_verify --batchSize <batch_size> \
    --verifyTreeSize <verify_tree_size> --pastKVLen <past_kv_len> \
    --warmup 2 --iterations 10 --profile

# Visual encoder timing
./build/examples/llm/llm_bench --engineDir <visual_engine_dir> \
    --mode visual --imageSize <height>x<width> \
    --warmup 2 --iterations 10 --profile
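To collect the prefill and decode timings for several batch sizes, the commands above can be wrapped in a loop. This is a sketch only: it reuses just the flags shown above, while the bench_sweep function name and the LLM_BENCH override variable are hypothetical conveniences, not llm_bench features.

```shell
# Hypothetical sweep wrapper around the prefill and decode llm_bench commands above.
# Override LLM_BENCH (for example with "echo") to print the commands as a dry run.
LLM_BENCH="${LLM_BENCH:-./build/examples/llm/llm_bench}"

# bench_sweep ENGINE_DIR INPUT_LEN PAST_KV_LEN BATCH_SIZE...
bench_sweep() {
  local engine_dir="$1" input_len="$2" past_kv_len="$3"
  shift 3
  local bs
  for bs in "$@"; do
    # Prefill timing for this batch size
    "$LLM_BENCH" --engineDir "$engine_dir" --mode prefill \
      --batchSize "$bs" --inputLen "$input_len" \
      --warmup 2 --iterations 10 --profile || return 1
    # Decode timing for this batch size
    "$LLM_BENCH" --engineDir "$engine_dir" --mode decode \
      --batchSize "$bs" --pastKVLen "$past_kv_len" \
      --warmup 2 --iterations 10 --profile || return 1
  done
}
```

For example, bench_sweep <llm_engine_dir> 2048 <past_kv_len> 1 8 covers the input length and batch sizes used in the prefill table below; setting LLM_BENCH=echo prints the commands without running them.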

llm_bench Prefill Performance (Jetson AGX Thor Only)

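These rows are the parsed synthetic llm_bench_prefill_* results from the v0.8.0 release benchmark outputs. Only the Jetson AGX Thor data contains parsed llm_bench_prefill_e2e_time_ms and llm_bench_prefill_tokens_per_sec values; the AGX Orin, Orin NX, and Orin Nano data include runtime prefill metrics (see the Runtime Performance Dashboard) but not these synthetic prefill E2E and tok/s fields. In the Precision column, VLM rows list the LLM precision followed by the visual encoder precision, and speculative decoding rows add the draft precision (EAGLE draft model or MTP heads) after the base model precision.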

| Platform | Model | Kind | Mode | Precision | Batch | Input Len | Prefill E2E (ms) | Prefill (tok/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jetson AGX Thor | NVIDIA-Nemotron-3-Nano-30B-A3B | LLM | Vanilla | NVFP4 | 1 | 2,048 | 228.1 | 8,978.9 |
| Jetson AGX Thor | NVIDIA-Nemotron-3-Nano-30B-A3B | LLM | Vanilla | NVFP4 | 8 | 2,048 | 1,385.7 | 1,478.0 |
| Jetson AGX Thor | NVIDIA-Nemotron-3-Nano-4B | LLM | Vanilla | NVFP4 | 1 | 2,048 | 1,152.7 | 1,776.7 |
| Jetson AGX Thor | NVIDIA-Nemotron-3-Nano-4B | LLM | Vanilla | NVFP4 | 8 | 2,048 | 5,347.0 | 383.0 |
| Jetson AGX Thor | Qwen3-0.6B | LLM | Vanilla | NVFP4 | 1 | 2,048 | 22.4 | 91,469.3 |
| Jetson AGX Thor | Qwen3-0.6B | LLM | Vanilla | NVFP4 | 8 | 2,048 | 248.5 | 8,241.0 |
| Jetson AGX Thor | Qwen3-1.7B | LLM | Vanilla | NVFP4 | 1 | 2,048 | 32.3 | 63,441.6 |
| Jetson AGX Thor | Qwen3-1.7B | LLM | Vanilla | NVFP4 | 8 | 2,048 | 356.8 | 5,739.6 |
| Jetson AGX Thor | Qwen3-4B-Instruct-2507 | LLM | Vanilla | NVFP4 | 1 | 2,048 | 67.1 | 30,539.5 |
| Jetson AGX Thor | Qwen3-4B-Instruct-2507 | LLM | Vanilla | NVFP4 | 8 | 2,048 | 823.8 | 2,486.0 |
| Jetson AGX Thor | Qwen3-8B | LLM | Vanilla | NVFP4 | 1 | 2,048 | 111.0 | 18,444.1 |
| Jetson AGX Thor | Qwen3-8B | LLM | Vanilla | NVFP4 | 8 | 2,048 | 1,245.9 | 1,643.8 |
| Jetson AGX Thor | Qwen3.5-0.8B-LLM | LLM | Vanilla | NVFP4 | 1 | 2,048 | 36.6 | 55,904.4 |
| Jetson AGX Thor | Qwen3.5-0.8B-LLM | LLM | Vanilla | NVFP4 | 8 | 2,048 | 349.0 | 5,868.1 |
| Jetson AGX Thor | Qwen3.5-27B-LLM | LLM | Vanilla | NVFP4 | 1 | 2,048 | 463.3 | 4,420.5 |
| Jetson AGX Thor | Qwen3.5-27B-LLM | LLM | Vanilla | NVFP4 | 8 | 2,048 | 4,268.6 | 479.8 |
| Jetson AGX Thor | Qwen3.5-2B-LLM | LLM | Vanilla | NVFP4 | 1 | 2,048 | 46.1 | 44,458.0 |
| Jetson AGX Thor | Qwen3.5-2B-LLM | LLM | Vanilla | NVFP4 | 8 | 2,048 | 435.6 | 4,701.8 |
| Jetson AGX Thor | Qwen3.5-4B-LLM | LLM | Vanilla | NVFP4 | 1 | 2,048 | 101.2 | 20,237.3 |
| Jetson AGX Thor | Qwen3.5-4B-LLM | LLM | Vanilla | NVFP4 | 8 | 2,048 | 997.1 | 2,053.9 |
| Jetson AGX Thor | Qwen3.5-9B-LLM | LLM | Vanilla | NVFP4 | 1 | 2,048 | 138.4 | 14,795.8 |
| Jetson AGX Thor | Qwen3.5-9B-LLM | LLM | Vanilla | NVFP4 | 8 | 2,048 | 1,411.1 | 1,451.3 |
| Jetson AGX Thor | Qwen3.6-35B-A3B-LLM | LLM | Vanilla | NVFP4 | 1 | 2,048 | 273.3 | 7,494.3 |
| Jetson AGX Thor | Qwen3-VL-2B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 1 | 2,048 | 32.6 | 62,763.4 |
| Jetson AGX Thor | Qwen3-VL-2B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 8 | 2,048 | 358.0 | 5,721.4 |
| Jetson AGX Thor | Qwen3-VL-4B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 1 | 2,048 | 68.5 | 29,908.3 |
| Jetson AGX Thor | Qwen3-VL-4B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 8 | 2,048 | 833.3 | 2,457.8 |
| Jetson AGX Thor | Qwen3-VL-8B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 1 | 2,048 | 112.2 | 18,254.6 |
| Jetson AGX Thor | Qwen3-VL-8B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 8 | 2,048 | 1,239.3 | 1,652.6 |
| Jetson AGX Thor | Qwen3.5-0.8B | VLM | Vanilla | NVFP4 / FP16 | 1 | 2,048 | 36.5 | 56,042.9 |
| Jetson AGX Thor | Qwen3.5-0.8B | VLM | Vanilla | NVFP4 / FP16 | 8 | 2,048 | 349.2 | 5,864.1 |
| Jetson AGX Thor | Qwen3.5-27B | VLM | Vanilla | NVFP4 / FP16 | 1 | 2,048 | 459.2 | 4,460.3 |
| Jetson AGX Thor | Qwen3.5-27B | VLM | Vanilla | NVFP4 / FP16 | 8 | 2,048 | 4,201.7 | 487.4 |
| Jetson AGX Thor | Qwen3.5-2B | VLM | Vanilla | NVFP4 / FP16 | 1 | 2,048 | 46.0 | 44,568.0 |
| Jetson AGX Thor | Qwen3.5-2B | VLM | Vanilla | NVFP4 / FP16 | 8 | 2,048 | 438.0 | 4,676.0 |
| Jetson AGX Thor | Qwen3.5-4B | VLM | Vanilla | NVFP4 / FP16 | 1 | 2,048 | 101.2 | 20,235.5 |
| Jetson AGX Thor | Qwen3.5-4B | VLM | Vanilla | NVFP4 / FP16 | 8 | 2,048 | 996.5 | 2,055.2 |
| Jetson AGX Thor | Qwen3.5-9B | VLM | Vanilla | NVFP4 / FP16 | 1 | 2,048 | 138.7 | 14,766.6 |
| Jetson AGX Thor | Qwen3.5-9B | VLM | Vanilla | NVFP4 / FP16 | 8 | 2,048 | 1,399.9 | 1,462.9 |
| Jetson AGX Thor | nemotron-omni-ea | VLM | Vanilla | NVFP4 / FP16 | 1 | 2,048 | 223.6 | 9,157.3 |
| Jetson AGX Thor | nemotron-omni-ea | VLM | Vanilla | NVFP4 / FP16 | 8 | 2,048 | 1,355.9 | 1,510.4 |
| Jetson AGX Thor | Qwen3-1.7B | LLM | EAGLE | NVFP4 / NVFP4 | 1 | 2,048 | 30.6 | 66,929.4 |
| Jetson AGX Thor | Qwen3-1.7B | LLM | EAGLE | NVFP4 / NVFP4 | 8 | 2,048 | 356.9 | 5,738.5 |
| Jetson AGX Thor | Qwen3-8B | LLM | EAGLE | NVFP4 / NVFP4 | 1 | 2,048 | 109.2 | 18,754.4 |
| Jetson AGX Thor | Qwen3-8B | LLM | EAGLE | NVFP4 / NVFP4 | 8 | 2,048 | 1,241.7 | 1,649.4 |
| Jetson AGX Thor | Qwen2.5-VL-7B-Instruct | VLM | EAGLE | NVFP4 / NVFP4 / FP16 | 1 | 2,048 | 81.8 | 25,038.0 |
| Jetson AGX Thor | Qwen2.5-VL-7B-Instruct | VLM | EAGLE | NVFP4 / NVFP4 / FP16 | 8 | 2,048 | 984.0 | 2,081.3 |
| Jetson AGX Thor | Qwen3-VL-4B-Instruct | VLM | EAGLE | NVFP4 / NVFP4 / FP16 | 1 | 2,048 | 64.6 | 31,680.9 |
| Jetson AGX Thor | Qwen3-VL-4B-Instruct | VLM | EAGLE | NVFP4 / NVFP4 / FP16 | 8 | 2,048 | 833.8 | 2,456.2 |
| Jetson AGX Thor | Qwen3.5-0.8B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 1 | 2,048 | 34.9 | 58,600.2 |
| Jetson AGX Thor | Qwen3.5-0.8B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 8 | 2,048 | 348.8 | 5,872.0 |
| Jetson AGX Thor | Qwen3.5-27B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 1 | 2,048 | 454.5 | 4,505.6 |
| Jetson AGX Thor | Qwen3.5-27B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 8 | 2,048 | 4,209.1 | 486.6 |
| Jetson AGX Thor | Qwen3.5-2B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 1 | 2,048 | 43.0 | 47,649.3 |
| Jetson AGX Thor | Qwen3.5-2B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 8 | 2,048 | 433.9 | 4,719.7 |
| Jetson AGX Thor | Qwen3.5-4B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 1 | 2,048 | 99.2 | 20,648.3 |
| Jetson AGX Thor | Qwen3.5-4B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 8 | 2,048 | 994.7 | 2,058.9 |
| Jetson AGX Thor | Qwen3.5-9B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 1 | 2,048 | 134.2 | 15,265.9 |
| Jetson AGX Thor | Qwen3.5-9B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 8 | 2,048 | 1,399.6 | 1,463.3 |
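In the llm_bench prefill table, Prefill (tok/s) equals Input Len divided by Prefill E2E and is not scaled by batch size: for the BS=8 NVIDIA-Nemotron-3-Nano-30B-A3B row, 2,048 tokens / 1,385.7 ms ≈ 1,478 tok/s. Assuming the E2E time covers every sequence in the batch, the aggregate prompt throughput is the batch size times that value. The helper below is a hypothetical sketch (not an llm_bench feature) that prints both figures.

```shell
# prefill_rates INPUT_LEN BATCH E2E_MS -> "per_sequence aggregate" in tok/s.
# per_sequence matches the table's Prefill (tok/s) column;
# aggregate assumes E2E_MS is the time to prefill the whole batch.
prefill_rates() {
  awk -v len="$1" -v bs="$2" -v ms="$3" 'BEGIN {
    per_seq = len * 1000 / ms
    printf "%.1f %.1f\n", per_seq, per_seq * bs
  }'
}
```

For that row, prefill_rates 2048 8 1385.7 prints 1478.0 11823.6. The runtime dashboard does not need this conversion: its Runtime Prefill Tok/Run already totals all sequences in the batch (3,062 tokens at BS=8 versus 383 at BS=1 for NVIDIA-Nemotron-3-Nano-30B-A3B on Thor), and its Runtime Prefill (tok/s) approximately equals that total divided by Runtime Prefill (ms).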

Runtime Performance Dashboard

All v0.8.0 runtime entries below were benchmarked under JetPack 7.2 and are split by device.

Jetson AGX Thor

| Model | Kind | Mode | Precision | Batch | Runtime Prefill (ms) | Runtime Prefill Tok/Run | Runtime Prefill (tok/s) | ViT (ms) | ViT Tok/Run | ViT (tok/s) | Generation (tok/s) | Accept Rate | GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NVIDIA-Nemotron-3-Nano-30B-A3B | LLM | Vanilla | NVFP4 | 1 | 104.8 | 383 | 3,653.0 | - | - | - | 72.5 | - | 19,987 |
| NVIDIA-Nemotron-3-Nano-30B-A3B | LLM | Vanilla | NVFP4 | 8 | 508.9 | 3,062 | 6,016.5 | - | - | - | 180.0 | - | 19,975 |
| NVIDIA-Nemotron-3-Nano-4B | LLM | Vanilla | NVFP4 | 1 | 120.8 | 383 | 3,169.5 | - | - | - | 66.4 | - | 3,538 |
| NVIDIA-Nemotron-3-Nano-4B | LLM | Vanilla | NVFP4 | 8 | 933.8 | 3,062 | 3,279.1 | - | - | - | 302.9 | - | 3,548 |
| Qwen3-0.6B | LLM | Vanilla | INT4 AWQ | 1 | 22.7 | 370 | 16,296.0 | - | - | - | 194.0 | - | 773 |
| Qwen3-0.6B | LLM | Vanilla | INT4 AWQ | 8 | 238.9 | 2,959 | 12,383.9 | - | - | - | 329.6 | - | 875 |
| Qwen3-0.6B | LLM | Vanilla | NVFP4 | 1 | 13.4 | 370 | 27,533.2 | - | - | - | 192.5 | - | 957 |
| Qwen3-0.6B | LLM | Vanilla | NVFP4 | 8 | 100.5 | 2,959 | 29,449.6 | - | - | - | 359.8 | - | 998 |
| Qwen3-1.7B | LLM | Vanilla | INT4 AWQ | 1 | 49.8 | 370 | 7,423.1 | - | - | - | 119.7 | - | 1,067 |
| Qwen3-1.7B | LLM | Vanilla | INT4 AWQ | 8 | 589.9 | 2,959 | 5,015.4 | - | - | - | 276.6 | - | 1,066 |
| Qwen3-1.7B | LLM | Vanilla | NVFP4 | 1 | 18.4 | 370 | 20,067.5 | - | - | - | 118.9 | - | 1,884 |
| Qwen3-1.7B | LLM | Vanilla | NVFP4 | 8 | 142.9 | 2,959 | 20,704.1 | - | - | - | 302.4 | - | 1,782 |
| Qwen3-30B-A3B | LLM | Vanilla | INT4 GPTQ | 1 | 140.2 | 370 | 2,638.4 | - | - | - | 75.1 | - | 14,305 |
| Qwen3-30B-A3B | LLM | Vanilla | INT4 GPTQ | 8 | 1,461.4 | 2,959 | 2,024.6 | - | - | - | 161.7 | - | 14,313 |
| Qwen3-4B-Instruct-2507 | LLM | Vanilla | INT4 AWQ | 1 | 111.9 | 364 | 3,251.6 | - | - | - | 66.2 | - | 1,846 |
| Qwen3-4B-Instruct-2507 | LLM | Vanilla | INT4 AWQ | 8 | 1,413.0 | 2,911 | 2,060.0 | - | - | - | 173.7 | - | 1,813 |
| Qwen3-4B-Instruct-2507 | LLM | Vanilla | NVFP4 | 1 | 31.5 | 364 | 11,568.8 | - | - | - | 67.1 | - | 3,168 |
| Qwen3-4B-Instruct-2507 | LLM | Vanilla | NVFP4 | 8 | 283.9 | 2,911 | 10,251.5 | - | - | - | 216.3 | - | 3,148 |
| Qwen3-8B | LLM | Vanilla | INT4 AWQ | 1 | 200.3 | 370 | 1,846.7 | - | - | - | 43.7 | - | 3,195 |
| Qwen3-8B | LLM | Vanilla | INT4 AWQ | 8 | 2,546.9 | 2,959 | 1,161.7 | - | - | - | 126.8 | - | 3,260 |
| Qwen3-8B | LLM | Vanilla | NVFP4 | 1 | 42.8 | 370 | 8,640.5 | - | - | - | 42.3 | - | 5,356 |
| Qwen3-8B | LLM | Vanilla | NVFP4 | 8 | 426.8 | 2,959 | 6,933.1 | - | - | - | 157.4 | - | 5,376 |
| Qwen3.5-0.8B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 30.7 | 377 | 12,271.7 | - | - | - | 214.5 | - | 951 |
| Qwen3.5-0.8B-LLM | LLM | Vanilla | INT4 AWQ | 8 | 370.8 | 3,013 | 8,126.8 | - | - | - | 717.8 | - | 962 |
| Qwen3.5-0.8B-LLM | LLM | Vanilla | NVFP4 | 1 | 13.8 | 377 | 27,269.9 | - | - | - | 216.2 | - | 1,176 |
| Qwen3.5-0.8B-LLM | LLM | Vanilla | NVFP4 | 8 | 114.6 | 3,013 | 26,298.4 | - | - | - | 926.5 | - | 1,172 |
| Qwen3.5-27B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 713.9 | 377 | 527.6 | - | - | - | 15.4 | - | 8,927 |
| Qwen3.5-27B-LLM | LLM | Vanilla | INT4 AWQ | 8 | 10,102.2 | 3,013 | 298.3 | - | - | - | 54.6 | - | 8,940 |
| Qwen3.5-27B-LLM | LLM | Vanilla | NVFP4 | 1 | 130.5 | 377 | 2,887.1 | - | - | - | 14.3 | - | 15,990 |
| Qwen3.5-27B-LLM | LLM | Vanilla | NVFP4 | 8 | 1,356.3 | 3,013 | 2,221.6 | - | - | - | 72.9 | - | 16,044 |
| Qwen3.5-2B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 56.6 | 377 | 6,650.8 | - | - | - | 118.8 | - | 1,443 |
| Qwen3.5-2B-LLM | LLM | Vanilla | INT4 AWQ | 8 | 690.0 | 3,013 | 4,366.8 | - | - | - | 459.4 | - | 1,446 |
| Qwen3.5-2B-LLM | LLM | Vanilla | NVFP4 | 1 | 18.8 | 377 | 19,993.5 | - | - | - | 118.8 | - | 2,131 |
| Qwen3.5-2B-LLM | LLM | Vanilla | NVFP4 | 8 | 148.7 | 3,013 | 20,261.3 | - | - | - | 601.0 | - | 2,138 |
| Qwen3.5-35B-A3B-LLM | LLM | Vanilla | INT4 GPTQ | 1 | 118.7 | 377 | 3,173.7 | - | - | - | 46.5 | - | 15,847 |
| Qwen3.5-35B-A3B-LLM | LLM | Vanilla | INT4 GPTQ | 8 | 973.7 | 3,013 | 3,094.7 | - | - | - | 175.0 | - | 15,840 |
| Qwen3.5-4B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 128.5 | 377 | 2,931.7 | - | - | - | 64.9 | - | 1,707 |
| Qwen3.5-4B-LLM | LLM | Vanilla | INT4 AWQ | 8 | 1,660.6 | 3,013 | 1,814.5 | - | - | - | 221.4 | - | 1,702 |
| Qwen3.5-4B-LLM | LLM | Vanilla | NVFP4 | 1 | 34.4 | 377 | 10,939.3 | - | - | - | 64.8 | - | 3,567 |
| Qwen3.5-4B-LLM | LLM | Vanilla | NVFP4 | 8 | 305.0 | 3,013 | 9,880.3 | - | - | - | 303.0 | - | 3,604 |
| Qwen3.5-9B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 219.5 | 377 | 1,715.8 | - | - | - | 41.2 | - | 2,863 |
| Qwen3.5-9B-LLM | LLM | Vanilla | INT4 AWQ | 8 | 3,031.4 | 3,013 | 994.0 | - | - | - | 144.6 | - | 2,866 |
| Qwen3.5-9B-LLM | LLM | Vanilla | NVFP4 | 1 | 46.4 | 377 | 8,112.3 | - | - | - | 39.4 | - | 6,110 |
| Qwen3.5-9B-LLM | LLM | Vanilla | NVFP4 | 8 | 449.7 | 3,013 | 6,700.4 | - | - | - | 181.2 | - | 6,108 |
| Qwen3.6-35B-A3B-LLM | LLM | Vanilla | NVFP4 | 1 | 144.2 | 377 | 2,611.8 | - | - | - | 69.0 | - | 21,442 |
| Qwen3.6-35B-A3B-LLM | LLM | Vanilla | NVFP4 | 8 | 568.7 | 3,013 | 5,298.1 | - | - | - | 200.2 | - | 21,490 |
| Qwen3-VL-2B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 38.2 | 283 | 7,399.3 | 12.6 | 263 | 20,790.0 | 119.0 | - | 1,422 |
| Qwen3-VL-2B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 321.0 | 2,196 | 6,840.7 | 85.9 | 2,036 | 23,696.7 | 273.6 | - | 1,409 |
| Qwen3-VL-2B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 1 | 14.8 | 283 | 19,152.2 | 12.6 | 262 | 20,790.0 | 117.8 | - | 1,867 |
| Qwen3-VL-2B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 8 | 92.1 | 2,196 | 23,828.6 | 85.6 | 2,039 | 23,809.5 | 309.2 | - | 1,910 |
| Qwen3-VL-4B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 89.2 | 283 | 3,169.7 | 13.0 | 262 | 20,161.3 | 67.0 | - | 1,899 |
| Qwen3-VL-4B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 775.3 | 2,196 | 2,832.1 | 85.8 | 2,039 | 23,753.0 | 164.2 | - | 1,914 |
| Qwen3-VL-4B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 1 | 28.4 | 283 | 9,963.3 | 12.7 | 262 | 20,746.9 | 66.6 | - | 3,228 |
| Qwen3-VL-4B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 8 | 186.2 | 2,196 | 11,791.3 | 85.5 | 2,036 | 23,809.5 | 181.1 | - | 3,242 |
| Qwen3-VL-8B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 155.8 | 283 | 1,813.9 | 17.3 | 263 | 15,174.5 | 44.1 | - | 3,283 |
| Qwen3-VL-8B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 1,354.5 | 2,196 | 1,621.0 | 119.6 | 2,037 | 17,035.8 | 127.3 | - | 3,243 |
| Qwen3-VL-8B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 1 | 38.4 | 283 | 7,354.1 | 17.3 | 262 | 15,151.5 | 41.9 | - | 5,484 |
| Qwen3-VL-8B-Instruct | VLM | Vanilla | NVFP4 / FP16 | 8 | 260.7 | 2,196 | 8,422.7 | 119.5 | 2,037 | 17,035.8 | 150.5 | - | 5,482 |
| Qwen3.5-0.8B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 21.4 | 287 | 13,382.7 | 4.3 | 262 | 60,975.6 | 209.8 | - | 1,045 |
| Qwen3.5-0.8B | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 191.0 | 2,227 | 11,660.9 | 30.6 | 2,041 | 66,666.7 | 593.7 | - | 1,085 |
| Qwen3.5-0.8B | VLM | Vanilla | NVFP4 / FP16 | 1 | 10.0 | 287 | 28,729.9 | 4.3 | 262 | 60,975.6 | 210.9 | - | 1,260 |
| Qwen3.5-0.8B | VLM | Vanilla | NVFP4 / FP16 | 8 | 63.6 | 2,227 | 34,989.3 | 30.6 | 2,042 | 66,666.7 | 758.0 | - | 1,268 |
| Qwen3.5-27B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 558.3 | 287 | 513.4 | 16.2 | 263 | 16,181.2 | 15.5 | - | 9,128 |
| Qwen3.5-27B | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 5,186.5 | 2,227 | 429.4 | 114.6 | 2,039 | 17,793.6 | 54.7 | - | 9,011 |
| Qwen3.5-27B | VLM | Vanilla | NVFP4 / FP16 | 1 | 103.6 | 287 | 2,765.5 | 16.1 | 262 | 16,286.6 | 14.3 | - | 16,087 |
| Qwen3.5-27B | VLM | Vanilla | NVFP4 / FP16 | 8 | 744.7 | 2,227 | 2,990.2 | 114.4 | 2,040 | 17,825.3 | 74.3 | - | 16,119 |
| Qwen3.5-2B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 41.8 | 287 | 6,853.9 | 11.7 | 262 | 22,321.4 | 115.9 | - | 1,562 |
| Qwen3.5-2B | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 354.2 | 2,227 | 6,286.8 | 83.3 | 2,037 | 24,449.9 | 419.2 | - | 1,639 |
| Qwen3.5-2B | VLM | Vanilla | NVFP4 / FP16 | 1 | 14.3 | 287 | 20,005.7 | 11.7 | 262 | 22,471.9 | 117.7 | - | 2,213 |
| Qwen3.5-2B | VLM | Vanilla | NVFP4 / FP16 | 8 | 79.1 | 2,227 | 28,157.1 | 82.1 | 2,037 | 24,813.9 | 450.7 | - | 2,228 |
| Qwen3.5-35B-A3B | VLM | Vanilla | INT4 GPTQ / FP16 | 1 | 109.8 | 287 | 2,609.4 | 16.0 | 262 | 16,366.6 | 46.2 | - | 15,942 |
| Qwen3.5-35B-A3B | VLM | Vanilla | INT4 GPTQ / FP16 | 8 | 522.2 | 2,227 | 4,264.3 | 113.4 | 2,040 | 17,985.6 | 181.6 | - | 15,944 |
| Qwen3.5-4B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 97.2 | 287 | 2,948.2 | 12.0 | 262 | 21,881.8 | 65.0 | - | 1,788 |
| Qwen3.5-4B | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 856.1 | 2,227 | 2,601.2 | 82.7 | 2,036 | 24,630.5 | 218.1 | - | 1,787 |
| Qwen3.5-4B | VLM | Vanilla | NVFP4 / FP16 | 1 | 26.6 | 287 | 10,781.8 | 11.7 | 262 | 22,371.4 | 64.4 | - | 3,678 |
| Qwen3.5-4B | VLM | Vanilla | NVFP4 / FP16 | 8 | 166.8 | 2,227 | 13,348.1 | 82.0 | 2,040 | 24,875.6 | 287.2 | - | 3,669 |
| Qwen3.5-9B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 170.7 | 287 | 1,678.6 | 16.1 | 262 | 16,286.6 | 41.1 | - | 2,953 |
| Qwen3.5-9B | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 1,560.5 | 2,227 | 1,427.0 | 113.2 | 2,037 | 17,985.6 | 145.7 | - | 2,953 |
| Qwen3.5-9B | VLM | Vanilla | NVFP4 / FP16 | 1 | 38.2 | 287 | 7,512.9 | 16.0 | 262 | 16,366.6 | 39.4 | - | 6,178 |
| Qwen3.5-9B | VLM | Vanilla | NVFP4 / FP16 | 8 | 243.6 | 2,227 | 9,140.3 | 113.2 | 2,040 | 18,018.0 | 183.2 | - | 6,207 |
| Qwen3.6-35B-A3B | VLM | Vanilla | NVFP4 / FP16 | 1 | 144.8 | 287 | 1,979.8 | 16.1 | 262 | 16,313.2 | 68.5 | - | 21,607 |
| Qwen3.6-35B-A3B | VLM | Vanilla | NVFP4 / FP16 | 8 | 350.8 | 2,227 | 6,347.9 | 113.2 | 2,039 | 18,018.0 | 199.3 | - | 21,569 |
| nemotron-omni-ea | VLM | Vanilla | NVFP4 / FP16 | 1 | 209.7 | 1,663 | 7,932.3 | 126.0 | 1,634 | 12,970.2 | 67.5 | - | 20,259 |
| nemotron-omni-ea | VLM | Vanilla | NVFP4 / FP16 | 8 | 1,182.8 | 12,922 | 10,925.0 | 982.5 | 12,694 | 12,919.9 | 106.9 | - | 20,257 |
| Qwen3-1.7B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 1 | 49.5 | 370 | 7,464.8 | - | - | - | 182.9 | 3.89 | 1,067 |
| Qwen3-1.7B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 8 | 590.6 | 2,959 | 5,010.0 | - | - | - | 296.5 | 3.87 | 1,105 |
| Qwen3-1.7B | LLM | EAGLE | NVFP4 / NVFP4 | 1 | 16.0 | 370 | 23,065.1 | - | - | - | 284.9 | 3.84 | 1,345 |
| Qwen3-1.7B | LLM | EAGLE | NVFP4 / NVFP4 | 8 | 141.6 | 2,959 | 20,900.3 | - | - | - | 618.0 | 3.81 | 1,371 |
| Qwen3-8B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 1 | 199.9 | 370 | 1,849.8 | - | - | - | 70.0 | 4.15 | 3,156 |
| Qwen3-8B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 8 | 2,550.1 | 2,959 | 1,160.3 | - | - | - | 95.7 | 4.11 | 3,174 |
| Qwen3-8B | LLM | EAGLE | NVFP4 / NVFP4 | 1 | 39.3 | 370 | 9,421.5 | - | - | - | 135.2 | 4.06 | 4,499 |
| Qwen3-8B | LLM | EAGLE | NVFP4 / NVFP4 | 8 | 428.2 | 2,959 | 6,910.2 | - | - | - | 341.0 | 4.05 | 4,550 |
| Qwen2.5-VL-7B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 1 | 180.9 | 376 | 2,076.4 | 52.5 | 344 | 6,561.7 | 84.4 | 5.15 | 3,406 |
| Qwen2.5-VL-7B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 8 | 1,613.3 | 2,919 | 1,809.0 | 429.4 | 2,675 | 6,230.5 | 116.7 | 5.07 | 3,371 |
| Qwen2.5-VL-7B-Instruct | VLM | EAGLE | NVFP4 / NVFP4 / FP16 | 1 | 28.6 | 376 | 13,119.4 | 52.5 | 344 | 6,557.4 | 189.6 | 5.09 | 4,308 |
| Qwen2.5-VL-7B-Instruct | VLM | EAGLE | NVFP4 / NVFP4 / FP16 | 8 | 258.9 | 2,919 | 11,271.9 | 431.0 | 2,675 | 6,207.3 | 383.8 | 4.90 | 4,319 |
| Qwen3-VL-4B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 1 | 89.0 | 283 | 3,174.8 | 12.9 | 262 | 20,242.9 | 126.5 | 5.02 | 1,900 |
| Qwen3-VL-4B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 8 | 774.9 | 2,196 | 2,833.6 | 85.8 | 2,038 | 23,753.0 | 187.9 | 4.96 | 1,894 |
| Qwen3-VL-4B-Instruct | VLM | EAGLE | NVFP4 / NVFP4 / FP16 | 1 | 26.2 | 283 | 10,789.2 | 12.6 | 262 | 20,833.3 | 199.7 | 4.86 | 2,685 |
| Qwen3-VL-4B-Instruct | VLM | EAGLE | NVFP4 / NVFP4 / FP16 | 8 | 186.4 | 2,196 | 11,778.3 | 85.4 | 2,039 | 23,866.3 | 418.8 | 4.87 | 2,700 |
| Qwen3-VL-8B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 1 | 156.2 | 283 | 1,809.5 | 17.3 | 262 | 15,151.5 | 48.2 | 2.85 | 3,252 |
| Qwen3-VL-8B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 8 | 1,354.6 | 2,196 | 1,621.0 | 120.4 | 2,037 | 16,920.5 | 65.7 | 2.82 | 3,251 |
| Qwen3.5-0.8B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 21.2 | 287 | 13,492.0 | 4.3 | 262 | 60,606.1 | 200.3 | 2.18 | 1,179 |
| Qwen3.5-0.8B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 8 | 234.5 | 2,227 | 9,496.0 | 30.7 | 2,032 | 66,225.2 | 517.8 | 2.17 | 1,139 |
| Qwen3.5-0.8B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 1 | 8.4 | 287 | 34,076.3 | 4.3 | 261 | 60,241.0 | 365.6 | 2.19 | 1,069 |
| Qwen3.5-0.8B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 8 | 59.2 | 2,227 | 37,589.6 | 30.6 | 2,042 | 66,666.7 | 979.9 | 2.15 | 1,191 |
| Qwen3.5-27B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 558.0 | 287 | 513.6 | 16.2 | 262 | 16,233.8 | 27.9 | 2.89 | 9,022 |
| Qwen3.5-27B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 8 | 5,188.0 | 2,227 | 429.2 | 114.3 | 2,037 | 17,825.3 | 92.6 | 2.89 | 9,062 |
| Qwen3.5-27B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 1 | 96.5 | 287 | 2,970.2 | 16.1 | 263 | 16,286.6 | 36.1 | 2.82 | 14,342 |
| Qwen3.5-27B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 8 | 736.8 | 2,227 | 3,022.2 | 114.6 | 2,040 | 17,793.6 | 140.6 | 2.83 | 14,387 |
| Qwen3.5-2B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 41.6 | 287 | 6,886.3 | 12.0 | 262 | 21,929.8 | 123.6 | 2.39 | 1,623 |
| Qwen3.5-2B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 8 | 412.4 | 2,227 | 5,399.3 | 82.0 | 2,036 | 24,813.9 | 384.0 | 2.40 | 1,629 |
| Qwen3.5-2B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 1 | 11.5 | 287 | 24,978.3 | 11.8 | 262 | 22,222.2 | 246.6 | 2.44 | 1,559 |
| Qwen3.5-2B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 8 | 76.8 | 2,227 | 28,982.7 | 82.6 | 2,040 | 24,691.4 | 718.4 | 2.37 | 1,564 |
| Qwen3.5-4B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 97.3 | 287 | 2,945.2 | 12.1 | 263 | 21,786.5 | 79.9 | 2.53 | 1,914 |
| Qwen3.5-4B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 8 | 970.9 | 2,227 | 2,293.5 | 82.9 | 2,037 | 24,570.0 | 257.7 | 2.52 | 1,890 |
| Qwen3.5-4B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 1 | 23.4 | 287 | 12,229.4 | 11.7 | 262 | 22,471.9 | 135.7 | 2.54 | 2,797 |
| Qwen3.5-4B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 8 | 163.0 | 2,227 | 13,665.8 | 82.9 | 2,037 | 24,570.0 | 408.9 | 2.54 | 2,816 |
| Qwen3.5-9B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 171.0 | 287 | 1,676.1 | 16.1 | 262 | 16,286.6 | 57.5 | 2.78 | 2,945 |
| Qwen3.5-9B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 8 | 1,560.3 | 2,227 | 1,427.2 | 113.7 | 2,037 | 17,921.1 | 202.6 | 2.80 | 2,952 |
| Qwen3.5-9B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 1 | 32.7 | 287 | 8,760.5 | 16.1 | 262 | 16,286.6 | 93.0 | 2.71 | 4,783 |
| Qwen3.5-9B | VLM | MTP | NVFP4 / NVFP4 / FP16 | 8 | 235.4 | 2,227 | 9,458.1 | 113.6 | 2,039 | 17,953.3 | 329.5 | 2.72 | 4,842 |
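The dashboard has no Speedup column, but the Speedup definition (speculative generation throughput / vanilla generation throughput, same model, precision, and batch size) can be applied directly to its rows. The awk sketch below does this for CSV lines copied from the table; the speedup_table name and the five-field CSV layout are hypothetical, and speculative rows are keyed by their base-model precision so they match the corresponding Vanilla row.

```shell
# speedup_table: reads CSV lines "model,mode,base_precision,batch,gen_tok_s" on stdin
# and prints "model,precision,batch,speedup" for every non-Vanilla row that has a
# matching Vanilla row (same model, precision, and batch size). Output order is unspecified.
speedup_table() {
  awk -F, '
    $2 == "Vanilla" { vanilla[$1 FS $3 FS $4] = $5; next }
    { spec[$1 FS $3 FS $4] = $5 }
    END {
      for (k in spec)
        if (k in vanilla)
          printf "%s,%.2f\n", k, spec[k] / vanilla[k]
    }'
}
```

Feeding it the Qwen3-1.7B and Qwen3-8B NVFP4 BS=1 rows from the Jetson AGX Thor table (vanilla 118.9 and 42.3 tok/s, EAGLE 284.9 and 135.2 tok/s) gives speedups of about 2.40x and 3.20x.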

Jetson AGX Orin 64GB

| Model | Kind | Mode | Precision | Batch | Runtime Prefill (ms) | Runtime Prefill Tok/Run | Runtime Prefill (tok/s) | ViT (ms) | ViT Tok/Run | ViT (tok/s) | Generation (tok/s) | Accept Rate | GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | LLM | Vanilla | INT4 AWQ | 1 | 27.3 | 370 | 13,557.9 | - | - | - | 177.2 | - | 1,968 |
| Qwen3-0.6B | LLM | Vanilla | INT4 AWQ | 8 | 275.9 | 2,959 | 10,723.5 | - | - | - | 584.0 | - | 4,188 |
| Qwen3-1.7B | LLM | Vanilla | INT4 AWQ | 1 | 63.5 | 370 | 5,822.7 | - | - | - | 95.5 | - | 3,292 |
| Qwen3-1.7B | LLM | Vanilla | INT4 AWQ | 8 | 787.5 | 2,959 | 3,757.3 | - | - | - | 406.7 | - | 5,715 |
| Qwen3-30B-A3B | LLM | Vanilla | INT4 GPTQ | 1 | 206.4 | 370 | 1,792.0 | - | - | - | 55.2 | - | 30,025 |
| Qwen3-30B-A3B | LLM | Vanilla | INT4 GPTQ | 8 | 2,094.1 | 2,959 | 1,412.9 | - | - | - | 154.2 | - | 31,613 |
| Qwen3-4B-Instruct-2507 | LLM | Vanilla | INT4 AWQ | 1 | 150.6 | 364 | 2,416.5 | - | - | - | 52.0 | - | 4,933 |
| Qwen3-4B-Instruct-2507 | LLM | Vanilla | INT4 AWQ | 8 | 1,952.8 | 2,911 | 1,490.6 | - | - | - | 211.4 | - | 8,194 |
| Qwen3-8B | LLM | Vanilla | INT4 AWQ | 1 | 273.9 | 370 | 1,350.5 | - | - | - | 32.1 | - | 8,237 |
| Qwen3-8B | LLM | Vanilla | INT4 AWQ | 8 | 3,561.2 | 2,959 | 830.8 | - | - | - | 139.0 | - | 11,229 |
| Qwen3.5-0.8B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 124.2 | 377 | 3,033.3 | - | - | - | 145.7 | - | 2,329 |
| Qwen3.5-0.8B-LLM | LLM | Vanilla | INT4 AWQ | 8 | 939.4 | 3,013 | 3,207.4 | - | - | - | 633.3 | - | 3,333 |
| Qwen3.5-27B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 1,335.8 | 377 | 282.0 | - | - | - | 10.5 | - | 24,136 |
| Qwen3.5-27B-LLM | LLM | Vanilla | INT4 AWQ | 8 | 16,963.3 | 3,013 | 177.6 | - | - | - | 44.5 | - | 26,293 |
| Qwen3.5-2B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 161.9 | 377 | 2,326.5 | - | - | - | 81.6 | - | 4,197 |
| Qwen3.5-2B-LLM | LLM | Vanilla | INT4 AWQ | 8 | 1,447.5 | 3,013 | 2,081.7 | - | - | - | 393.6 | - | 5,117 |
| Qwen3.5-35B-A3B-LLM | LLM | Vanilla | INT4 GPTQ | 1 | 340.7 | 377 | 1,105.4 | - | - | - | 31.0 | - | 35,742 |
| Qwen3.5-35B-A3B-LLM | LLM | Vanilla | INT4 GPTQ | 8 | 2,807.2 | 3,013 | 1,073.4 | - | - | - | 128.5 | - | 36,654 |
| Qwen3.5-4B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 307.0 | 377 | 1,226.9 | - | - | - | 45.4 | - | 6,147 |
| Qwen3.5-4B-LLM | LLM | Vanilla | INT4 AWQ | 8 | 3,428.1 | 3,013 | 879.0 | - | - | - | 187.7 | - | 7,659 |
| Qwen3.5-9B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 437.5 | 377 | 860.9 | - | - | - | 27.9 | - | 9,983 |
| Qwen3.5-9B-LLM | LLM | Vanilla | INT4 AWQ | 8 | 5,122.9 | 3,013 | 588.2 | - | - | - | 125.8 | - | 11,278 |
| Qwen3-VL-2B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 48.0 | 283 | 5,883.5 | 38.7 | 262 | 6,775.1 | 95.7 | - | 8,235 |
| Qwen3-VL-2B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 408.3 | 2,196 | 5,378.2 | 276.1 | 2,039 | 7,385.5 | 388.8 | - | 10,304 |
| Qwen3-VL-4B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 118.6 | 283 | 2,384.0 | 39.0 | 262 | 6,724.9 | 52.2 | - | 9,863 |
| Qwen3-VL-4B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 1,024.0 | 2,196 | 2,144.3 | 277.4 | 2,038 | 7,347.5 | 176.4 | - | 12,511 |
| Qwen3-VL-8B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 213.2 | 283 | 1,325.8 | 56.5 | 262 | 4,646.8 | 32.2 | - | 13,133 |
| Qwen3-VL-8B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 1,842.2 | 2,196 | 1,191.9 | 398.8 | 2,039 | 5,112.5 | 141.0 | - | 16,014 |
| Qwen3.5-0.8B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 94.0 | 287 | 3,050.3 | 13.0 | 262 | 20,161.3 | 141.4 | - | 4,549 |
| Qwen3.5-0.8B | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 501.1 | 2,227 | 4,444.2 | 89.9 | 2,038 | 22,675.7 | 451.8 | - | 5,058 |
| Qwen3.5-27B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 1,037.4 | 287 | 276.3 | 53.5 | 262 | 4,906.8 | 10.5 | - | 24,048 |
| Qwen3.5-27B | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 8,777.6 | 2,227 | 253.7 | 382.1 | 2,038 | 5,333.3 | 44.2 | - | 26,372 |
| Qwen3.5-2B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 120.8 | 287 | 2,372.0 | 36.6 | 262 | 7,168.5 | 81.0 | - | 6,886 |
| Qwen3.5-2B | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 766.8 | 2,227 | 2,903.9 | 264.7 | 2,038 | 7,698.2 | 303.9 | - | 7,456 |
| Qwen3.5-35B-A3B | VLM | Vanilla | INT4 GPTQ / FP16 | 1 | 297.1 | 287 | 964.7 | 52.8 | 262 | 4,970.2 | 30.8 | - | 35,810 |
| Qwen3.5-35B-A3B | VLM | Vanilla | INT4 GPTQ / FP16 | 8 | 1,537.3 | 2,227 | 1,448.6 | 379.6 | 2,038 | 5,367.7 | 131.4 | - | 36,733 |
| Qwen3.5-4B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 234.9 | 287 | 1,220.0 | 36.6 | 262 | 7,158.2 | 45.3 | - | 8,638 |
| Qwen3.5-4B | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 1,815.4 | 2,227 | 1,226.7 | 265.5 | 2,038 | 7,674.6 | 174.8 | - | 9,798 |
| Qwen3.5-9B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 341.5 | 287 | 839.2 | 53.2 | 262 | 4,931.0 | 27.9 | - | 12,247 |
| Qwen3.5-9B | VLM | Vanilla | INT4 AWQ / FP16 | 8 | 2,672.0 | 2,227 | 833.4 | 380.9 | 2,038 | 5,350.5 | 121.7 | - | 13,410 |
| Qwen3-1.7B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 1 | 63.8 | 370 | 5,797.2 | - | - | - | 152.4 | 3.87 | 3,362 |
| Qwen3-1.7B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 8 | 784.4 | 2,959 | 3,772.0 | - | - | - | 250.7 | 3.85 | 6,949 |
| Qwen3-8B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 1 | 274.3 | 370 | 1,348.6 | - | - | - | 57.8 | 4.18 | 8,349 |
| Qwen3-8B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 8 | 3,567.7 | 2,959 | 829.3 | - | - | - | 70.4 | 4.16 | 12,969 |
| Qwen2.5-VL-7B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 1 | 255.2 | 376 | 1,472.3 | 86.1 | 344 | 3,998.4 | 67.2 | 5.05 | 12,218 |
| Qwen2.5-VL-7B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 8 | 2,279.8 | 2,919 | 1,280.2 | 671.2 | 2,675 | 3,985.7 | 82.5 | 4.97 | 14,145 |
| Qwen3-VL-4B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 1 | 118.0 | 283 | 2,396.0 | 39.1 | 262 | 6,711.4 | 98.3 | 5.05 | 10,310 |
| Qwen3-VL-4B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 8 | 1,025.0 | 2,196 | 2,142.2 | 276.8 | 2,038 | 7,363.8 | 147.9 | 5.07 | 13,669 |
| Qwen3-VL-8B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 1 | 213.6 | 283 | 1,323.0 | 56.6 | 262 | 4,633.9 | 39.1 | 2.81 | 13,932 |
| Qwen3-VL-8B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 8 | 1,843.2 | 2,196 | 1,191.3 | 399.6 | 2,038 | 5,099.4 | 48.1 | 2.83 | 17,413 |
| Qwen3.5-0.8B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 92.7 | 287 | 3,092.1 | 13.0 | 262 | 20,242.9 | 138.9 | 2.19 | 5,183 |
| Qwen3.5-0.8B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 8 | 493.8 | 2,227 | 4,509.3 | 90.1 | 2,038 | 22,624.4 | 413.7 | 2.20 | 6,396 |
| Qwen3.5-27B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 1,038.2 | 287 | 276.1 | 53.2 | 262 | 4,931.0 | 18.3 | 2.89 | 25,664 |
| Qwen3.5-27B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 8 | 8,790.5 | 2,227 | 253.3 | 383.0 | 2,038 | 5,322.0 | 69.5 | 2.87 | 33,332 |
| Qwen3.5-2B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 120.4 | 287 | 2,381.6 | 36.6 | 262 | 7,173.6 | 86.4 | 2.41 | 8,027 |
| Qwen3.5-2B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 8 | 762.3 | 2,227 | 2,921.4 | 265.6 | 2,038 | 7,674.6 | 293.7 | 2.39 | 9,382 |
| Qwen3.5-4B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 234.7 | 287 | 1,221.4 | 36.7 | 262 | 7,142.9 | 54.6 | 2.53 | 10,181 |
| Qwen3.5-4B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 8 | 1,811.5 | 2,227 | 1,229.3 | 265.9 | 2,037 | 7,662.8 | 204.4 | 2.50 | 12,976 |
| Qwen3.5-9B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 341.6 | 287 | 839.1 | 53.0 | 262 | 4,952.9 | 40.4 | 2.82 | 14,598 |
| Qwen3.5-9B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 8 | 2,670.7 | 2,227 | 833.8 | 379.2 | 2,038 | 5,373.5 | 159.9 | 2.78 | 17,548 |

Jetson Orin NX 16GB

| Model | Kind | Mode | Precision | Batch | Runtime Prefill (ms) | Runtime Prefill Tok/Run | Runtime Prefill (tok/s) | ViT (ms) | ViT Tok/Run | ViT (tok/s) | Generation (tok/s) | Accept Rate | GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | LLM | Vanilla | INT4 AWQ | 1 | 52.4 | 370 | 7,064.6 | - | - | - | 110.8 | - | 2,027 |
| Qwen3-1.7B | LLM | Vanilla | INT4 AWQ | 1 | 127.4 | 370 | 2,903.3 | - | - | - | 58.8 | - | 3,282 |
| Qwen3-4B-Instruct-2507 | LLM | Vanilla | INT4 AWQ | 1 | 306.8 | 364 | 1,186.0 | - | - | - | 30.4 | - | 4,914 |
| Qwen3-8B | LLM | Vanilla | INT4 AWQ | 1 | 615.1 | 370 | 601.3 | - | - | - | 18.6 | - | 8,210 |
| Qwen3.5-0.8B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 174.8 | 377 | 2,154.5 | - | - | - | 88.2 | - | 2,286 |
| Qwen3.5-2B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 249.4 | 377 | 1,510.3 | - | - | - | 48.3 | - | 4,235 |
| Qwen3.5-4B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 535.8 | 377 | 703.0 | - | - | - | 26.2 | - | 6,093 |
| Qwen3-VL-2B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 98.6 | 283 | 2,866.0 | 81.6 | 262 | 3,214.4 | 58.9 | - | 4,406 |
| Qwen3-VL-4B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 238.5 | 283 | 1,184.8 | 83.3 | 262 | 3,148.6 | 30.8 | - | 5,990 |
| Qwen3-VL-8B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 455.2 | 283 | 620.9 | 119.6 | 262 | 2,193.0 | 18.8 | - | 9,092 |
| Qwen3.5-0.8B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 136.6 | 287 | 2,099.0 | 26.7 | 262 | 9,832.8 | 87.6 | - | 2,704 |
| Qwen3.5-2B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 193.6 | 287 | 1,480.4 | 77.5 | 262 | 3,386.4 | 48.0 | - | 4,704 |
| Qwen3.5-4B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 414.6 | 287 | 691.3 | 79.0 | 262 | 3,321.2 | 25.9 | - | 6,372 |
| Qwen3-1.7B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 1 | 129.0 | 370 | 2,867.6 | - | - | - | 86.5 | 3.87 | 3,394 |
| Qwen3-8B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 1 | 651.4 | 370 | 567.8 | - | - | - | 29.7 | 4.14 | 8,358 |
| Qwen2.5-VL-7B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 1 | 505.7 | 376 | 742.9 | 200.7 | 344 | 1,715.6 | 34.7 | 5.05 | 9,208 |
| Qwen3-VL-4B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 1 | 237.8 | 283 | 1,188.7 | 83.1 | 262 | 3,156.6 | 56.2 | 5.11 | 6,427 |
| Qwen3-VL-8B-Instruct | VLM | EAGLE | INT4 AWQ / INT4 AWQ / FP16 | 1 | 488.9 | 283 | 578.0 | 121.0 | 262 | 2,167.8 | 20.4 | 2.84 | 9,565 |
| Qwen3.5-0.8B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 135.8 | 287 | 2,111.0 | 26.7 | 262 | 9,813.5 | 77.7 | 2.16 | 3,296 |
| Qwen3.5-2B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 193.3 | 287 | 1,482.8 | 77.8 | 262 | 3,372.7 | 46.8 | 2.38 | 5,758 |
| Qwen3.5-4B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 410.7 | 287 | 697.8 | 78.0 | 262 | 3,363.6 | 28.9 | 2.52 | 7,838 |

Jetson Orin Nano 8GB

| Model | Kind | Mode | Precision | Batch | Runtime Prefill (ms) | Runtime Prefill Tok/Run | Runtime Prefill (tok/s) | ViT (ms) | ViT Tok/Run | ViT (tok/s) | Generation (tok/s) | Accept Rate | GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | LLM | Vanilla | INT4 AWQ | 1 | 94.7 | 370 | 3,904.2 | - | - | - | 69.4 | - | 1,999 |
| Qwen3-1.7B | LLM | Vanilla | INT4 AWQ | 1 | 231.5 | 370 | 1,597.9 | - | - | - | 36.3 | - | 3,286 |
| Qwen3.5-0.8B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 316.3 | 377 | 1,190.9 | - | - | - | 55.0 | - | 2,294 |
| Qwen3.5-2B-LLM | LLM | Vanilla | INT4 AWQ | 1 | 445.0 | 377 | 846.5 | - | - | - | 29.8 | - | 4,145 |
| Qwen3-VL-2B-Instruct | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 181.7 | 283 | 1,555.1 | 150.6 | 262 | 1,741.6 | 36.3 | - | 4,444 |
| Qwen3.5-0.8B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 244.0 | 287 | 1,174.8 | 48.8 | 262 | 5,376.3 | 54.4 | - | 2,656 |
| Qwen3.5-2B | VLM | Vanilla | INT4 AWQ / FP16 | 1 | 349.8 | 287 | 819.3 | 142.4 | 262 | 1,842.6 | 29.4 | - | 4,649 |
| Qwen3-1.7B | LLM | EAGLE | INT4 AWQ / INT4 AWQ | 1 | 232.2 | 370 | 1,593.0 | - | - | - | 48.7 | 3.87 | 3,398 |
| Qwen3.5-0.8B | VLM | MTP | INT4 AWQ / INT4 AWQ / FP16 | 1 | 244.1 | 287 | 1,174.1 | 48.7 | 262 | 5,387.9 | 46.9 | 2.16 | 3,278 |


v0.7.1 Results#

SDK Version: TensorRT Edge-LLM 0.7.1  |  TensorRT: 10.13.3.9

LLM — Vanilla Decoding#

| Model | Precision | Batch | Prefill (ms) | Prefill Tokens | Prefill (tok/s) | Generation (tok/s) | GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B | NVFP4 | 1 | 11.2 | 370 | 32,990 | 173.2 | 1,531 |
| Qwen3-1.7B | NVFP4 | 8 | 132.5 | 2,959 | 22,324 | 935.8 | 1,475 |
| Qwen3-30B-A3B-GPTQ-Int4 | INT4 GPTQ | 1 | 133.2 | 370 | 2,777 | 72.6 | 15,916 |
| Qwen3-30B-A3B-GPTQ-Int4 | INT4 GPTQ | 8 | 1,344.5 | 2,959 | 2,201 | 215.9 | 15,894 |
| Nemotron-3-Nano-4B | NVFP4 | 1 | 127.4 | 383 | 3,004 | 64.7 | 3,568 |
| Nemotron-3-Nano-4B | NVFP4 | 8 | 986.6 | 3,062 | 3,104 | 312.9 | 3,592 |
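Prefill throughput is approximately prompt tokens divided by prefill time. The minimal sketch below recomputes it from the v0.7.1 LLM vanilla rows. The reported column is presumably averaged per run rather than computed from the averaged time (an assumption), so expect agreement to within about 1%, not an exact match. At BS=8 the token count appears to be the total across all sequences, so the result is aggregate throughput.

```python
# Rows from the v0.7.1 LLM vanilla decoding table:
# (model, batch, prefill_ms, prefill_tokens, reported_prefill_tok_s)
ROWS = [
    ("Qwen3-1.7B", 1, 11.2, 370, 32_990),
    ("Qwen3-1.7B", 8, 132.5, 2_959, 22_324),
    ("Qwen3-30B-A3B-GPTQ-Int4", 1, 133.2, 370, 2_777),
    ("Qwen3-30B-A3B-GPTQ-Int4", 8, 1_344.5, 2_959, 2_201),
    ("Nemotron-3-Nano-4B", 1, 127.4, 383, 3_004),
    ("Nemotron-3-Nano-4B", 8, 986.6, 3_062, 3_104),
]


def prefill_tok_per_s(prefill_tokens: int, prefill_ms: float) -> float:
    """Prompt tokens processed per second; prefill time is given in milliseconds."""
    return prefill_tokens / (prefill_ms / 1_000.0)


def relative_error(derived: float, reported: float) -> float:
    """Fractional difference between a recomputed value and the reported one."""
    return abs(derived - reported) / reported


if __name__ == "__main__":
    for model, batch, ms, tokens, reported in ROWS:
        derived = prefill_tok_per_s(tokens, ms)
        err = relative_error(derived, reported)
        print(f"{model:<26} BS={batch}  derived={derived:>9,.0f}  "
              f"reported={reported:>7,}  err={err:.2%}")
```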

Vision Language Model — Vanilla Decoding#

| Model | LLM Prec | ViT Prec | Batch | Prefill (ms) | Prefill Tokens | Prefill (tok/s) | ViT Time (ms) | ViT Tok/Run | ViT (tok/s) | Generation (tok/s) | GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | NVFP4 | FP16 | 1 | 31.0 | 376 | 12,124 | 27.8 | 344 | 12,392 | 49.5 | 5,224 |
| Qwen3.5-0.8B | NVFP4 | FP16 | 1 | 9.4 | 287 | 30,410 | 4.3 | 262 | 60,606 | 287.0 | 1,192 |
| Qwen3.5-2B | NVFP4 | FP16 | 1 | 12.7 | 287 | 22,550 | 10.5 | 262 | 25,000 | 164.4 | 1,694 |
| Qwen3.5-27B | NVFP4 | FP16 | 1 | 103.3 | 287 | 2,775 | 15.0 | 262 | 17,483 | 16.1 | 14,725 |
| Nemotron-3-Nano-Omni-30B-A3B | NVFP4 | FP16 | 1 | 226.0 | 1,663 | 7,358 | 121.3 | 1,635 | 13,477 | 31.3 | 20,327 |

LLM — EAGLE Speculative Decoding#

Draft Models#

| Base Model | Draft Model | Source |
| --- | --- | --- |
| Qwen3-1.7B | Qwen3-1.7B_eagle3 | AngelSlim/Qwen3-1.7B_eagle3 |

Note: Both base and draft models are quantized to NVFP4.

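| Model | Base Prec | Draft Prec | Batch | Prefill (ms) | Prefill Tokens | Generation (tok/s) | Accept Rate | Speedup | GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B | NVFP4 | NVFP4 | 1 | 12.2 | 370 | 339.0 | 3.7 | 1.96x | 1,534 |
| Qwen3-1.7B | NVFP4 | NVFP4 | 8 | 132.3 | 2,959 | 984.9 | 3.7 | 1.05x | 1,466 |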

Vision Language Model — MTP Speculative Decoding#

Note: MTP uses the model’s built-in draft heads; no external draft checkpoint is required. Highlight: MTP is the main v0.7.1 performance improvement, increasing Qwen3.5 VLM BS=1 generation throughput by 1.21x to 2.12x over vanilla decoding.

| Model | Base Prec | Draft Prec | ViT Prec | Batch | Prefill (ms) | Prefill Tokens | ViT Time (ms) | ViT Tok/Run | ViT (tok/s) | Generation (tok/s) | Accept Rate | GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-0.8B | NVFP4 | NVFP4 | FP16 | 1 | 9.2 | 287 | 4.2 | 263 | 62,500 | 348.5 | 2.1 | 1,210 |
| Qwen3.5-0.8B | NVFP4 | NVFP4 | FP16 | 8 | 69.0 | 2,227 | 27.6 | 2,042 | 74,074 | 1,056.7 | 2.2 | 1,375 |
| Qwen3.5-2B | NVFP4 | NVFP4 | FP16 | 1 | 13.8 | 287 | 10.9 | 262 | 24,096 | 236.9 | 2.4 | 1,662 |
| Qwen3.5-2B | NVFP4 | NVFP4 | FP16 | 8 | 89.7 | 2,227 | 75.1 | 2,040 | 27,174 | 787.2 | 2.4 | 1,647 |
| Qwen3.5-27B | NVFP4 | NVFP4 | FP16 | 1 | 111.1 | 287 | 14.4 | 262 | 18,215 | 34.2 | 2.8 | 14,680 |
| Qwen3.5-27B | NVFP4 | NVFP4 | FP16 | 8 | 811.0 | 2,227 | 108.2 | 2,038 | 18,832 | 146.7 | 2.8 | 14,705 |
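The 1.21x to 2.12x MTP gains can be recomputed by dividing the BS=1 MTP generation throughput by the BS=1 vanilla throughput from the v0.7.1 VLM vanilla table. The sketch below also estimates how long one draft-plus-verify step takes relative to one vanilla decode step. That second figure rests on an assumption not stated in this page: that the Accept Rate equals the tokens emitted per verify step, while a vanilla step emits exactly one token. Accept rates are rounded to one decimal here, so treat the step-cost figures as rough.

```python
# BS=1 vanilla generation throughput (tok/s) from the v0.7.1 VLM vanilla table.
VANILLA = {"Qwen3.5-0.8B": 287.0, "Qwen3.5-2B": 164.4, "Qwen3.5-27B": 16.1}

# BS=1 (MTP generation tok/s, accept rate) from the v0.7.1 MTP table.
MTP = {
    "Qwen3.5-0.8B": (348.5, 2.1),
    "Qwen3.5-2B": (236.9, 2.4),
    "Qwen3.5-27B": (34.2, 2.8),
}


def speedup(spec_tok_s: float, vanilla_tok_s: float) -> float:
    """Speculative generation throughput divided by vanilla generation throughput."""
    return spec_tok_s / vanilla_tok_s


def relative_step_cost(accept_rate: float, gain: float) -> float:
    """Duration of one draft+verify step relative to one vanilla decode step.

    Assumption: each verify step emits `accept_rate` tokens and each vanilla
    step emits one, so the throughput ratio equals accept_rate / step-time ratio.
    """
    return accept_rate / gain


if __name__ == "__main__":
    for model, (mtp_tok_s, accept) in MTP.items():
        gain = speedup(mtp_tok_s, VANILLA[model])
        cost = relative_step_cost(accept, gain)
        print(f"{model:<13} speedup={gain:.2f}x  step cost={cost:.2f}x a vanilla step")
```

Under this assumption the 27B model pays the smallest relative overhead per verify step, which is consistent with it showing the largest MTP speedup.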


v0.7.0 Results#

SDK Version: TensorRT Edge-LLM 0.7.0  |  TensorRT: 10.13

LLM — Vanilla Decoding#

| Model | Precision | Batch | Prefill (ms) | Prefill Tokens | Prefill (tok/s) | Generation (tok/s) | GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B | NVFP4 | 1 | 13.9 | 370 | 26,683 | 170.4 | 1,453 |
| Qwen3-1.7B | NVFP4 | 8 | 150.5 | 2,959 | 19,663 | 798.8 | 1,491 |
| Qwen3-30B-A3B-GPTQ-Int4 | INT4 GPTQ | 1 | 125.3 | 370 | 2,951 | 81.3 | 15,938 |
| Qwen3-30B-A3B-GPTQ-Int4 | INT4 GPTQ | 8 | 1,342.2 | 2,959 | 2,204 | 223.2 | 15,961 |
| Nemotron-3-Nano-4B | NVFP4 | 1 | 126.8 | 383 | 3,018 | 65.4 | 3,647 |
| Nemotron-3-Nano-4B | NVFP4 | 8 | 1,017.6 | 3,062 | 3,009 | 315.4 | 3,684 |

Vision Language Model — Vanilla Decoding#

| Model | LLM Prec | ViT Prec | Prefill (ms) | Prefill Tokens | Prefill (tok/s) | Generation (tok/s) | GPU Mem (MB) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5-0.8B | NVFP4 | FP16 | 7.0 | 753 | 107,571 | 232.2 | 1,052 |
| Qwen3.5-2B | NVFP4 | FP16 | 13.8 | 753 | 54,565 | 111.0 | 1,671 |
| Qwen3.5-27B | NVFP4 | FP16 | 122.6 | 753 | 6,143 | 10.5 | 14,985 |
| Nemotron-3-Nano-Omni-30B-A3B | NVFP4 | FP16 | 846.7 | 1,663 | 1,964 | 24.5 | 20,267 |

LLM — EAGLE Speculative Decoding#

Draft Models#

| Base Model | Draft Model | Source |
| --- | --- | --- |
| Qwen3-1.7B | Qwen3-1.7B_eagle3 | AngelSlim/Qwen3-1.7B_eagle3 |

Note: Both base and draft models are quantized to NVFP4.

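| Model | Base Prec | Draft Prec | Batch | Prefill (ms) | Prefill Tokens | Generation (tok/s) | Accept Rate | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-1.7B | NVFP4 | NVFP4 | 1 | 14.5 | 370 | 312.4 | 3.75 | 1.83x |
| Qwen3-1.7B | NVFP4 | NVFP4 | 8 | 153.5 | 2,959 | 828.8 | 3.73 | 1.04x |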


v0.4.0 Results#

SDK Version: TensorRT Edge-LLM 0.4.0  |  TensorRT: 10.13

LLM — Vanilla Decoding#

| Model | Precision | Batch | Prefill (ms) | Prefill Tokens | Prefill (tok/s) | Generation (tok/s) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | INT4 AWQ | 1 | 215.5 | 383 | 1,777 | 50.8 |
| Llama-3.1-8B-Instruct | INT4 AWQ | 8 | 2,737.4 | 3,064 | 1,119 | 135.3 |
| Llama-3.1-8B-Instruct | NVFP4 | 1 | 31.0 | 383 | 12,355 | 54.9 |
| Llama-3.1-8B-Instruct | NVFP4 | 8 | 387.6 | 3,064 | 7,905 | 308.7 |
| Qwen3-0.6B | INT4 AWQ | 1 | 21.0 | 366 | 17,429 | 270.2 |
| Qwen3-0.6B | INT4 AWQ | 8 | 241.8 | 2,927 | 12,104 | 828.0 |
| Qwen3-0.6B | NVFP4 | 1 | 8.8 | 366 | 41,591 | 318.6 |
| Qwen3-0.6B | NVFP4 | 8 | 95.4 | 2,927 | 30,681 | 1,562.4 |
| Qwen3-4B-Instruct-2507 | INT4 AWQ | 1 | 116.2 | 364 | 3,133 | 76.4 |
| Qwen3-4B-Instruct-2507 | INT4 AWQ | 8 | 1,502.3 | 2,911 | 1,938 | 240.3 |
| Qwen3-4B-Instruct-2507 | NVFP4 | 1 | 22.9 | 364 | 15,895 | 90.2 |
| Qwen3-4B-Instruct-2507 | NVFP4 | 8 | 301.9 | 2,911 | 9,642 | 507.4 |
| Qwen3-8B | INT4 AWQ | 1 | 212.0 | 366 | 1,726 | 47.7 |
| Qwen3-8B | INT4 AWQ | 8 | 2,719.1 | 2,927 | 1,076 | 162.3 |
| Qwen3-8B | NVFP4 | 1 | 32.8 | 366 | 11,159 | 53.7 |
| Qwen3-8B | NVFP4 | 8 | 425.8 | 2,927 | 6,874 | 372.2 |

Vision Language Model — Vanilla Decoding#

| Model | LLM Prec | ViT Prec | Prefill (ms) | Prefill Tokens | Prefill (tok/s) | ViT Time (ms) | ViT Tok/Run | ViT (tok/s) | Generation (tok/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | INT4 AWQ | FP16 | 195.1 | 376 | 1,927 | 51.1 | 344 | 6,732 | 53.1 |
| Qwen2.5-VL-7B-Instruct | INT4 AWQ | FP8 | 195.1 | 376 | 1,927 | 42.7 | 344 | 8,056 | 53.1 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | FP16 | 25.7 | 376 | 14,631 | 51.0 | 344 | 6,745 | 57.7 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | FP8 | 25.7 | 376 | 14,631 | 42.6 | 344 | 8,075 | 57.6 |
| Qwen3-VL-2B-Instruct | INT4 AWQ | FP16 | 39.4 | 283 | 7,183 | 19.0 | 262 | 13,789 | 144.4 |
| Qwen3-VL-2B-Instruct | INT4 AWQ | FP8 | 39.4 | 283 | 7,183 | 15.4 | 262 | 17,013 | 144.7 |
| Qwen3-VL-2B-Instruct | NVFP4 | FP16 | 10.1 | 283 | 28,020 | 19.0 | 262 | 13,789 | 180.8 |
| Qwen3-VL-2B-Instruct | NVFP4 | FP8 | 10.1 | 283 | 28,020 | 15.5 | 262 | 16,903 | 181.0 |

Note: ViT time = per-token ViT latency × image tokens per run. Switching the ViT from FP16 to FP8 reduces visual encoder time by about 16–19% with negligible impact on generation throughput.
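A minimal sketch of how the ViT columns relate: ViT throughput is image tokens per run divided by ViT time, and the FP8 saving is one minus the FP8/FP16 time ratio. It uses only the v0.4.0 VLM rows above and assumes ViT time is the total encoder time per run, as defined under Definitions.

```python
# v0.4.0 VLM rows: (model, LLM precision) -> (FP16 ViT ms, FP8 ViT ms, image tokens per run)
VIT_ROWS = {
    ("Qwen2.5-VL-7B-Instruct", "INT4 AWQ"): (51.1, 42.7, 344),
    ("Qwen2.5-VL-7B-Instruct", "NVFP4"): (51.0, 42.6, 344),
    ("Qwen3-VL-2B-Instruct", "INT4 AWQ"): (19.0, 15.4, 262),
    ("Qwen3-VL-2B-Instruct", "NVFP4"): (19.0, 15.5, 262),
}


def vit_tok_per_s(image_tokens: int, vit_ms: float) -> float:
    """Image tokens encoded per second; ViT time is given in milliseconds."""
    return image_tokens / (vit_ms / 1_000.0)


def time_reduction(fp16_ms: float, fp8_ms: float) -> float:
    """Fractional reduction in ViT time when switching the encoder from FP16 to FP8."""
    return 1.0 - fp8_ms / fp16_ms


if __name__ == "__main__":
    for (model, llm_prec), (fp16_ms, fp8_ms, tokens) in VIT_ROWS.items():
        print(
            f"{model:<24} {llm_prec:<9}"
            f" FP16 {vit_tok_per_s(tokens, fp16_ms):>7,.0f} tok/s"
            f" FP8 {vit_tok_per_s(tokens, fp8_ms):>7,.0f} tok/s"
            f" reduction {time_reduction(fp16_ms, fp8_ms):.1%}"
        )
```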

LLM — EAGLE Speculative Decoding#

Draft Models#

| Base Model | Draft Model | Source |
| --- | --- | --- |
| Llama-3.1-8B-Instruct | EAGLE3-LLaMA3.1-Instruct-8B | yuhuili/EAGLE3-LLaMA3.1-Instruct-8B |
| Qwen3-8B | qwen3_8b_eagle3 | Tengyunw/qwen3_8b_eagle3 |

Note: Both base and draft models are quantized to the same precision (INT4 AWQ or NVFP4) as listed in the table below.

| Model | Base Prec | Draft Prec | Batch | Prefill (ms) | Prefill Tokens | Generation (tok/s) | Accept Rate | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B-Instruct | INT4 AWQ | INT4 AWQ | 1 | 215.2 | 382 | 81.0 | 5.25 | 1.59x |
| Llama-3.1-8B-Instruct | INT4 AWQ | INT4 AWQ | 8 | 2,735.5 | 3,056 | 118.0 | 5.21 | 0.87x |
| Llama-3.1-8B-Instruct | NVFP4 | NVFP4 | 1 | 30.8 | 382 | 189.2 | 5.21 | 3.45x |
| Llama-3.1-8B-Instruct | NVFP4 | NVFP4 | 8 | 413.1 | 3,056 | 484.7 | 5.15 | 1.57x |
| Qwen3-8B | INT4 AWQ | INT4 AWQ | 1 | 212.2 | 366 | 66.1 | 4.36 | 1.39x |
| Qwen3-8B | INT4 AWQ | INT4 AWQ | 8 | 2,719.1 | 2,927 | 99.1 | 4.31 | 0.61x |
| Qwen3-8B | NVFP4 | NVFP4 | 1 | 33.1 | 366 | 151.7 | 4.26 | 2.82x |
| Qwen3-8B | NVFP4 | NVFP4 | 8 | 429.1 | 2,927 | 457.7 | 4.25 | 1.23x |

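Note: EAGLE speculative decoding gives the largest speedup at BS=1, where decoding is memory-bandwidth-bound and verifying several draft tokens per step costs little extra. At BS=8, base model compute is already well utilized, so the extra verification work limits the speedup, and for INT4 AWQ reverses it. See Speculative Decoding for setup instructions.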
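The Speedup column follows the definition at the top of this page: speculative generation throughput divided by vanilla generation throughput for the same model, precision, and batch size. This short sketch pairs the v0.4.0 EAGLE rows with the matching v0.4.0 vanilla rows and reproduces the column, including the sub-1x INT4 AWQ results at BS=8.

```python
# v0.4.0 generation throughput (tok/s): (model, precision, batch) -> (vanilla, EAGLE)
ROWS = {
    ("Llama-3.1-8B-Instruct", "INT4 AWQ", 1): (50.8, 81.0),
    ("Llama-3.1-8B-Instruct", "INT4 AWQ", 8): (135.3, 118.0),
    ("Llama-3.1-8B-Instruct", "NVFP4", 1): (54.9, 189.2),
    ("Llama-3.1-8B-Instruct", "NVFP4", 8): (308.7, 484.7),
    ("Qwen3-8B", "INT4 AWQ", 1): (47.7, 66.1),
    ("Qwen3-8B", "INT4 AWQ", 8): (162.3, 99.1),
    ("Qwen3-8B", "NVFP4", 1): (53.7, 151.7),
    ("Qwen3-8B", "NVFP4", 8): (372.2, 457.7),
}


def speedup(vanilla_tok_s: float, eagle_tok_s: float) -> float:
    """EAGLE generation throughput / vanilla generation throughput."""
    return eagle_tok_s / vanilla_tok_s


if __name__ == "__main__":
    for (model, prec, batch), (vanilla, eagle) in ROWS.items():
        s = speedup(vanilla, eagle)
        verdict = "faster" if s > 1.0 else "slower"
        print(f"{model:<22} {prec:<8} BS={batch}  {s:.2f}x ({verdict} than vanilla)")
```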

Vision Language Model — EAGLE Speculative Decoding#

Draft Models#

| Base Model | Draft Model | Source |
| --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | qwen2.5-vl-7b-eagle3-sgl | Rayzl/qwen2.5-vl-7b-eagle3-sgl |

Note: Both base and draft models are quantized to the same precision as listed in the table below.

| Model | Base Prec | Draft Prec | ViT Prec | Prefill (ms) | Prefill Tokens | Generation (tok/s) | Accept Rate | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-VL-7B-Instruct | INT4 AWQ | INT4 AWQ | FP16 | 195.1 | 376 | 57.3 | 3.66 | 1.08x |
| Qwen2.5-VL-7B-Instruct | NVFP4 | NVFP4 | FP16 | 25.8 | 376 | 149.6 | 3.82 | 2.59x |
| Qwen2.5-VL-7B-Instruct | NVFP4 | NVFP4 | FP8 | 32.8 | 376 | 117.3 | 3.76 | 2.04x |


Key Observations#

v0.8.0#

  • All release devices are covered: v0.8.0 adds results for Jetson AGX Thor, Jetson AGX Orin 64GB, Jetson Orin NX 16GB, and Jetson Orin Nano 8GB, all taken from the release benchmark outputs under JetPack 7.2.

  • First llm_bench prefill publication: the AGX Thor results include llm_bench prefill measurements at inputLen=2048. Qwen3-0.6B NVFP4 reaches 91,469.3 tok/s at BS=1; Qwen3-1.7B EAGLE NVFP4 reaches 66,929.4 tok/s at BS=1.

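  • Speculative decoding remains platform-dependent: EAGLE and MTP report strong acceptance rates, but generation throughput depends heavily on model size, precision, and platform memory bandwidth. On Jetson Orin Nano 8GB, for example, Qwen3.5-0.8B with MTP generates 46.9 tok/s versus 54.4 tok/s with vanilla decoding, despite an acceptance rate of 2.16.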
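Applying the same speedup definition (speculative generation throughput divided by vanilla generation throughput) to the Jetson Orin Nano 8GB table shows the platform dependence concretely. This is a minimal sketch over the two model/precision pairs in that table that have both a vanilla and a speculative row at BS=1.

```python
# Jetson Orin Nano 8GB (v0.8.0), BS=1:
# label -> (vanilla tok/s, speculative tok/s, accept rate)
ORIN_NANO_BS1 = {
    "Qwen3-1.7B (LLM, EAGLE)": (36.3, 48.7, 3.87),
    "Qwen3.5-0.8B (VLM, MTP)": (54.4, 46.9, 2.16),
}


def speedup(vanilla_tok_s: float, spec_tok_s: float) -> float:
    """Speculative generation throughput / vanilla generation throughput."""
    return spec_tok_s / vanilla_tok_s


if __name__ == "__main__":
    for label, (vanilla, spec, accept) in ORIN_NANO_BS1.items():
        s = speedup(vanilla, spec)
        print(f"{label:<24} accept={accept:.2f}  speedup={s:.2f}x")
```

A speedup below 1x means each draft-and-verify step costs more than the tokens it saves on this device.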

v0.7.1#

  • MTP speculative decoding: This is the v0.7.1 performance highlight. Qwen3.5 MTP improves BS=1 generation throughput by 1.21x for 0.8B, 1.44x for 2B, and 2.12x for 27B over vanilla decoding, with BS=8 throughput up to 1,056.7 tok/s for Qwen3.5-0.8B.

v0.7.0#

  • MoE support: Qwen3-30B-A3B-GPTQ-Int4 (MoE, about 3B active parameters out of 30B total) achieves 81.3 tok/s at BS=1 and 223.2 tok/s at BS=8 with INT4 GPTQ, demonstrating efficient sparse model inference on edge devices.

  • Small model throughput: Qwen3-1.7B with NVFP4 delivers 170.4 tok/s at BS=1 and 798.8 tok/s at BS=8, suitable for latency-sensitive edge applications.

  • Qwen3.5 VLM family: Generation throughput ranges from 232.2 tok/s (0.8B) to 10.5 tok/s (27B), providing a scalable VLM option across memory and throughput budgets.

  • Nemotron-3-Nano-Omni-30B-A3B: The first audio+video multimodal model benchmarked, achieving 24.5 tok/s generation with about 20 GB (20,267 MB) of peak GPU memory.
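A back-of-envelope estimate helps explain the MoE memory numbers: all experts must stay resident, so weight memory scales with total parameters, while each generated token only reads the active experts' weights. The sketch below is hypothetical arithmetic, not a measured breakdown. It assumes 4 bits per weight, ignores GPTQ group scales, zero-points, and any unquantized layers, and reads the GPU Mem column as 10^6 bytes.

```python
# Hypothetical weight-footprint estimate for Qwen3-30B-A3B-GPTQ-Int4.
# All constants are assumptions taken from the prose above, not a profiled breakdown.
TOTAL_PARAMS = 30e9        # "30B" total parameters; every expert stays in memory
ACTIVE_PARAMS = 3e9        # "3B active" parameters used per generated token
BITS_PER_WEIGHT = 4        # INT4 GPTQ; ignores group scales, zero-points, FP16 layers
MEASURED_PEAK_MB = 15_938  # v0.7.0 BS=1 GPU Mem column, read as 10**6 bytes


def weight_gb(params: float, bits_per_weight: int) -> float:
    """Weight storage in GB (10**9 bytes) for `params` weights at `bits_per_weight` bits."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    resident = weight_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
    per_token = weight_gb(ACTIVE_PARAMS, BITS_PER_WEIGHT)
    measured = MEASURED_PEAK_MB / 1e3
    print(f"resident INT4 weights   ~{resident:.1f} GB")
    print(f"weights read per token  ~{per_token:.1f} GB")
    print(f"measured peak GPU mem    {measured:.1f} GB (weights, KV cache, activations, runtime)")
```

Under these assumptions roughly 15 GB of the ~15.9 GB peak is weights, while per-token weight traffic is closer to that of a 3B dense model.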

v0.4.0#

  • NVFP4 delivers highest throughput: NVFP4 achieves 1.1–2.3x the generation throughput of INT4 AWQ, with the larger gains at BS=8, and substantially faster prefill (e.g., 31 ms vs 216 ms for Llama-3.1-8B at BS=1).

  • EAGLE at BS=1 provides meaningful speedup: 1.4–3.5x for LLMs, best for Llama-3.1-8B NVFP4 (3.45x). The draft model acceptance rate is high for Llama (~5.2 tokens/step) and moderate for Qwen3-8B (~4.3 tokens/step).

  • EAGLE at BS=8 has limited benefit: At high batch sizes, base model compute is already well-utilized. Speedup drops to <1x for INT4 AWQ and 1.2–1.6x for NVFP4.

  • Qwen3-0.6B achieves the highest throughput: 1,562.4 tok/s at BS=8 with NVFP4. It is also the fastest model at BS=1 (318.6 tok/s), making this lightweight model well suited to latency-sensitive edge applications.
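The 1.1–2.3x NVFP4 versus INT4 AWQ range can be recomputed from the v0.4.0 LLM vanilla table. This sketch covers the LLM rows only (the VLM rows are left out) and splits the ratios by batch size, which shows that the BS=1 gains are modest while the BS=8 gains are roughly 2x.

```python
# v0.4.0 LLM vanilla generation throughput (tok/s): model -> {batch: (INT4 AWQ, NVFP4)}
GEN_TOK_S = {
    "Llama-3.1-8B-Instruct": {1: (50.8, 54.9), 8: (135.3, 308.7)},
    "Qwen3-0.6B": {1: (270.2, 318.6), 8: (828.0, 1_562.4)},
    "Qwen3-4B-Instruct-2507": {1: (76.4, 90.2), 8: (240.3, 507.4)},
    "Qwen3-8B": {1: (47.7, 53.7), 8: (162.3, 372.2)},
}


def nvfp4_ratios(batch: int) -> list[float]:
    """NVFP4 / INT4 AWQ generation throughput for every model at one batch size."""
    ratios = []
    for per_batch in GEN_TOK_S.values():
        int4_tok_s, nvfp4_tok_s = per_batch[batch]
        ratios.append(nvfp4_tok_s / int4_tok_s)
    return ratios


if __name__ == "__main__":
    for batch in (1, 8):
        r = nvfp4_ratios(batch)
        print(f"BS={batch}: {min(r):.2f}x to {max(r):.2f}x")
    every = nvfp4_ratios(1) + nvfp4_ratios(8)
    print(f"overall: {min(every):.1f}x to {max(every):.1f}x")
```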

General#

  • Benchmarks use default TensorRT Edge-LLM inference settings on the listed device. Production performance may vary with system-level tuning (power mode, memory configuration, thermal management).