Vocabulary Reduction#

Overview#

Vocabulary reduction optimizes LM head computation by reducing the vocabulary size to a subset of relevant tokens. This feature only speeds up the LM head layer — the rest of the model remains unchanged.

Important: The user is responsible for creating the vocabulary mapping. Since the optimal reduced vocabulary is directly tied to your specific task and output distribution, we only provide a simple reference script (tensorrt-edgellm-reduce-vocab). You should customize the vocabulary selection based on your use case.


Quick Start#

End-to-End Workflow (Standard Decoding)#

Using Qwen3-0.6B as an example — smaller models benefit most from vocabulary reduction since LM head represents a larger fraction of total compute:

# Step 1: Generate vocabulary mapping
tensorrt-edgellm-reduce-vocab \
  --model_dir Qwen/Qwen3-0.6B \
  --output_dir reduced_vocab \
  --reduced_vocab_size 16384 \
  --method input_aware \
  --max_samples 50000

# Step 2: Export model with reduced vocabulary
tensorrt-edgellm-export-llm \
  --model_dir Qwen/Qwen3-0.6B \
  --output_dir llm_onnx \
  --reduced_vocab_dir reduced_vocab/

# Step 3: Build TensorRT engine (unchanged)
./build/examples/llm/llm_build \
  --onnxDir llm_onnx \
  --engineDir engines/qwen3-0.6b \
  --maxBatchSize 1

# Step 4: Run inference (unchanged)
./build/examples/llm/llm_inference \
  --engineDir engines/qwen3-0.6b \
  --inputFile input.json \
  --outputFile output.json

The runtime automatically applies vocabulary reduction when vocab_map.safetensors is present in the engine directory.


EAGLE Speculative Decoding Support#

When using vocabulary reduction for EAGLE base models, you must include all tokens referenced in the draft’s d2t.safetensors mapping.

Prerequisite: Export the draft model first using tensorrt-edgellm-export-draft to generate d2t.safetensors, then use the --d2t_path flag:

# Step 0: Export EAGLE draft model first (generates d2t.safetensors)
tensorrt-edgellm-export-draft \
  --draft_model_dir EAGLE3-Qwen3-4B-Instruct-2507 \
  --base_model_dir Qwen/Qwen3-4B-Instruct-2507 \
  --output_dir draft_onnx

# Step 1: Generate vocabulary mapping with d2t constraint
tensorrt-edgellm-reduce-vocab \
  --model_dir Qwen/Qwen3-4B-Instruct-2507 \
  --output_dir reduced_vocab \
  --reduced_vocab_size 16384 \
  --method input_aware \
  --d2t_path draft_onnx/d2t.safetensors

# Step 2: Export base model with reduced vocabulary
tensorrt-edgellm-export-llm \
  --model_dir Qwen/Qwen3-4B-Instruct-2507 \
  --output_dir llm_onnx \
  --reduced_vocab_dir reduced_vocab/ \
  --is_eagle_base

# Step 3-4: Build and run as usual...

This ensures all draft-to-target token mappings remain valid after vocabulary reduction.


Script Reference#

Argument

Required

Default

Description

--model_dir

Yes

-

Path to model directory (tokenizer + config)

--output_dir

Yes

-

Output directory for vocabulary files

--reduced_vocab_size

Yes

-

Target vocabulary size

--method

No

input_aware

input_aware or frequency

--max_samples

No

50000

Max samples from dataset

--d2t_path

No

-

EAGLE d2t.safetensors path

Methods:

  • input_aware: Analyzes CNN/DailyMail summaries + documents, applies input-aware filtering

  • frequency: Simple token frequency analysis on input documents


Custom Vocabulary Mapping#

The provided script uses CNN/DailyMail as a reference dataset. For production use, create your own files with the following formats:

vocab_map.safetensors#

Key

Type

Shape

Description

vocab_map

int32

(reduced_vocab_size,)

Sorted original token IDs to keep (must include EOS token)

import torch
from safetensors.torch import save_file

# Select original token IDs to keep (must be sorted, must include EOS)
selected_tokens = [0, 1, 2, 100, 101, 500, 1000, 1001, ...]
vocab_map = torch.tensor(sorted(selected_tokens), dtype=torch.int32)

save_file({"vocab_map": vocab_map}, "vocab_map.safetensors")

reduced_vocab.json#

{
  "vocab_size": 151936,
  "reduced_vocab_size": 16384
}

Notes#

  • Vocabulary reduction only affects LM head computation; other layers are unchanged

  • The vocab_map.safetensors is automatically copied to the engine directory during export

  • Runtime transparently handles the mapping — no inference code changes required

  • Ensure your reduced vocabulary covers all tokens your model needs to generate