Vocabulary Reduction

Overview

Vocabulary reduction speeds up LM head computation by restricting the output vocabulary to a subset of relevant tokens. Only the LM head layer is affected; the rest of the model is unchanged.
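The idea can be illustrated with a minimal sketch that uses plain Python lists instead of tensors. It assumes the LM head is a matrix with one row per token and that reduction keeps only the rows listed in the mapping. The function names are hypothetical and this is not the actual runtime implementation.

```python
# Toy illustration of a reduced LM head. All names are hypothetical.

def matvec(weight_rows, hidden):
    """Return one logit per weight row (dot product with the hidden state)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight_rows]


def reduce_lm_head(full_weight, vocab_map):
    """Keep only the LM head rows whose original token ID is in vocab_map."""
    return [full_weight[token_id] for token_id in vocab_map]


def greedy_token(reduced_weight, vocab_map, hidden):
    """Pick the best reduced index, then map it back to the original token ID."""
    logits = matvec(reduced_weight, hidden)
    best_reduced_index = max(range(len(logits)), key=logits.__getitem__)
    return vocab_map[best_reduced_index]


# Toy setup: 6-token vocabulary, hidden size 2, keep original tokens 1, 3 and 4.
full_weight = [[0.1, 0.0], [0.5, 0.2], [0.9, 0.9], [0.3, 0.8], [0.0, 0.4], [0.2, 0.2]]
vocab_map = [1, 3, 4]
reduced_weight = reduce_lm_head(full_weight, vocab_map)

hidden = [1.0, 1.0]
token = greedy_token(reduced_weight, vocab_map, hidden)
print("selected original token ID:", token)
```

In this toy setup the full LM head would have scored token 2 highest, but token 2 is not in the mapping, so it can never be produced. This is why the choice of kept tokens matters for output quality.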

Important: You are responsible for creating the vocabulary mapping. Because the optimal reduced vocabulary depends on your specific task and output distribution, we provide only a simple reference script (tensorrt-edgellm-reduce-vocab). Customize the vocabulary selection for your use case.

Quick Start

End-to-End Workflow (Standard Decoding)

The following example uses Qwen3-0.6B. Smaller models benefit most from vocabulary reduction because the LM head represents a larger fraction of their total compute:

# Step 1: Generate vocabulary mapping
tensorrt-edgellm-reduce-vocab \
  --model_dir Qwen/Qwen3-0.6B \
  --output_dir reduced_vocab \
  --reduced_vocab_size 16384 \
  --method input_aware \
  --max_samples 50000

# Step 2: Export model with reduced vocabulary
tensorrt-edgellm-export-llm \
  --model_dir Qwen/Qwen3-0.6B \
  --output_dir llm_onnx \
  --reduced_vocab_dir reduced_vocab/

# Step 3: Build TensorRT engine (unchanged)
./build/examples/llm/llm_build \
  --onnxDir llm_onnx \
  --engineDir engines/qwen3-0.6b \
  --maxBatchSize 1

# Step 4: Run inference (unchanged)
./build/examples/llm/llm_inference \
  --engineDir engines/qwen3-0.6b \
  --inputFile input.json \
  --outputFile output.json

The runtime automatically applies vocabulary reduction when vocab_map.safetensors is present in the engine directory.
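To get a rough sense of what the workflow above saves, the sketch below counts LM head weights before and after reduction. The vocabulary sizes are the ones used in this document; the hidden size of 1024 for Qwen3-0.6B is an assumption, so check it against the model's config.json. The ratio is only an upper bound on the LM head speedup, not a measured result.

```python
# Back-of-the-envelope LM head size, before and after vocabulary reduction.

def lm_head_weights(hidden_size, vocab_size):
    """Number of weights in a (vocab_size x hidden_size) LM head matrix."""
    return hidden_size * vocab_size


HIDDEN_SIZE = 1024       # assumption for Qwen3-0.6B; verify in config.json
FULL_VOCAB = 151936      # original vocabulary size used in this document
REDUCED_VOCAB = 16384    # value passed to --reduced_vocab_size above

full = lm_head_weights(HIDDEN_SIZE, FULL_VOCAB)
reduced = lm_head_weights(HIDDEN_SIZE, REDUCED_VOCAB)
ratio = full / reduced   # independent of hidden size

print(f"full LM head weights:    {full:,}")
print(f"reduced LM head weights: {reduced:,}")
print(f"reduction factor:        {ratio:.2f}x")
```

Because the ratio depends only on the two vocabulary sizes, the same factor applies to any model that shares this tokenizer. What changes between models is how much of the total per-token work the LM head accounts for.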

EAGLE Speculative Decoding Support

When using vocabulary reduction with an EAGLE base model, the reduced vocabulary must include every token referenced in the draft model's d2t.safetensors mapping.

Prerequisite: Export the draft model first using tensorrt-edgellm-export-draft to generate d2t.safetensors, then use the --d2t_path flag:

# Step 0: Export EAGLE draft model first (generates d2t.safetensors)
tensorrt-edgellm-export-draft \
  --draft_model_dir EAGLE3-Qwen3-4B-Instruct-2507 \
  --base_model_dir Qwen/Qwen3-4B-Instruct-2507 \
  --output_dir draft_onnx

# Step 1: Generate vocabulary mapping with d2t constraint
tensorrt-edgellm-reduce-vocab \
  --model_dir Qwen/Qwen3-4B-Instruct-2507 \
  --output_dir reduced_vocab \
  --reduced_vocab_size 16384 \
  --method input_aware \
  --d2t_path draft_onnx/d2t.safetensors

# Step 2: Export base model with reduced vocabulary
tensorrt-edgellm-export-llm \
  --model_dir Qwen/Qwen3-4B-Instruct-2507 \
  --output_dir llm_onnx \
  --reduced_vocab_dir reduced_vocab/ \
  --is_eagle_base

# Step 3-4: Build and run as usual...

This ensures all draft-to-target token mappings remain valid after vocabulary reduction.
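If you build the mapping yourself instead of using --d2t_path, the same constraint has to be enforced by hand. The sketch below shows one way to do that. It assumes the target token IDs referenced by the draft mapping have already been extracted into a plain list; how d2t.safetensors encodes them is not covered here. The function name and inputs are hypothetical.

```python
# Sketch: fill a reduced vocabulary while guaranteeing that required tokens stay in.

def build_vocab_with_required(ranked_candidates, required_ids, reduced_vocab_size):
    """Return sorted token IDs: all required IDs plus the best-ranked candidates.

    ranked_candidates: token IDs ordered from most to least useful.
    required_ids: token IDs that must be kept (for example, draft-referenced tokens).
    """
    required = set(required_ids)
    if len(required) > reduced_vocab_size:
        raise ValueError("required tokens alone exceed the target vocabulary size")

    selected = set(required)
    for token_id in ranked_candidates:
        if len(selected) >= reduced_vocab_size:
            break
        selected.add(token_id)  # no effect if the token is already required
    return sorted(selected)


# Hypothetical inputs for illustration only.
ranked = [5, 9, 2, 7, 1]
required = [9, 42]
vocab = build_vocab_with_required(ranked, required, reduced_vocab_size=4)
print(vocab)
```

Required tokens are inserted first so that they are never displaced by frequent but optional tokens when the budget is tight.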

Script Reference

Argument	Required	Default	Description
`--model_dir`	Yes	-	Path to model directory (tokenizer + config)
`--output_dir`	Yes	-	Output directory for vocabulary files
`--reduced_vocab_size`	Yes	-	Target vocabulary size
`--method`	No	`input_aware`	`input_aware` or `frequency`
`--max_samples`	No	`50000`	Max samples from dataset
`--d2t_path`	No	-	EAGLE d2t.safetensors path

Methods:

input_aware: Analyzes CNN/DailyMail summaries and documents, then applies input-aware filtering
frequency: Performs simple token frequency analysis on the input documents
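A frequency-based selection along the lines of the second method can be sketched as follows. This is an illustrative approximation under stated assumptions, not the reference script's code: it takes already tokenized sequences, counts token IDs, and keeps the most frequent ones while always retaining the EOS token. The function name and sample data are hypothetical.

```python
from collections import Counter


def select_by_frequency(token_id_sequences, reduced_vocab_size, eos_token_id):
    """Return sorted token IDs: EOS plus the most frequent tokens in the corpus."""
    counts = Counter()
    for sequence in token_id_sequences:
        counts.update(sequence)

    selected = {eos_token_id}  # EOS must always be present
    for token_id, _count in counts.most_common():
        if len(selected) >= reduced_vocab_size:
            break
        selected.add(token_id)
    return sorted(selected)


# Hypothetical tokenized corpus for illustration only.
corpus = [[5, 5, 7, 9], [5, 7, 3]]
kept = select_by_frequency(corpus, reduced_vocab_size=3, eos_token_id=2)
print(kept)
```

If the corpus contains fewer distinct tokens than the target size, the result is shorter than reduced_vocab_size; a real script would need a policy for padding or for reporting the smaller size.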

Custom Vocabulary Mapping

The provided script uses CNN/DailyMail as a reference dataset. For production use, create your own files with the following formats:

`vocab_map.safetensors`

Key	Type	Shape	Description
`vocab_map`	`int32`	`(reduced_vocab_size,)`	Sorted original token IDs to keep (must include EOS token)

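import torch
from safetensors.torch import save_file

# Original token IDs to keep. The list must include the EOS token;
# extend it with the IDs selected for your task.
selected_tokens = [0, 1, 2, 100, 101, 500, 1000, 1001]

# Deduplicate and sort before saving, since the map must hold sorted IDs
vocab_map = torch.tensor(sorted(set(selected_tokens)), dtype=torch.int32)

# The tensor must be stored under the key "vocab_map"
save_file({"vocab_map": vocab_map}, "vocab_map.safetensors")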
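Before saving a custom mapping, it can help to check it against the requirements in the table above. The following is a minimal sketch that works on a plain list of IDs. The helper name is hypothetical, and treating duplicates as an error is an assumption (the format only states that IDs must be sorted and must include EOS).

```python
def validate_vocab_map(vocab_map, vocab_size, eos_token_id):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    # Strictly increasing order implies both "sorted" and "no duplicates".
    if any(b <= a for a, b in zip(vocab_map, vocab_map[1:])):
        problems.append("IDs are not sorted in strictly increasing order")
    if any(t < 0 or t >= vocab_size for t in vocab_map):
        problems.append("an ID falls outside [0, vocab_size)")
    if eos_token_id not in vocab_map:
        problems.append("EOS token is missing")
    return problems


# Hypothetical values for illustration; use your tokenizer's real EOS ID.
good = validate_vocab_map([0, 1, 2, 100], vocab_size=151936, eos_token_id=2)
bad = validate_vocab_map([3, 1, 12], vocab_size=10, eos_token_id=2)
print(good)
print(bad)
```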

`reduced_vocab.json`

{
  "vocab_size": 151936,
  "reduced_vocab_size": 16384
}
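One way to keep this file consistent with the mapping is to derive reduced_vocab_size from the mapping itself instead of typing it by hand. The sketch below assumes that vocab_size is the original vocabulary size and that both files live in the same output directory. The helper name is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_reduced_vocab_json(output_dir, vocab_size, vocab_map):
    """Write reduced_vocab.json with sizes derived from the given mapping."""
    meta = {
        "vocab_size": vocab_size,             # original vocabulary size
        "reduced_vocab_size": len(vocab_map)  # number of token IDs kept
    }
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "reduced_vocab.json"
    path.write_text(json.dumps(meta, indent=2) + "\n")
    return path


# Usage example with a temporary directory and a small illustrative mapping.
vocab_map = [0, 1, 2, 100, 101, 500, 1000, 1001]
with tempfile.TemporaryDirectory() as tmp:
    path = write_reduced_vocab_json(tmp, 151936, vocab_map)
    written = json.loads(path.read_text())
print(written)
```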

Notes

Vocabulary reduction only affects LM head computation; other layers are unchanged
The vocab_map.safetensors file is automatically copied to the engine directory during export
The runtime handles the mapping transparently; no inference code changes are required
Ensure your reduced vocabulary covers all tokens your model needs to generate
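Coverage can be estimated before deployment by tokenizing a sample of expected outputs and measuring how many of their tokens survive the reduction. The sketch below assumes the reference outputs are already tokenized into lists of original token IDs. The function name and sample data are hypothetical.

```python
from collections import Counter


def vocab_coverage(vocab_map, reference_sequences):
    """Return (fraction of reference tokens kept, Counter of missing token IDs)."""
    kept = set(vocab_map)
    total = 0
    covered = 0
    missing = Counter()
    for sequence in reference_sequences:
        for token_id in sequence:
            total += 1
            if token_id in kept:
                covered += 1
            else:
                missing[token_id] += 1
    if total == 0:
        raise ValueError("reference_sequences contains no tokens")
    return covered / total, missing


# Hypothetical tokenized reference outputs for illustration only.
rate, missing = vocab_coverage([1, 2, 3], [[1, 2, 9], [3, 9, 4]])
print(f"coverage: {rate:.0%}")
print("most frequent missing tokens:", missing.most_common(2))
```

The most frequent missing tokens are natural candidates to add to the mapping, since any token outside the reduced vocabulary cannot be generated.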