Vocabulary Reduction#
Overview#
Vocabulary reduction speeds up LM head computation by restricting the output vocabulary to a subset of relevant tokens. Only the LM head layer is affected; the rest of the model remains unchanged.
Important: The user is responsible for creating the vocabulary mapping. Since the optimal reduced vocabulary is directly tied to your specific task and output distribution, we only provide a simple reference script (tensorrt-edgellm-reduce-vocab). You should customize the vocabulary selection based on your use case.
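For a rough sense of the potential saving, the cost of the LM head projection for one token scales with hidden size times vocabulary size. The sketch below works through that arithmetic for the vocabulary sizes used later in this guide. The hidden size of 1024 is an assumed value for Qwen3-0.6B (read it from the model config in practice), and `lm_head_macs` is an illustrative helper, not part of any tool described here.

```python
def lm_head_macs(hidden_size: int, vocab_size: int) -> int:
    """Multiply-accumulate count for one LM head projection of a single token."""
    return hidden_size * vocab_size


HIDDEN_SIZE = 1024      # assumption: hidden size of Qwen3-0.6B
FULL_VOCAB = 151936     # vocab_size used in this guide
REDUCED_VOCAB = 16384   # reduced_vocab_size used in this guide

full = lm_head_macs(HIDDEN_SIZE, FULL_VOCAB)
reduced = lm_head_macs(HIDDEN_SIZE, REDUCED_VOCAB)
ratio = full / reduced

print(f"full: {full:,} MACs, reduced: {reduced:,} MACs, ratio: {ratio:.2f}x")
```

The ratio is an upper bound on the speedup of the LM head alone. The end-to-end gain depends on how much of the total compute the LM head represents for a given model.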
Quick Start#
End-to-End Workflow (Standard Decoding)#
The following workflow uses Qwen3-0.6B as an example. Smaller models benefit most from vocabulary reduction because the LM head accounts for a larger fraction of their total compute:
# Step 1: Generate vocabulary mapping
tensorrt-edgellm-reduce-vocab \
--model_dir Qwen/Qwen3-0.6B \
--output_dir reduced_vocab \
--reduced_vocab_size 16384 \
--method input_aware \
--max_samples 50000
# Step 2: Export model with reduced vocabulary
tensorrt-edgellm-export-llm \
--model_dir Qwen/Qwen3-0.6B \
--output_dir llm_onnx \
--reduced_vocab_dir reduced_vocab/
# Step 3: Build TensorRT engine (unchanged)
./build/examples/llm/llm_build \
--onnxDir llm_onnx \
--engineDir engines/qwen3-0.6b \
--maxBatchSize 1
# Step 4: Run inference (unchanged)
./build/examples/llm/llm_inference \
--engineDir engines/qwen3-0.6b \
--inputFile input.json \
--outputFile output.json
The runtime automatically applies vocabulary reduction when vocab_map.safetensors is present in the engine directory.
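Conceptually, the reduced LM head produces one logit per kept token, and the selected index is translated back to an original token ID through the vocabulary map. The following is a minimal pure-Python sketch of that lookup under greedy selection. It is not the runtime's actual implementation; `to_original_id` and the toy values are hypothetical.

```python
def to_original_id(reduced_logits: list[float], vocab_map: list[int]) -> int:
    """Pick the highest reduced logit and map its index to an original token ID."""
    if len(reduced_logits) != len(vocab_map):
        raise ValueError("reduced_logits and vocab_map must have the same length")
    # Index into the reduced vocabulary (0 .. reduced_vocab_size - 1).
    reduced_id = max(range(len(reduced_logits)), key=reduced_logits.__getitem__)
    # vocab_map[i] holds the original token ID of reduced index i.
    return vocab_map[reduced_id]


# Toy example: 4 kept tokens out of a larger original vocabulary.
vocab_map = [0, 2, 100, 1001]            # sorted original token IDs
reduced_logits = [0.1, -1.3, 2.7, 0.4]   # one logit per kept token
original_id = to_original_id(reduced_logits, vocab_map)
print(original_id)
```

Because the lookup is a plain index into the map, a token that is not in the map can never be produced, which is why the map has to cover everything the model is expected to generate.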
EAGLE Speculative Decoding Support#
When using vocabulary reduction for EAGLE base models, you must include all tokens referenced in the draft’s d2t.safetensors mapping.
Prerequisite: Export the draft model first using tensorrt-edgellm-export-draft to generate d2t.safetensors, then use the --d2t_path flag:
# Step 0: Export EAGLE draft model first (generates d2t.safetensors)
tensorrt-edgellm-export-draft \
--draft_model_dir EAGLE3-Qwen3-4B-Instruct-2507 \
--base_model_dir Qwen/Qwen3-4B-Instruct-2507 \
--output_dir draft_onnx
# Step 1: Generate vocabulary mapping with d2t constraint
tensorrt-edgellm-reduce-vocab \
--model_dir Qwen/Qwen3-4B-Instruct-2507 \
--output_dir reduced_vocab \
--reduced_vocab_size 16384 \
--method input_aware \
--d2t_path draft_onnx/d2t.safetensors
# Step 2: Export base model with reduced vocabulary
tensorrt-edgellm-export-llm \
--model_dir Qwen/Qwen3-4B-Instruct-2507 \
--output_dir llm_onnx \
--reduced_vocab_dir reduced_vocab/ \
--is_eagle_base
# Step 3-4: Build and run as usual...
This ensures all draft-to-target token mappings remain valid after vocabulary reduction.
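If you build the vocabulary map yourself, a coverage check along these lines can confirm the constraint before export. This is a sketch under a stated assumption: it treats each d2t entry as an offset, so that draft token i maps to target token i + d2t[i]. Verify that convention against your draft model before relying on it. The helper names and the toy values are hypothetical, and plain lists stand in for the tensors loaded from d2t.safetensors and vocab_map.safetensors.

```python
def d2t_target_ids(d2t_offsets: list[int]) -> set[int]:
    """Target-vocabulary IDs reachable from the draft vocabulary.

    Assumption: entry i is an offset, so draft token i maps to
    target token i + d2t_offsets[i].
    """
    return {i + offset for i, offset in enumerate(d2t_offsets)}


def missing_d2t_tokens(vocab_map: list[int], d2t_offsets: list[int]) -> list[int]:
    """Return target IDs required by the draft model but absent from vocab_map."""
    return sorted(d2t_target_ids(d2t_offsets) - set(vocab_map))


# Toy data: draft IDs 0..3 map to target IDs 0, 5, 100, 1000.
d2t_offsets = [0, 4, 98, 997]
vocab_map = [0, 1, 2, 5, 100, 500]

missing = missing_d2t_tokens(vocab_map, d2t_offsets)
print(missing)
```

An empty result means every draft-to-target mapping lands on a kept token. Any IDs that are reported would need to be added to the selection before the map is saved.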
Script Reference#
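| Argument | Required | Default | Description |
|---|---|---|---|
| --model_dir | Yes | - | Path to model directory (tokenizer + config) |
| --output_dir | Yes | - | Output directory for vocabulary files |
| --reduced_vocab_size | Yes | - | Target vocabulary size |
| --method | No | | Selection method: input_aware or frequency (see Methods) |
| --max_samples | No | | Max samples from dataset |
| --d2t_path | No | - | EAGLE d2t.safetensors path |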
Methods:
input_aware: Analyzes CNN/DailyMail summaries + documents, applies input-aware filtering
frequency: Simple token frequency analysis on input documents
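As a starting point for a custom selection, frequency-based selection can be sketched in a few lines of pure Python: count how often each token ID occurs in a tokenized corpus, keep the most frequent ones, and force-include any required IDs such as EOS. This is an illustrative sketch, not the reference script's implementation; `select_by_frequency` and its parameters are hypothetical names.

```python
from collections import Counter
from typing import Iterable


def select_by_frequency(
    token_ids: Iterable[int],
    reduced_vocab_size: int,
    required_ids: Iterable[int] = (),
) -> list[int]:
    """Return up to reduced_vocab_size token IDs, sorted ascending.

    Required IDs (for example EOS) are always kept; the remaining slots
    go to the most frequent token IDs in the corpus.
    """
    selected = set(required_ids)
    if len(selected) > reduced_vocab_size:
        raise ValueError("more required IDs than reduced_vocab_size")
    counts = Counter(token_ids)
    for token_id, _count in counts.most_common():
        if len(selected) >= reduced_vocab_size:
            break
        selected.add(token_id)
    return sorted(selected)


# Toy corpus of token IDs; ID 2 plays the role of EOS here.
corpus = [5, 5, 5, 9, 9, 3, 7, 7, 7, 7]
kept = select_by_frequency(corpus, reduced_vocab_size=3, required_ids=[2])
print(kept)
```

If the corpus contains fewer distinct tokens than requested, the result is shorter than reduced_vocab_size, so a real pipeline would need a policy for filling or shrinking the target size.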
Custom Vocabulary Mapping#
The provided script uses CNN/DailyMail as a reference dataset. For production use, create your own files with the following formats:
vocab_map.safetensors#
| Key | Type | Shape | Description |
|---|---|---|---|
| vocab_map | int32 | (reduced_vocab_size,) | Sorted original token IDs to keep (must include EOS token) |
import torch
from safetensors.torch import save_file

# Original token IDs to keep; extend this list with the IDs your task needs.
# The list must include the EOS token ID.
selected_tokens = [0, 1, 2, 100, 101, 500, 1000, 1001]

# Store the IDs in ascending order as an int32 tensor under the "vocab_map" key.
vocab_map = torch.tensor(sorted(selected_tokens), dtype=torch.int32)
save_file({"vocab_map": vocab_map}, "vocab_map.safetensors")
reduced_vocab.json#
{
"vocab_size": 151936,
"reduced_vocab_size": 16384
}
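Before exporting with hand-made files, it can help to check the selection against the documented constraints and derive the JSON from it so the two files stay consistent. The sketch below uses only the standard library. It assumes that reduced_vocab_size equals the number of entries in the map and that the map contains no duplicates; neither is stated explicitly above. The function names and the EOS ID of 2 are hypothetical.

```python
import json


def validate_vocab_map(vocab_map: list[int], vocab_size: int, eos_token_id: int) -> None:
    """Raise ValueError if the selection violates the documented constraints."""
    if not vocab_map:
        raise ValueError("vocab_map is empty")
    # Sorted ascending; strictness (no duplicates) is an assumption of this sketch.
    if any(b <= a for a, b in zip(vocab_map, vocab_map[1:])):
        raise ValueError("vocab_map must be sorted in strictly increasing order")
    if vocab_map[0] < 0 or vocab_map[-1] >= vocab_size:
        raise ValueError("vocab_map contains IDs outside the original vocabulary")
    if eos_token_id not in vocab_map:
        raise ValueError("vocab_map must include the EOS token")


def build_reduced_vocab_config(vocab_map: list[int], vocab_size: int) -> dict:
    """Build the reduced_vocab.json content from the selection."""
    return {"vocab_size": vocab_size, "reduced_vocab_size": len(vocab_map)}


vocab_map = [0, 1, 2, 100, 101, 500, 1000, 1001]
validate_vocab_map(vocab_map, vocab_size=151936, eos_token_id=2)  # hypothetical EOS ID

config = build_reduced_vocab_config(vocab_map, vocab_size=151936)
config_text = json.dumps(config, indent=2)
print(config_text)  # write this text to reduced_vocab.json
```

In a real pipeline the EOS ID would come from the tokenizer or model config rather than a literal.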
Notes#
Vocabulary reduction only affects LM head computation; other layers are unchanged
The vocab_map.safetensors file is automatically copied to the engine directory during export
Runtime transparently handles the mapping; no inference code changes required
Ensure your reduced vocabulary covers all tokens your model needs to generate