Vocabulary Reduction#
Overview#
Vocabulary reduction restricts generation logits to a task-specific subset of token IDs. The transformer layers and input embedding table keep the original tokenizer vocabulary; only the exported logits and runtime sampling vocabulary are reduced.
Important: You are responsible for creating the vocabulary mapping. Because the optimal reduced vocabulary depends on your specific task and output distribution, only a simple reference script is provided (python -m llm_loader.vocab_reduction). Customize the vocabulary selection for your use case.
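As a rough illustration of the overview (a toy sketch, not the exporter's implementation), the reduction amounts to keeping only the LM-head rows for the selected token IDs and translating each sampled reduced index back to its original token ID. The helper names and the tiny weight matrix below are hypothetical.

```python
# Toy sketch: hypothetical helpers, not part of llm_loader or the runtime.

def reduce_lm_head(lm_head_rows, vocab_map):
    """Keep only the LM-head rows whose original token IDs are in vocab_map."""
    return [lm_head_rows[token_id] for token_id in vocab_map]


def compute_logits(hidden, rows):
    """One logit per row: dot product of the hidden state with that row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in rows]


def greedy_original_token(hidden, reduced_rows, vocab_map):
    """Pick the best reduced index, then map it back to the original ID."""
    reduced_logits = compute_logits(hidden, reduced_rows)
    reduced_index = max(range(len(reduced_logits)), key=reduced_logits.__getitem__)
    return vocab_map[reduced_index]


# Original vocabulary of 6 tokens, hidden size 2 (made-up weights).
lm_head = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0], [1.0, 1.0], [-1.0, 0.0]]
vocab_map = [1, 3, 4]  # sorted original token IDs to keep

reduced_head = reduce_lm_head(lm_head, vocab_map)
token_id = greedy_original_token([1.0, 1.0], reduced_head, vocab_map)
print(len(reduced_head), token_id)
```

Only the output projection shrinks in this sketch; the hidden state is computed exactly as before, which mirrors the statement that the transformer layers and input embeddings keep the original vocabulary.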
Quick Start#
End-to-End Workflow (Standard Decoding)#
The following example uses Qwen3-0.6B. Smaller models benefit most from vocabulary reduction because the LM head accounts for a larger fraction of total compute:
export PYTHONPATH=/path/to/TensorRT-Edge-LLM:/path/to/TensorRT-Edge-LLM/experimental:$PYTHONPATH

# Optional: quantize the source checkpoint first
python -m experimental.quantization llm \
    --model_dir Qwen/Qwen3-0.6B \
    --output_dir qwen3_0_6b_nvfp4 \
    --quantization nvfp4 \
    --lm_head_quantization nvfp4

# Step 1: Generate vocabulary mapping
python -m llm_loader.vocab_reduction \
    --model_dir Qwen/Qwen3-0.6B \
    --output_dir reduced_vocab \
    --reduced_vocab_size 16384 \
    --method input_aware \
    --max_samples 50000

# Step 2: Export model with reduced vocabulary
python -m llm_loader.export_all_cli \
    qwen3_0_6b_nvfp4 \
    qwen3_0_6b_onnx \
    --reduced-vocab-dir reduced_vocab/

# Step 3: Build TensorRT engine (unchanged)
./build/examples/llm/llm_build \
    --onnxDir qwen3_0_6b_onnx/llm \
    --engineDir engines/qwen3-0.6b \
    --maxBatchSize 1

# Step 4: Run inference (unchanged)
./build/examples/llm/llm_inference \
    --engineDir engines/qwen3-0.6b \
    --inputFile input.json \
    --outputFile output.json
The runtime automatically applies vocabulary reduction when config.json has reduced_vocab_size and vocab_map.safetensors is present. llm_loader reduces the LM-head weights during export, including FP16, FP8, NVFP4, MXFP8, INT8 SmoothQuant, and packed INT4 LM heads. Packed INT4 LM heads require group_size=128 and a reduced_vocab_size that is a multiple of 128.
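The conditions in the paragraph above can be checked before building an engine. The following is a minimal sketch under stated assumptions: the helper name check_reduced_vocab_export is hypothetical, it assumes config.json and vocab_map.safetensors sit in the same directory, and the lm_head_is_packed_int4 flag is supplied by the caller rather than read from any real config field.

```python
import json
from pathlib import Path

INT4_GROUP_SIZE = 128  # group size required for packed INT4 LM heads (per the text above)


def check_reduced_vocab_export(export_dir, lm_head_is_packed_int4=False):
    """Return a list of problems found; an empty list means the checks passed.

    Hypothetical helper: assumes config.json and vocab_map.safetensors are
    stored side by side in export_dir.
    """
    export_dir = Path(export_dir)
    problems = []

    config_path = export_dir / "config.json"
    if not config_path.is_file():
        return [f"missing {config_path.name}"]

    config = json.loads(config_path.read_text())
    reduced = config.get("reduced_vocab_size")
    if reduced is None:
        problems.append("config.json has no reduced_vocab_size")
    if not (export_dir / "vocab_map.safetensors").is_file():
        problems.append("vocab_map.safetensors is missing")
    if lm_head_is_packed_int4 and reduced is not None and reduced % INT4_GROUP_SIZE != 0:
        problems.append(
            f"reduced_vocab_size {reduced} is not a multiple of {INT4_GROUP_SIZE}"
        )
    return problems
```

For example, calling check_reduced_vocab_export("qwen3_0_6b_onnx/llm") after Step 2 would return an empty list if both files are in place, assuming that directory is where the exporter writes them.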
EAGLE Speculative Decoding Support#
When using vocabulary reduction for EAGLE base models, you must include all tokens referenced in the draft’s d2t.safetensors mapping.
Prerequisite: Export the draft model first to generate d2t.safetensors, then use the --d2t_path flag:
# Step 0: Export EAGLE draft model first (generates d2t.safetensors)
python -m llm_loader.export_all_cli \
    AngelSlim/Qwen3-4B_eagle3 \
    draft_onnx

# Step 1: Generate vocabulary mapping with d2t constraint
python -m llm_loader.vocab_reduction \
    --model_dir Qwen/Qwen3-4B \
    --output_dir reduced_vocab \
    --reduced_vocab_size 16384 \
    --method input_aware \
    --d2t_path draft_onnx/llm/d2t.safetensors

# Step 2: Export base model with reduced vocabulary
python -m llm_loader.export_all_cli \
    Qwen/Qwen3-4B \
    base_onnx \
    --eagle-base \
    --reduced-vocab-dir reduced_vocab/

# Step 3-4: Build and run as usual...
This ensures all draft-to-target token mappings remain valid after vocabulary reduction.
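The d2t constraint can be sketched as a set union followed by a containment check. This is an illustrative sketch, not the logic of llm_loader.vocab_reduction. It assumes d2t holds one offset per draft token ID, so that the target ID is draft_id + d2t[draft_id]; if your file stores absolute target IDs, use them directly. The function names are hypothetical, and plain lists stand in for the tensor loaded from d2t.safetensors.

```python
def target_ids_from_d2t(d2t_offsets):
    """Target token IDs referenced by the draft model.

    Assumption: d2t_offsets[i] is the offset added to draft token ID i to
    obtain the corresponding target (base-model) token ID.
    """
    return {draft_id + offset for draft_id, offset in enumerate(d2t_offsets)}


def merge_required_tokens(selected_tokens, d2t_offsets):
    """Union of the task-selected tokens and every draft-referenced token, sorted."""
    return sorted(set(selected_tokens) | target_ids_from_d2t(d2t_offsets))


def missing_draft_tokens(vocab_map, d2t_offsets):
    """Draft-referenced target IDs that are absent from vocab_map."""
    return sorted(target_ids_from_d2t(d2t_offsets) - set(vocab_map))


# Toy example: 4 draft tokens mapping to target IDs 0, 5, 9, 20.
d2t = [0, 4, 7, 17]
task_tokens = [0, 1, 2, 9]

print(missing_draft_tokens(task_tokens, d2t))
vocab_map = merge_required_tokens(task_tokens, d2t)
print(vocab_map, missing_draft_tokens(vocab_map, d2t))
```

If the merged list exceeds the target size, the safer choice is to drop low-priority task tokens rather than any draft-referenced token, since the latter would invalidate the draft-to-target mapping.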
Script Reference#
| Argument | Required | Default | Description |
|---|---|---|---|
| --model_dir | Yes | - | Path to model directory (tokenizer + config) |
| --output_dir | Yes | - | Output directory for vocabulary files |
| --reduced_vocab_size | Yes | - | Target vocabulary size |
| --method | No |  | Vocabulary selection method (see Methods below) |
| --max_samples | No |  | Max samples from dataset |
| --d2t_path | No | - | EAGLE d2t.safetensors path |
Methods:
input_aware: Analyzes CNN/DailyMail summaries + documents, applies input-aware filtering
frequency: Simple token frequency analysis on input documents
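For intuition, the frequency method can be approximated as counting token IDs over a tokenized corpus and keeping the most common ones. The sketch below is an assumption about the general approach, not the reference script's code; select_by_frequency and its parameters are hypothetical, and the corpus is given as pre-tokenized lists of IDs.

```python
from collections import Counter


def select_by_frequency(tokenized_docs, reduced_vocab_size, required_ids=()):
    """Pick up to reduced_vocab_size token IDs: required IDs first, then most frequent.

    required_ids is for tokens that must always be kept, such as EOS.
    Returns the selected IDs sorted in ascending order.
    """
    required = set(required_ids)
    if len(required) > reduced_vocab_size:
        raise ValueError("more required tokens than reduced_vocab_size")

    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    selected = set(required)
    # most_common() yields (token_id, count) pairs from most to least frequent.
    for token_id, _ in counts.most_common():
        if len(selected) >= reduced_vocab_size:
            break
        selected.add(token_id)
    return sorted(selected)


docs = [[5, 7, 7, 9], [7, 9, 11], [9, 9, 42]]
print(select_by_frequency(docs, reduced_vocab_size=3, required_ids=[2]))
```

If the corpus contains fewer distinct tokens than requested, this sketch returns a shorter list, so check the final length before writing reduced_vocab_size anywhere.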
Custom Vocabulary Mapping#
The provided script uses CNN/DailyMail as a reference dataset. For production use, create your own files with the following formats:
vocab_map.safetensors#
| Key | Type | Shape | Description |
|---|---|---|---|
| vocab_map | int32 | [reduced_vocab_size] | Sorted original token IDs to keep (must include EOS token) |
import torch
from safetensors.torch import save_file

# Original token IDs to keep. The list must include the EOS token ID;
# extend this placeholder list with the tokens your task needs.
selected_tokens = [0, 1, 2, 100, 101, 500, 1000, 1001]

# Store the IDs in ascending order as a 1-D int32 tensor under the key "vocab_map".
vocab_map = torch.tensor(sorted(selected_tokens), dtype=torch.int32)
save_file({"vocab_map": vocab_map}, "vocab_map.safetensors")
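Before saving, it can help to validate the selection against the requirements listed above. This is a minimal sketch, assuming you know the EOS token ID and original vocabulary size for your tokenizer; validate_vocab_map is a hypothetical helper, and the no-duplicates and range checks are reasonable assumptions rather than documented requirements.

```python
def validate_vocab_map(token_ids, eos_token_id, vocab_size):
    """Raise ValueError if token_ids is not a usable vocabulary mapping."""
    if not token_ids:
        raise ValueError("vocab_map is empty")
    if any(a >= b for a, b in zip(token_ids, token_ids[1:])):
        # Strictly increasing implies both sorted and free of duplicates.
        raise ValueError("token IDs must be sorted ascending without duplicates")
    if token_ids[0] < 0 or token_ids[-1] >= vocab_size:
        raise ValueError("token IDs must lie in [0, vocab_size)")
    if eos_token_id not in token_ids:
        raise ValueError("EOS token is missing from vocab_map")


# Placeholder values for illustration only; substitute your tokenizer's IDs.
validate_vocab_map([0, 1, 2, 100, 101, 500], eos_token_id=2, vocab_size=151936)
```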
reduced_vocab.json#
{
  "vocab_size": 151936,
  "reduced_vocab_size": 16384
}
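Writing the file programmatically keeps reduced_vocab_size in step with the mapping. This sketch assumes reduced_vocab_size should equal the number of entries in vocab_map (implied by the formats above, though not stated explicitly); write_reduced_vocab_json is a hypothetical helper.

```python
import json
from pathlib import Path


def write_reduced_vocab_json(output_dir, vocab_size, vocab_map):
    """Write reduced_vocab.json into output_dir and return its path.

    Assumption: reduced_vocab_size equals the number of kept token IDs.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    payload = {"vocab_size": vocab_size, "reduced_vocab_size": len(vocab_map)}
    path = output_dir / "reduced_vocab.json"
    path.write_text(json.dumps(payload, indent=2) + "\n")
    return path
```

A call such as write_reduced_vocab_json("reduced_vocab", 151936, selected_tokens) would then place the file alongside vocab_map.safetensors, assuming both are kept in the directory passed to --reduced-vocab-dir.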
Notes#
Vocabulary reduction only affects output logits and sampling; other layers are unchanged
The vocab_map.safetensors file is automatically copied during export and engine build
Runtime transparently handles the mapping; no inference code changes required
Ensure your reduced vocabulary covers all tokens your model needs to generate
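One way to act on the last note is to measure how many tokens in representative target outputs fall outside the reduced vocabulary. This is a hedged sketch: coverage_report is a hypothetical helper, and reference_outputs stands for token-ID sequences obtained by tokenizing expected outputs for your task.

```python
from collections import Counter


def coverage_report(reference_outputs, vocab_map):
    """Return (coverage, missing) for tokenized reference outputs.

    coverage: fraction of token occurrences whose ID is in vocab_map.
    missing:  Counter of out-of-vocabulary token IDs and their counts.
    """
    kept = set(vocab_map)
    total = 0
    missing = Counter()
    for sequence in reference_outputs:
        for token_id in sequence:
            total += 1
            if token_id not in kept:
                missing[token_id] += 1
    covered = total - sum(missing.values())
    coverage = covered / total if total else 1.0
    return coverage, missing


outputs = [[1, 2, 100, 7], [100, 100, 9, 2]]
coverage, missing = coverage_report(outputs, vocab_map=[1, 2, 100])
print(f"coverage={coverage:.2f}, most frequent misses={missing.most_common(2)}")
```

Tokens that appear often in the missing counter are natural candidates to add to the selection before regenerating vocab_map.safetensors.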