Token Encoder
-
class TokenEncoder
Encodes and decodes tokens using a selectable tokenization algorithm.
Public Types
Public Functions
-
TokenEncoder(Type type = BPE) noexcept
Constructor for token encoder.
- Parameters:
type – Encoder algorithm type (default: BPE)
-
~TokenEncoder() noexcept = default
-
bool initialize(TokenToRanks const &vocab, TokenToRanks const &specialTokens = {})
Initialize with vocabulary.
- Parameters:
vocab – Main vocabulary mapping
specialTokens – Special tokens mapping
- Returns:
true if vocab is non-empty and initialization completes; false if vocab is empty
-
bool encode(std::string const &piece, std::vector<Rank> &output)
Encode a piece of text using the algorithm.
- Parameters:
piece – Input text piece (already pretokenized)
output – Vector to store token IDs
- Returns:
true if piece is within the size limit and encoding completes successfully; false if piece exceeds the 1 MB limit or the encoding algorithm fails
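The following is a minimal sketch of rank-based BPE for a single pretokenized piece, written to illustrate the return contract described above. It is not the library's implementation. The `Rank` and `TokenToRanks` aliases, the helper name `bpeEncodeSketch`, the exact byte bound, and the choice to append to `output` are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assumptions: the real Rank and TokenToRanks typedefs may differ.
using Rank = std::uint32_t;
using TokenToRanks = std::unordered_map<std::string, Rank>;

// The documentation mentions a 1 MB limit; the exact byte count is assumed here.
constexpr std::size_t kMaxPieceBytes = 1024 * 1024;

// Hypothetical helper: greedy BPE that always merges the adjacent pair whose
// concatenation has the lowest vocabulary rank. Quadratic, for clarity only.
bool bpeEncodeSketch(TokenToRanks const &vocab, std::string const &piece, std::vector<Rank> &output)
{
    if (piece.size() > kMaxPieceBytes)
    {
        return false; // piece exceeds the size limit
    }

    // Start with one part per input byte.
    std::vector<std::string> parts;
    parts.reserve(piece.size());
    for (char c : piece)
    {
        parts.push_back(std::string(1, c));
    }

    while (parts.size() > 1)
    {
        std::size_t bestIndex = parts.size(); // sentinel: no mergeable pair found
        Rank bestRank = 0;
        for (std::size_t i = 0; i + 1 < parts.size(); ++i)
        {
            auto it = vocab.find(parts[i] + parts[i + 1]);
            if (it != vocab.end() && (bestIndex == parts.size() || it->second < bestRank))
            {
                bestIndex = i;
                bestRank = it->second;
            }
        }
        if (bestIndex == parts.size())
        {
            break; // nothing left to merge
        }
        parts[bestIndex] += parts[bestIndex + 1];
        parts.erase(parts.begin() + static_cast<std::ptrdiff_t>(bestIndex) + 1);
    }

    // Look up every remaining part; fail without touching output if one is unknown.
    std::vector<Rank> ranks;
    ranks.reserve(parts.size());
    for (auto const &part : parts)
    {
        auto it = vocab.find(part);
        if (it == vocab.end())
        {
            return false;
        }
        ranks.push_back(it->second);
    }
    output.insert(output.end(), ranks.begin(), ranks.end());
    return true;
}
```

With a vocabulary containing `a`, `b`, `c`, `ab`, and `abc`, the piece `abc` first merges `a` and `b`, then merges `ab` and `c`, yielding the single rank of `abc`.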
-
bool decode(std::vector<Rank> const &tokens, std::string &output, bool skipSpecialTokens = false)
Decode token IDs back to text.
- Parameters:
tokens – Vector of token IDs
output – String to store result
skipSpecialTokens – Whether to skip special tokens
- Returns:
true if all tokens are found in vocabulary or skipped successfully; false if unknown tokens are encountered and skipSpecialTokens is false
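One literal reading of the return contract above can be sketched as follows. This is an illustration only, not the library's code: the reverse maps (`RankToToken`), the helper name `decodeSketch`, and the decision to silently drop unknown IDs when `skipSpecialTokens` is true are assumptions.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumptions: the real Rank typedef and internal lookup tables may differ.
using Rank = std::uint32_t;
using RankToToken = std::unordered_map<Rank, std::string>;

// Hypothetical helper mirroring the documented decode() contract.
bool decodeSketch(RankToToken const &vocab, RankToToken const &specialTokens,
    std::vector<Rank> const &tokens, std::string &output, bool skipSpecialTokens = false)
{
    std::string text;
    for (Rank rank : tokens)
    {
        auto regular = vocab.find(rank);
        if (regular != vocab.end())
        {
            text += regular->second;
            continue;
        }
        auto special = specialTokens.find(rank);
        if (special != specialTokens.end())
        {
            if (!skipSpecialTokens)
            {
                text += special->second; // keep special tokens unless asked to skip
            }
            continue;
        }
        // Unknown ID: fail unless skipping is enabled (assumed reading of the docs).
        if (!skipSpecialTokens)
        {
            return false;
        }
    }
    output = std::move(text); // only overwrite output on success
    return true;
}
```

Building the text in a local string and assigning it at the end leaves `output` untouched when the function returns false, which keeps the failure case easy to reason about for callers.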
-
inline size_t getVocabSize() const noexcept
Get vocabulary size.
- Returns:
Total number of tokens in vocabulary
-
bool hasToken(std::string const &token) const noexcept
Check if token exists in vocabulary.
- Parameters:
token – Token string to check
- Returns:
true if token exists, false otherwise
-
Rank getTokenRank(std::string const &token) const noexcept
Get token rank from token string.
- Parameters:
token – Token string
- Returns:
Token rank/ID
-
std::string getRankToken(Rank rank) const
Get token string from rank.
- Parameters:
rank – Token rank/ID
- Returns:
Token string
-
inline void setByteFallback(bool enable) noexcept
Enable or disable byte-level fallback for unknown tokens.
- Parameters:
enable – Whether to enable byte fallback
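To illustrate what byte-level fallback typically means, here is a minimal sketch that uses the SentencePiece-style `<0xNN>` byte token naming. Whether this class uses the same naming is an assumption, as are the `Rank` and `TokenToRanks` aliases and the helper names `byteTokenName` and `encodeWithByteFallback`.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

// Assumptions: the real Rank and TokenToRanks typedefs may differ.
using Rank = std::uint32_t;
using TokenToRanks = std::unordered_map<std::string, Rank>;

// Format one byte as a SentencePiece-style byte token, e.g. 0x0A -> "<0x0A>".
std::string byteTokenName(unsigned char byte)
{
    char buffer[8]; // "<0xNN>" is 6 characters plus the terminating NUL
    std::snprintf(buffer, sizeof(buffer), "<0x%02X>", static_cast<unsigned>(byte));
    return buffer;
}

// Hypothetical helper: emit the piece's own rank if it is in the vocabulary,
// otherwise (when fallback is enabled) emit one byte token per input byte.
bool encodeWithByteFallback(TokenToRanks const &vocab, std::string const &piece,
    bool byteFallback, std::vector<Rank> &output)
{
    auto whole = vocab.find(piece);
    if (whole != vocab.end())
    {
        output.push_back(whole->second);
        return true;
    }
    if (!byteFallback)
    {
        return false; // unknown piece and no fallback available
    }

    std::vector<Rank> ranks;
    ranks.reserve(piece.size());
    for (unsigned char byte : piece)
    {
        auto it = vocab.find(byteTokenName(byte));
        if (it == vocab.end())
        {
            return false; // vocabulary lacks this byte token
        }
        ranks.push_back(it->second);
    }
    output.insert(output.end(), ranks.begin(), ranks.end());
    return true;
}
```

For example, the two-byte UTF-8 sequence for `é` (0xC3 0xA9) would be emitted as the ranks of `<0xC3>` and `<0xA9>` when the piece itself is not in the vocabulary.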
-
inline void setMergePriorities(std::unordered_map<std::string, Rank> mergePriorities)
Set explicit BPE merge priorities from tokenizer.json merges list.
When set, the BPE algorithm uses these priorities instead of vocabulary rank to determine merge order. This is required for SentencePiece-style tokenizers (e.g. Gemma), where merge order differs from vocabulary rank order.
- Parameters:
mergePriorities – Map from null-byte-separated pair key ("a\0b") to merge priority (0 = highest)
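Building the null-byte-separated keys needs some care in C++, because constructing a `std::string` from the literal `"a\0b"` stops at the embedded NUL. The sketch below shows one way to build the map from an ordered merges list. The `Rank` alias and the helper names `makePairKey` and `buildMergePriorities` are hypothetical, and parsing of tokenizer.json itself is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumption: Rank is an unsigned 32-bit integer; the real typedef may differ.
using Rank = std::uint32_t;

// Join two symbols with an embedded NUL byte. Appending the character '\0'
// keeps the separator inside the std::string.
std::string makePairKey(std::string const &left, std::string const &right)
{
    std::string key;
    key.reserve(left.size() + 1 + right.size());
    key += left;
    key.push_back('\0');
    key += right;
    return key;
}

// Convert an ordered merges list (first entry = highest priority) into a map
// from pair key to priority, where 0 is the highest priority.
std::unordered_map<std::string, Rank> buildMergePriorities(
    std::vector<std::pair<std::string, std::string>> const &merges)
{
    std::unordered_map<std::string, Rank> priorities;
    priorities.reserve(merges.size());
    for (std::size_t i = 0; i < merges.size(); ++i)
    {
        // emplace keeps the earlier (higher-priority) entry if a pair repeats.
        priorities.emplace(makePairKey(merges[i].first, merges[i].second), static_cast<Rank>(i));
    }
    return priorities;
}
```

The resulting map has the shape that `setMergePriorities` expects and could be passed to it directly.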