Token Encoder

class TokenEncoder

Token encoder class that supports different tokenization algorithms.

Public Types

enum Type

Token encoder algorithm type.

Values:

enumerator BPE: Byte Pair Encoding.

enumerator SENTENCEPIECE: SentencePiece encoding (not implemented).

enumerator WORDPIECE: WordPiece encoding (not implemented).

Public Functions

TokenEncoder(Type type = BPE) noexcept

Constructor for token encoder.

Parameters: type – Encoder algorithm type (default: BPE)

~TokenEncoder() noexcept = default

bool initialize(TokenToRanks const &vocab, TokenToRanks const &specialTokens = {})

Initialize with vocabulary.

Parameters:

vocab – Main vocabulary mapping
specialTokens – Special tokens mapping

Returns:

true if vocab is non-empty and initialization completes; false if vocab is empty

bool encode(std::string const &piece, std::vector<Rank> &output) const noexcept

Encode a piece of text using the algorithm.

Parameters:

piece – Input text piece (already pretokenized)
output – Vector to store token IDs

Returns:

true if piece is within the size limit and encoding succeeds; false if piece exceeds the 1 MB limit or the encoding algorithm fails
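For orientation, the following is a minimal, generic sketch of rank-driven BPE on a single pretokenized piece. It is not the library's implementation. The names `bpeEncodeSketch` and `kMaxPieceBytes`, the `Rank` alias (assumed here to be a 32-bit unsigned integer), the `TokenToRanks` alias (assumed here to be a string-to-rank map), the reading of "1 MB" as 1024 * 1024 bytes, and the choice to append to `output` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Assumptions: the real aliases are defined by the library headers.
using Rank = std::uint32_t;
using TokenToRanks = std::unordered_map<std::string, Rank>;

// Assumed interpretation of the documented 1 MB piece limit.
constexpr std::size_t kMaxPieceBytes = 1024 * 1024;

// Hypothetical free function that mirrors the documented encode() contract.
bool bpeEncodeSketch(TokenToRanks const& vocab, std::string const& piece, std::vector<Rank>& output)
{
    if (piece.size() > kMaxPieceBytes)
    {
        return false; // piece exceeds the size limit
    }

    // Start with one part per byte of the input.
    std::vector<std::string> parts;
    parts.reserve(piece.size());
    for (char c : piece)
    {
        parts.emplace_back(1, c);
    }

    // Repeatedly merge the adjacent pair whose concatenation has the lowest rank.
    while (parts.size() > 1)
    {
        std::size_t bestIdx = parts.size(); // sentinel: no mergeable pair found
        Rank bestRank = 0;
        for (std::size_t i = 0; i + 1 < parts.size(); ++i)
        {
            auto it = vocab.find(parts[i] + parts[i + 1]);
            if (it != vocab.end() && (bestIdx == parts.size() || it->second < bestRank))
            {
                bestIdx = i;
                bestRank = it->second;
            }
        }
        if (bestIdx == parts.size())
        {
            break; // nothing left to merge
        }
        assert(bestIdx + 1 < parts.size());
        parts[bestIdx] += parts[bestIdx + 1];
        parts.erase(parts.begin() + static_cast<std::ptrdiff_t>(bestIdx + 1));
    }

    // Map every remaining part to its rank; fail without touching output otherwise.
    std::vector<Rank> ranks;
    ranks.reserve(parts.size());
    for (auto const& part : parts)
    {
        auto it = vocab.find(part);
        if (it == vocab.end())
        {
            return false;
        }
        ranks.push_back(it->second);
    }
    output.insert(output.end(), ranks.begin(), ranks.end());
    return true;
}
```

The quadratic scan keeps the sketch short; a production encoder would typically track candidate pairs in a priority queue instead.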

bool decode(std::vector<Rank> const &tokens, std::string &output, bool skipSpecialTokens = false) const noexcept

Decode token IDs back to text.

Parameters:

tokens – Vector of token IDs
output – String to store result
skipSpecialTokens – Whether to skip special tokens

Returns:

true if all tokens are found in the vocabulary or skipped successfully; false if unknown tokens are encountered and skipSpecialTokens is false
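The return contract above can be illustrated with a small stand-alone sketch. It follows one literal reading of the description: special tokens are dropped when `skipSpecialTokens` is true, and unknown IDs cause a failure only when `skipSpecialTokens` is false. The function `decodeSketch`, the `RankToToken` alias, and the `Rank` alias (assumed to be a 32-bit unsigned integer) are hypothetical and are not part of the documented API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumptions: illustrative aliases, not the library's definitions.
using Rank = std::uint32_t;
using RankToToken = std::unordered_map<Rank, std::string>;

// Hypothetical free function that mirrors the documented decode() contract.
bool decodeSketch(RankToToken const& vocab, RankToToken const& specialTokens,
    std::vector<Rank> const& tokens, std::string& output, bool skipSpecialTokens = false)
{
    std::string text;
    for (Rank rank : tokens)
    {
        auto special = specialTokens.find(rank);
        if (special != specialTokens.end())
        {
            if (!skipSpecialTokens)
            {
                text += special->second;
            }
            continue;
        }
        auto regular = vocab.find(rank);
        if (regular == vocab.end())
        {
            if (skipSpecialTokens)
            {
                continue; // assumed: unknown IDs are skipped in this mode
            }
            return false; // unknown ID; output is left unchanged
        }
        text += regular->second;
    }
    output = std::move(text);
    return true;
}
```

Building the result in a local string means a failed call leaves `output` untouched, which is a design choice of this sketch rather than a documented guarantee.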

inline Type getType() const noexcept

Get encoder type.

Returns: Encoder algorithm type

inline size_t getVocabSize() const noexcept

Get vocabulary size.

Returns: Total number of tokens in vocabulary

bool hasToken(std::string const &token) const noexcept

Check if token exists in vocabulary.

Parameters: token – Token string to check
Returns: true if token exists, false otherwise

Rank getTokenRank(std::string const &token) const noexcept

Get token rank from token string.

Parameters: token – Token string
Returns: Token rank/ID

std::string getRankToken(Rank rank) const

Get token string from rank.

Parameters: rank – Token rank/ID
Returns: Token string

inline void setByteFallback(bool enable) noexcept

Enable or disable byte-level fallback for unknown tokens.

Parameters: enable – Whether to enable byte fallback

inline void setMergePriorities(std::unordered_map<std::string, Rank> mergePriorities) noexcept

Set explicit BPE merge priorities from tokenizer.json merges list.

When set, the BPE algorithm uses these priorities instead of vocabulary rank to determine merge order. This is required for SentencePiece-style tokenizers (e.g. Gemma), where merge order differs from vocabulary rank order.

Parameters: mergePriorities – Map from null-byte-separated pair key ("a\0b") to merge priority (0 = highest)
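As a sketch of how such a map might be assembled, the helpers below build null-byte-separated keys from an ordered merges list. The names `makePairKey` and `buildMergePriorities` are hypothetical, the `Rank` alias is assumed to be a 32-bit unsigned integer, and the input is assumed to be already parsed into (left, right) string pairs in the order they appear in tokenizer.json. One detail worth noting is that constructing a `std::string` from the literal `"a\0b"` stops at the null byte, so the separator is appended explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumption: the real Rank alias is defined by the library headers.
using Rank = std::uint32_t;

// Build the "a\0b" key. push_back keeps the embedded null byte, whereas
// std::string("a\0b") would yield just "a".
std::string makePairKey(std::string const& left, std::string const& right)
{
    std::string key;
    key.reserve(left.size() + 1 + right.size());
    key += left;
    key.push_back('\0');
    key += right;
    return key;
}

// Hypothetical helper: the first merge in the list gets priority 0 (highest).
std::unordered_map<std::string, Rank> buildMergePriorities(
    std::vector<std::pair<std::string, std::string>> const& merges)
{
    assert(merges.size() <= std::numeric_limits<Rank>::max());
    std::unordered_map<std::string, Rank> priorities;
    priorities.reserve(merges.size());
    for (std::size_t i = 0; i < merges.size(); ++i)
    {
        // emplace keeps the earlier (higher-priority) entry if a pair repeats.
        priorities.emplace(makePairKey(merges[i].first, merges[i].second), static_cast<Rank>(i));
    }
    return priorities;
}
```

The resulting map could then be handed to `setMergePriorities`, for example with `std::move` to avoid copying it.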