Token Encoder

class TokenEncoder

Encodes pretokenized text into token IDs and decodes token IDs back to text, using a selectable tokenization algorithm.

Public Types

enum Type

Token encoder algorithm type.

Values:

enumerator BPE

Byte Pair Encoding.

enumerator SENTENCEPIECE

SentencePiece encoding (unimplemented).

enumerator WORDPIECE

WordPiece encoding (unimplemented).

Public Functions

TokenEncoder(Type type = BPE) noexcept

Constructor for token encoder.

Parameters:

type – Encoder algorithm type (default: BPE)

~TokenEncoder() noexcept = default

bool initialize(
TokenToRanks const &vocab,
TokenToRanks const &specialTokens = {}
)

Initialize with vocabulary.

Parameters:
  • vocab – Main vocabulary mapping

  • specialTokens – Special tokens mapping

Returns:

true if vocab is non-empty and initialization completes; false if vocab is empty

bool encode(
std::string const &piece,
std::vector<Rank> &output
) const noexcept

Encode a piece of text using the algorithm.

Parameters:
  • piece – Input text piece (already pretokenized)

  • output – Vector to store token IDs

Returns:

true if piece is within the size limit and encoding succeeds; false if piece exceeds the 1 MB limit or the encoding algorithm fails

bool decode(
std::vector<Rank> const &tokens,
std::string &output,
bool skipSpecialTokens = false
) const noexcept

Decode token IDs back to text.

Parameters:
  • tokens – Vector of token IDs

  • output – String to store result

  • skipSpecialTokens – Whether to skip special tokens

Returns:

true if all tokens are found in vocabulary or skipped successfully; false if unknown tokens are encountered and skipSpecialTokens is false
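Because both encode() and decode() are noexcept and report failure only through their return values, callers need to check each result. A minimal sketch of an encode/decode round trip follows. The helper name roundTrip is hypothetical, Encoder stands for any type with the two signatures documented above, and RankT stands in for the library's Rank alias, whose definition is not shown on this page.

```cpp
#include <string>
#include <vector>

// Sketch only: roundTrip is a hypothetical helper, not part of the library.
// Encoder is any type exposing the documented encode()/decode() signatures;
// RankT is an assumed stand-in for the library's Rank alias.
template <typename RankT, typename Encoder>
bool roundTrip(Encoder const& encoder, std::string const& piece, std::string& text)
{
    std::vector<RankT> ids;

    // encode() signals failure (for example an oversized piece) by returning false.
    if (!encoder.encode(piece, ids))
    {
        return false;
    }

    // Start from an empty string: the reference does not state whether
    // decode() appends to or overwrites its output argument.
    text.clear();

    // Keep special tokens in the decoded text.
    return encoder.decode(ids, text, /*skipSpecialTokens=*/false);
}
```

With a real encoder this might be called as roundTrip<Rank>(encoder, piece, text), assuming Rank is visible in the calling scope.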

inline Type getType() const noexcept

Get encoder type.

Returns:

Encoder algorithm type

inline size_t getVocabSize() const noexcept

Get vocabulary size.

Returns:

Total number of tokens in vocabulary

bool hasToken(std::string const &token) const noexcept

Check if token exists in vocabulary.

Parameters:

token – Token string to check

Returns:

true if token exists, false otherwise

Rank getTokenRank(std::string const &token) const noexcept

Get token rank from token string.

Parameters:

token – Token string

Returns:

Token rank/ID
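This page does not say what getTokenRank() returns for a token that is not in the vocabulary. A cautious caller can therefore check hasToken() first. The sketch below wraps that pattern in a hypothetical helper, tryGetTokenRank, written in the same bool-plus-output-parameter style as encode() and decode(). Encoder and RankT are template placeholders for the real class and its Rank alias.

```cpp
#include <string>

// Sketch only: tryGetTokenRank is a hypothetical helper, not a library function.
// It relies solely on the documented hasToken() and getTokenRank() members.
template <typename Encoder, typename RankT>
bool tryGetTokenRank(Encoder const& encoder, std::string const& token, RankT& rank)
{
    // Guard the lookup, since the result for a missing token is unspecified here.
    if (!encoder.hasToken(token))
    {
        return false;
    }
    rank = encoder.getTokenRank(token);
    return true;
}
```

The output argument is left untouched when the token is missing, so callers can keep a default value in it.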

std::string getRankToken(Rank rank) const

Get token string from rank.

Parameters:

rank – Token rank/ID

Returns:

Token string

inline void setByteFallback(bool enable) noexcept

Enable or disable byte-level fallback for unknown tokens.

Parameters:

enable – Whether to enable byte fallback

inline void setMergePriorities(
std::unordered_map<std::string, Rank> mergePriorities
) noexcept

Set explicit BPE merge priorities from the tokenizer.json merges list.

When set, the BPE algorithm uses these priorities instead of vocabulary rank to determine merge order. This is required for SentencePiece-style tokenizers (e.g. Gemma), where merge order differs from vocabulary rank order.

Parameters:

mergePriorities – Map from null-byte-separated pair key ("a\0b") to merge priority (0 = highest)
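Building the null-byte-separated key needs some care in C++: constructing a std::string from the literal "a\0b" stops at the first null byte, so the key would silently become "a". The sketch below appends the separator as a single char instead. The helper names makeMergeKey and buildMergePriorities are hypothetical, the Rank alias is an assumed integer type standing in for the library's own definition, and the merges list is assumed to be already parsed into (left, right) pairs in file order.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumption: Rank is an integer type. The real alias comes from the library.
using Rank = std::int32_t;

// Hypothetical helper: builds the "a\0b" pair key.
// Appending '\0' as a char preserves the embedded null byte, whereas
// std::string("a\0b") would be truncated at the first '\0'.
inline std::string makeMergeKey(std::string const& left, std::string const& right)
{
    std::string key;
    key.reserve(left.size() + 1 + right.size());
    key += left;
    key += '\0';
    key += right;
    return key;
}

// Hypothetical helper: assigns priority 0 to the first merge, 1 to the second,
// and so on, matching "0 = highest" in the parameter description.
inline std::unordered_map<std::string, Rank> buildMergePriorities(
    std::vector<std::pair<std::string, std::string>> const& merges)
{
    std::unordered_map<std::string, Rank> priorities;
    priorities.reserve(merges.size());
    Rank priority = 0;
    for (auto const& merge : merges)
    {
        // emplace does not overwrite, so a repeated pair keeps its
        // earliest (highest-priority) entry.
        priorities.emplace(makeMergeKey(merge.first, merge.second), priority);
        ++priority;
    }
    return priorities;
}
```

The resulting map could then be handed to setMergePriorities(), provided its value type matches the library's Rank.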