Token Encoder

class TokenEncoder

Token encoder supporting multiple tokenization algorithms. Only BPE is currently implemented.

Public Types

enum Type

Token encoder algorithm type.

Values:

enumerator BPE

Byte Pair Encoding.

enumerator SENTENCEPIECE

SentencePiece encoding (unimplemented).

enumerator WORDPIECE

WordPiece encoding (unimplemented).

Public Functions

TokenEncoder(Type type = BPE)

Constructs a token encoder for the given algorithm type.

Parameters:

type – Encoder algorithm type (default: BPE)

~TokenEncoder() = default

bool initialize(TokenToRanks const &vocab, TokenToRanks const &specialTokens = {})

Initialize the encoder with a vocabulary and optional special tokens.

Parameters:
  • vocab – Main vocabulary, mapping each token to its rank

  • specialTokens – Special-token mapping (default: empty)

Returns:

true if vocab is non-empty and initialization completes; false if vocab is empty

bool encode(std::string const &piece, std::vector<Rank> &output) const

Encode a pretokenized piece of text with the configured algorithm.

Parameters:
  • piece – Input text piece (already pretokenized)

  • output – Vector to store token IDs

Returns:

true if piece is within the size limit and encoding succeeds; false if piece exceeds the 1 MB limit or the encoding algorithm fails
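To make the encode contract concrete, here is a minimal sketch of rank-based byte pair merging. It is not the library's implementation. The aliases for Rank and TokenToRanks, the constant kMaxPieceBytes, and the functions bpeEncodeSketch and bpeEncodeSketchDemo are all hypothetical names introduced for illustration, and appending to output rather than overwriting it is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed aliases: the real Rank and TokenToRanks definitions are not shown in this reference.
using Rank = std::int32_t;
using TokenToRanks = std::unordered_map<std::string, Rank>;

// Mirrors the documented 1 MB limit; the exact byte count is an assumption.
constexpr std::size_t kMaxPieceBytes = 1024 * 1024;

// Hypothetical helper, not part of TokenEncoder.
bool bpeEncodeSketch(TokenToRanks const& vocab, std::string const& piece, std::vector<Rank>& output)
{
    if (piece.size() > kMaxPieceBytes)
    {
        return false;
    }

    // Start with one part per byte.
    std::vector<std::string> parts;
    for (char c : piece)
    {
        parts.push_back(std::string(1, c));
    }

    // Repeatedly merge the adjacent pair whose concatenation has the lowest rank.
    while (parts.size() > 1)
    {
        Rank bestRank = std::numeric_limits<Rank>::max();
        std::size_t bestIdx = parts.size();
        for (std::size_t i = 0; i + 1 < parts.size(); ++i)
        {
            auto it = vocab.find(parts[i] + parts[i + 1]);
            if (it != vocab.end() && it->second < bestRank)
            {
                bestRank = it->second;
                bestIdx = i;
            }
        }
        if (bestIdx == parts.size())
        {
            break; // no mergeable pair left
        }
        parts[bestIdx] += parts[bestIdx + 1];
        parts.erase(parts.begin() + static_cast<std::ptrdiff_t>(bestIdx) + 1);
    }

    // Map parts to ranks; fail without touching output if any part is unknown.
    std::vector<Rank> ids;
    for (std::string const& part : parts)
    {
        auto it = vocab.find(part);
        if (it == vocab.end())
        {
            return false;
        }
        ids.push_back(it->second);
    }
    // Appending (rather than overwriting) is an assumption of this sketch.
    output.insert(output.end(), ids.begin(), ids.end());
    return true;
}

// Usage sketch with a toy vocabulary.
void bpeEncodeSketchDemo()
{
    TokenToRanks const vocab{{"a", 0}, {"b", 1}, {"c", 2}, {"ab", 3}, {"abc", 4}};
    std::vector<Rank> ids;
    bool const ok = bpeEncodeSketch(vocab, "abc", ids);
    assert(ok && ids.size() == 1 && ids[0] == 4); // "a","b","c" -> "ab","c" -> "abc"
    (void) ok;
}
```

The merge loop rescans every adjacent pair on each pass, which keeps the sketch short but is quadratic in the piece length.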

bool decode(std::vector<Rank> const &tokens, std::string &output, bool skipSpecialTokens = false) const

Decode token IDs back to text.

Parameters:
  • tokens – Vector of token IDs

  • output – String to store result

  • skipSpecialTokens – Whether to omit special tokens from the decoded text (default: false)

Returns:

true if every token is found in the vocabulary or is skipped; false if an unknown token is encountered and skipSpecialTokens is false
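The decode contract can be sketched as a lookup over inverse (rank to token) tables. This is one reading of the documented return values, not the library's code: the alias RankToToken and the functions decodeSketch and decodeSketchDemo are hypothetical, and treating unknown IDs as skippable when skipSpecialTokens is true is an assumption drawn from the wording of the return description.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed alias: the real Rank definition is not shown in this reference.
using Rank = std::int32_t;
// Hypothetical inverse mapping from rank to token text.
using RankToToken = std::unordered_map<Rank, std::string>;

// Hypothetical helper, not part of TokenEncoder.
bool decodeSketch(RankToToken const& vocab, RankToToken const& specialTokens,
    std::vector<Rank> const& tokens, std::string& output, bool skipSpecialTokens = false)
{
    std::string text;
    for (Rank id : tokens)
    {
        auto special = specialTokens.find(id);
        if (special != specialTokens.end())
        {
            // Special tokens are emitted only when they are not being skipped.
            if (!skipSpecialTokens)
            {
                text += special->second;
            }
            continue;
        }
        auto regular = vocab.find(id);
        if (regular != vocab.end())
        {
            text += regular->second;
            continue;
        }
        // Unknown ID: fail unless skipping is enabled (assumed reading of the contract).
        if (!skipSpecialTokens)
        {
            return false;
        }
    }
    // Output is only written on success in this sketch.
    output = std::move(text);
    return true;
}

// Usage sketch with a toy vocabulary.
void decodeSketchDemo()
{
    RankToToken const vocab{{0, "Hel"}, {1, "lo"}};
    RankToToken const specialTokens{{100, "<eos>"}};
    std::string text;
    bool const ok = decodeSketch(vocab, specialTokens, {0, 1, 100}, text, true);
    assert(ok && text == "Hello");
    (void) ok;
}
```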

inline Type getType() const noexcept

Get encoder type.

Returns:

Encoder algorithm type

inline size_t getVocabSize() const noexcept

Get vocabulary size.

Returns:

Total number of tokens in vocabulary

bool hasToken(std::string const &token) const

Check if token exists in vocabulary.

Parameters:

token – Token string to check

Returns:

true if token exists, false otherwise

Rank getTokenRank(std::string const &token) const

Get token rank from token string.

Parameters:

token – Token string

Returns:

Token rank/ID

std::string getRankToken(Rank rank) const

Get token string from rank.

Parameters:

rank – Token rank/ID

Returns:

Token string
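The lookup functions above can be pictured as a forward map (token to rank) paired with an inverse map (rank to token). The sketch below is a stand-in, not the TokenEncoder class itself: VocabIndexSketch, findRank, findToken, and the two aliases are hypothetical. Because this reference does not say what getTokenRank and getRankToken return for a missing entry, the sketch reports misses through a bool instead. Counting special tokens in the size is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Assumed aliases: the real Rank and TokenToRanks definitions are not shown in this reference.
using Rank = std::int32_t;
using TokenToRanks = std::unordered_map<std::string, Rank>;

// Hypothetical stand-in that mirrors the documented lookups.
class VocabIndexSketch
{
public:
    // Returns false for an empty vocabulary, as documented for initialize().
    bool initialize(TokenToRanks const& vocab, TokenToRanks const& specialTokens = {})
    {
        if (vocab.empty())
        {
            return false;
        }
        mTokenToRank = vocab;
        // insert() keeps existing keys, so vocab entries win on a name clash (sketch choice).
        mTokenToRank.insert(specialTokens.begin(), specialTokens.end());
        mRankToToken.clear();
        for (auto const& entry : mTokenToRank)
        {
            mRankToToken[entry.second] = entry.first;
        }
        return true;
    }

    // Counts regular and special tokens together (assumption).
    std::size_t size() const noexcept
    {
        return mTokenToRank.size();
    }

    bool hasToken(std::string const& token) const
    {
        return mTokenToRank.find(token) != mTokenToRank.end();
    }

    // Forward lookup; returns false and leaves rank untouched on a miss.
    bool findRank(std::string const& token, Rank& rank) const
    {
        auto it = mTokenToRank.find(token);
        if (it == mTokenToRank.end())
        {
            return false;
        }
        rank = it->second;
        return true;
    }

    // Inverse lookup; returns false and leaves token untouched on a miss.
    bool findToken(Rank rank, std::string& token) const
    {
        auto it = mRankToToken.find(rank);
        if (it == mRankToToken.end())
        {
            return false;
        }
        token = it->second;
        return true;
    }

private:
    TokenToRanks mTokenToRank;
    std::unordered_map<Rank, std::string> mRankToToken;
};

// Usage sketch with a toy vocabulary.
void vocabIndexSketchDemo()
{
    VocabIndexSketch index;
    bool const ok = index.initialize({{"a", 0}, {"b", 1}}, {{"<eos>", 2}});
    assert(ok && index.size() == 3 && index.hasToken("<eos>"));
    (void) ok;
}
```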