Token Encoder
class TokenEncoder
Encodes text into token IDs and decodes token IDs back into text, using a selectable tokenization algorithm.
Public Types
Public Functions
TokenEncoder(Type type = BPE)
Constructor for token encoder.
- Parameters:
type – Encoder algorithm type (default: BPE)
~TokenEncoder() = default
bool initialize(TokenToRanks const &vocab, TokenToRanks const &specialTokens = {})
Initialize the encoder with a vocabulary and, optionally, special tokens.
- Parameters:
vocab – Main vocabulary mapping
specialTokens – Special tokens mapping
- Returns:
true if vocab is non-empty and initialization completes; false if vocab is empty
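A minimal sketch of the call pattern for initialize(), assuming that Rank is a 32-bit integer and TokenToRanks is a string-to-rank map (neither alias is defined in this section). StandInEncoder is a hypothetical stand-in that mirrors only the documented return contract so the snippet compiles on its own; it is not the library's TokenEncoder. The helper setUpEncoder is also illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Assumed aliases; the real definitions live in the library headers.
using Rank = std::int32_t;
using TokenToRanks = std::unordered_map<std::string, Rank>;

// Hypothetical stand-in that mimics only the documented return value of
// initialize(). It is NOT the library's TokenEncoder.
class StandInEncoder {
public:
    bool initialize(TokenToRanks const& vocab,
                    TokenToRanks const& specialTokens = {}) {
        if (vocab.empty()) {
            return false;  // documented: an empty vocab yields false
        }
        mVocab = vocab;
        mSpecialTokens = specialTokens;
        return true;
    }

private:
    TokenToRanks mVocab;
    TokenToRanks mSpecialTokens;
};

// Builds a toy vocabulary and reports whether initialization succeeded.
// Written as a template so any type with the documented member works.
template <typename Encoder>
bool setUpEncoder(Encoder& encoder) {
    TokenToRanks vocab{{"a", 0}, {"b", 1}, {"ab", 2}};
    TokenToRanks specialTokens{{"<|endoftext|>", 3}};
    return encoder.initialize(vocab, specialTokens);
}
```

Checking the return value before calling encode() or decode() seems prudent, since an empty vocabulary is reported through the return value and the section does not mention exceptions.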
bool encode(std::string const &piece, std::vector<Rank> &output)
Encode a pretokenized piece of text using the algorithm selected at construction.
- Parameters:
piece – Input text piece (already pretokenized)
output – Vector to store token IDs
- Returns:
true if piece is within size limits and encoding completes successfully; false if piece exceeds 1MB limit or encoding algorithm fails
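The following sketch shows one way to encode several pretokenized pieces while honoring the boolean result of encode(). It assumes Rank is a 32-bit integer and that "1MB" means 1024 * 1024 bytes, inclusive; both are guesses. ByteEncoder is a hypothetical stand-in that emits one ID per byte and is not the library's algorithm; encodePieces is likewise an illustrative helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using Rank = std::int32_t;  // assumed alias

// Assumption: "1MB" means 1024 * 1024 bytes and the limit is inclusive.
constexpr std::size_t kAssumedMaxPieceBytes = 1024 * 1024;

// Hypothetical stand-in: one token ID per input byte. NOT the real BPE.
struct ByteEncoder {
    bool encode(std::string const& piece, std::vector<Rank>& output) {
        if (piece.size() > kAssumedMaxPieceBytes) {
            return false;  // mirrors the documented size-limit failure
        }
        for (unsigned char c : piece) {
            output.push_back(static_cast<Rank>(c));
        }
        return true;
    }
};

// Encodes each piece into a fresh vector and appends it to ids.
// Stops at the first failure; ids then holds the IDs of earlier pieces only.
template <typename Encoder>
bool encodePieces(Encoder& encoder,
                  std::vector<std::string> const& pieces,
                  std::vector<Rank>& ids) {
    for (auto const& piece : pieces) {
        std::vector<Rank> pieceIds;
        if (!encoder.encode(piece, pieceIds)) {
            return false;
        }
        ids.insert(ids.end(), pieceIds.begin(), pieceIds.end());
    }
    return true;
}
```

Using a fresh vector per call avoids depending on whether encode() appends to or overwrites output, which this section does not specify.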
bool decode(std::vector<Rank> const &tokens, std::string &output, bool skipSpecialTokens = false)
Decode token IDs back to text.
- Parameters:
tokens – Vector of token IDs
output – String to store result
skipSpecialTokens – Whether to skip special tokens
- Returns:
true if all tokens are found in vocabulary or skipped successfully; false if unknown tokens are encountered and skipSpecialTokens is false
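A small sketch of calling decode() with and without skipSpecialTokens. StandInDecoder is a hypothetical stand-in with a fixed three-entry table, written to follow the return contract described above; how the real class renders special tokens when they are not skipped is an assumption here. Rank is assumed to be a 32-bit integer, and decodeToText is an illustrative helper.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using Rank = std::int32_t;  // assumed alias

// Hypothetical stand-in with a fixed table. NOT the library's TokenEncoder.
class StandInDecoder {
public:
    bool decode(std::vector<Rank> const& tokens, std::string& output,
                bool skipSpecialTokens = false) const {
        for (Rank id : tokens) {
            if (id == kEndOfText) {
                // Assumption: special tokens are emitted verbatim unless skipped.
                if (!skipSpecialTokens) {
                    output += "<|endoftext|>";
                }
                continue;
            }
            auto it = mTable.find(id);
            if (it == mTable.end()) {
                if (skipSpecialTokens) {
                    continue;  // unknown ID is skipped
                }
                return false;  // unknown ID and nothing is being skipped
            }
            output += it->second;
        }
        return true;
    }

private:
    static constexpr Rank kEndOfText = 2;
    std::unordered_map<Rank, std::string> mTable{{0, "Hello"}, {1, " world"}};
};

// Decodes into text, leaving text empty if decoding fails.
template <typename Decoder>
bool decodeToText(Decoder const& decoder, std::vector<Rank> const& ids,
                  bool skipSpecialTokens, std::string& text) {
    text.clear();
    if (!decoder.decode(ids, text, skipSpecialTokens)) {
        text.clear();  // drop any partial output
        return false;
    }
    return true;
}
```

Clearing the string on failure is a design choice of this helper; the section does not say what output contains when decode() returns false.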
inline size_t getVocabSize() const noexcept
Get vocabulary size.
- Returns:
Total number of tokens in vocabulary
bool hasToken(std::string const &token) const
Check if token exists in vocabulary.
- Parameters:
token – Token string to check
- Returns:
true if token exists, false otherwise
Rank getTokenRank(std::string const &token) const
Get token rank from token string.
- Parameters:
token – Token string
- Returns:
Token rank/ID
std::string getRankToken(Rank rank) const
Get token string from rank.
- Parameters:
rank – Token rank/ID
- Returns:
Token string
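This section does not say what getTokenRank() returns for a token that is missing from the vocabulary, so a cautious caller might check hasToken() first. The sketch below shows that guard. StandInVocab is a hypothetical stand-in with a two-entry table (its behavior for missing entries is a choice made for this example only), Rank is assumed to be a 32-bit integer, and tryGetRank is an illustrative helper.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

using Rank = std::int32_t;  // assumed alias

// Hypothetical stand-in exposing the three documented lookups.
// NOT the library's TokenEncoder.
class StandInVocab {
public:
    bool hasToken(std::string const& token) const {
        return mRanks.count(token) != 0;
    }
    Rank getTokenRank(std::string const& token) const {
        return mRanks.at(token);  // stand-in choice: throws if missing
    }
    std::string getRankToken(Rank rank) const {
        for (auto const& entry : mRanks) {
            if (entry.second == rank) {
                return entry.first;
            }
        }
        return {};  // stand-in choice: empty string if missing
    }

private:
    std::unordered_map<std::string, Rank> mRanks{{"a", 0}, {"b", 1}};
};

// Writes the rank only when the token exists; otherwise leaves rank untouched.
template <typename Encoder>
bool tryGetRank(Encoder const& encoder, std::string const& token, Rank& rank) {
    if (!encoder.hasToken(token)) {
        return false;
    }
    rank = encoder.getTokenRank(token);
    return true;
}
```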