Tokenizer#

class Tokenizer#

Tokenizer class for encoding and decoding text.

Provides tokenization functionality including pretokenization, encoding, and decoding. Supports loading from HuggingFace model directories.

Public Functions

Tokenizer() noexcept#
~Tokenizer() noexcept = default#
std::vector<Rank> encode(
std::string const &text,
bool addBos = false,
bool addEos = false
) const#

Encode text to token IDs.

Parameters:
  • text – Input text to encode

  • addBos – Whether to add beginning-of-sequence token

  • addEos – Whether to add end-of-sequence token

Throws:

std::runtime_error – if tokenization encounters an error

Returns:

Vector of token IDs

std::string decode(
std::vector<Rank> const &tokens,
bool skipSpecialTokens = false
) const#

Decode token IDs back to text.

Parameters:
  • tokens – Vector of token IDs

  • skipSpecialTokens – Whether to skip special tokens in output

Returns:

Decoded text string (well-formed UTF-8; invalid byte sequences are replaced with U+FFFD via sanitizeUtf8Streaming/Flush)
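A minimal round-trip sketch using encode() and decode() as documented above. The header name and the enclosing namespace are assumptions (the namespace is inferred from the free functions trt_edgellm::tokenizer::emitDelta below); adjust to the actual SDK layout.

```cpp
// Hypothetical header name; adjust to the actual SDK layout.
#include "tokenizer.h"

#include <string>
#include <vector>

void roundTrip(trt_edgellm::tokenizer::Tokenizer const& tok)
{
    // Encode with a BOS token prepended; throws std::runtime_error on failure.
    std::vector<Rank> ids = tok.encode("Hello, world!", /*addBos=*/true);

    // Decode back, dropping BOS/EOS/etc. from the output.
    // The result is guaranteed to be well-formed UTF-8.
    std::string text = tok.decode(ids, /*skipSpecialTokens=*/true);
}
```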

std::string idToPiece(Rank token, bool skipSpecialTokens = true) const#

Single-token piece lookup for the streaming hot path.

Forwards to TokenEncoder::getRankToken with the skip-special policy applied. Returns an empty string when the token is a special token and skipSpecialTokens is true, or when the rank is unknown (matching the silent-skip semantics of decode()).

Parameters:
  • token – Token ID (Rank).

  • skipSpecialTokens – Skip special tokens (BOS/EOS/etc).

Returns:

Raw piece bytes (possibly not independently valid UTF-8), or an empty string.

bool loadFromHF(std::filesystem::path const &modelDir)#

Load tokenizer from HuggingFace model directory.

Parameters:

modelDir – Path to the model directory containing tokenizer files

Returns:

true if the directory exists, tokenizer.json is found and parsed successfully, and the pretokenizer and encoder are created; false if the directory doesn’t exist, tokenizer.json is missing or corrupt, or initialization fails
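A typical initialization flow, sketched from the contract above. The model directory path is a placeholder, and the error handling is illustrative.

```cpp
#include <filesystem>
#include <iostream>

bool initTokenizer(Tokenizer& tok, std::filesystem::path const& modelDir)
{
    // Fails (returns false) if the directory is missing,
    // tokenizer.json is absent/corrupt, or initialization fails.
    if (!tok.loadFromHF(modelDir))
    {
        std::cerr << "Failed to load tokenizer from " << modelDir << "\n";
        return false;
    }

    // After a successful load, the accessors below are valid.
    int vocabSize = tok.getNumVocab();
    Rank bos = tok.getBosId();
    Rank eos = tok.getEosId();
    return tok.isInitialized();
}
```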

inline int getNumVocab() const noexcept#

Get total vocabulary size.

Returns:

Number of tokens in vocabulary

inline Rank getBosId() const noexcept#

Get beginning-of-sequence token ID.

Returns:

BOS token ID

inline Rank getEosId() const noexcept#

Get end-of-sequence token ID.

Returns:

EOS token ID

inline Rank getPadId() const noexcept#

Get padding token ID.

Returns:

PAD token ID (returns EOS if PAD is not set)

inline Rank getUnkId() const noexcept#

Get unknown token ID.

Returns:

UNK token ID

bool isInitialized() const noexcept#

Check if tokenizer is properly initialized.

Returns:

true if initialized, false otherwise

bool loadChatTemplate(std::filesystem::path const &chatTemplateFile)#

Load chat template configuration from JSON file.

Parameters:

chatTemplateFile – Path to the processed_chat_template.json file

Returns:

true if chat template is loaded successfully; false if file doesn’t exist or parsing fails

bool applyChatTemplate(
rt::LLMGenerationRequest::Request const &request,
rt::LLMGenerationRequest::FormattedRequest &formattedRequest,
bool applyChatTemplate = true,
bool addGenerationPrompt = true,
bool enableThinking = false
) const#

Apply chat template to a request.

Parameters:
  • request – Request object containing messages

  • formattedRequest – Output formatted request object that will be populated

  • applyChatTemplate – Whether to apply full chat template formatting (with special tokens) or raw concatenation

  • addGenerationPrompt – Whether to add generation prompt at the end (only used when applyChatTemplate is true)

  • enableThinking – Whether to enable thinking mode for models that support it

Returns:

true if the chat template is applied successfully; false if an error is encountered
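A hedged sketch combining loadChatTemplate() and applyChatTemplate(). The request construction is illustrative (see rt::LLMGenerationRequest for the actual fields); loadChatTemplate() must have succeeded with a processed_chat_template.json before applying the template.

```cpp
bool formatPrompt(Tokenizer const& tok,
                  rt::LLMGenerationRequest::Request const& request)
{
    rt::LLMGenerationRequest::FormattedRequest formatted;

    // Full template formatting (role prefixes/suffixes, special tokens),
    // with a trailing generation prompt and thinking mode disabled.
    if (!tok.applyChatTemplate(request, formatted,
                               /*applyChatTemplate=*/true,
                               /*addGenerationPrompt=*/true,
                               /*enableThinking=*/false))
    {
        return false; // template not loaded or malformed messages
    }

    // `formatted` now holds the rendered prompt.
    return true;
}
```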

inline std::string getDefaultSystemPrompt() const noexcept#

Get default system prompt from chat template.

Returns:

Default system prompt string

struct ChatTemplateRole#

Chat template role configuration.

Public Members

std::string prefix#

Prefix for this role.

std::string suffix#

Suffix for this role.

struct ChatTemplateContentType#

Chat template content type configuration.

Public Members

std::string format#

Format string for this content type.

struct ChatTemplateConfig#

Chat template configuration.

Public Members

std::string modelPath#

Model path or identifier.

std::unordered_map<std::string, ChatTemplateRole> roles#

Role configurations (system, user, assistant)

std::unordered_map<std::string, ChatTemplateContentType> contentTypes#

Content type configurations (text, image, video)

std::string generationPrompt#

Standard generation prompt (thinking disabled)

std::string generationPromptThinking#

Generation prompt with thinking enabled (optional, model-specific)

std::string defaultSystemPrompt#

Default system prompt.

struct textPartition#

Text partition representation for tokenization.

Represents either a special token or a raw text segment to be tokenized.

Public Functions

inline textPartition(Rank _token) noexcept#

Constructor for special token partition.

Parameters:

_token – Token ID for the special token

inline textPartition(
std::string const &_rawText,
int _offset,
int _length
)#

Constructor for raw text partition.

Parameters:
  • _rawText – Reference to the raw text string

  • _offset – Offset into the raw text string

  • _length – Length of the text partition

Public Members

const TEXT_PART_TYPE type#

Type of partition (special token or raw text)

Rank const token#

Token ID (valid when type is TEXT_PART_SPECIAL_TOKEN)

std::string const _dummy#

Dummy string for special token partitions.

std::string const &rawText#

Reference to raw text (valid when type is TEXT_PART_RAW_TEXT)

int const offset#

Offset into rawText.

int const length#

Length of the partition.
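To illustrate how a prompt is split into textPartition entries, here is a simplified, self-contained sketch. The Part struct and the single-special-token splitter are illustrative stand-ins (the real partitioning lives inside Tokenizer and handles the full special-token set), but they mirror the documented layout: special-token parts carry a Rank, raw-text parts carry an (offset, length) into the original string.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative stand-ins mirroring textPartition's documented members.
using Rank = int32_t;
enum TextPartType { TEXT_PART_SPECIAL_TOKEN, TEXT_PART_RAW_TEXT };

struct Part
{
    TextPartType type;
    Rank token; // valid for TEXT_PART_SPECIAL_TOKEN
    int offset; // valid for TEXT_PART_RAW_TEXT
    int length;
};

// Split `text` around one special-token literal (e.g. "<|eot|>" -> id 2).
// Raw segments are stored as (offset, length) into `text`, matching the
// reference-plus-offset layout of textPartition.
std::vector<Part> partition(std::string const& text,
                            std::string const& special, Rank specialId)
{
    std::vector<Part> parts;
    size_t pos = 0;
    while (pos < text.size())
    {
        size_t hit = text.find(special, pos);
        if (hit == std::string::npos)
            hit = text.size();
        if (hit > pos) // raw text before the next special token
            parts.push_back({TEXT_PART_RAW_TEXT, -1,
                             static_cast<int>(pos), static_cast<int>(hit - pos)});
        if (hit < text.size()) // the special token itself
        {
            parts.push_back({TEXT_PART_SPECIAL_TOKEN, specialId, 0, 0});
            pos = hit + special.size();
        }
        else
            pos = hit;
    }
    return parts;
}
```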

std::string trt_edgellm::tokenizer::emitDelta(
rt::SlotStreamState &s,
Tokenizer const &tok,
std::vector<int32_t> const &allTokenIds,
bool skipSpecial
)#

Streaming emit: consume newly-arrived token ids and produce valid UTF-8 delta text.

Looks up the piece bytes for each token in allTokenIds[s.sentTokenCount..end), prepends s.pendingBytes, and passes the concatenated buffer through sanitizeUtf8Streaming. Invalid byte sequences become U+FFFD; trailing incomplete codepoints are held in s.pendingBytes for the next call. s.sentTokenCount is advanced unconditionally.

Contract:

  • Must be called once per iteration per slot, after the iteration has appended its tokens.

  • Output is always well-formed UTF-8.

  • Works for both vanilla (1 new token) and spec-decode (N new tokens) paths.

Parameters:
  • s – Slot state — modified in place.

  • tok – Tokenizer used for piece lookup (Tokenizer::idToPiece).

  • allTokenIds – Full token id sequence for this slot.

  • skipSpecial – Skip special tokens (true for consumer-facing streaming).

Returns:

Delta text (may be empty if all new bytes were held as incomplete).

std::string trt_edgellm::tokenizer::emitDeltaFlush(
rt::SlotStreamState &s
)#

Final-iteration flush: convert any held incomplete bytes to U+FFFD and clear.

Called once per slot at finish time when finishedStates[i] flips to 1. Output is well-formed UTF-8 (one U+FFFD per held byte).
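The sanitize/flush contract shared by emitDelta() and emitDeltaFlush() can be sketched self-contained. The function names and the `pending` buffer are illustrative (not the real trt_edgellm implementation): `pending` plays the role of SlotStreamState::pendingBytes, holding back the bytes of a trailing incomplete codepoint so that every emitted delta is well-formed UTF-8, with invalid sequences replaced by U+FFFD. (Overlong and surrogate encodings are ignored here for brevity.)

```cpp
#include <cstddef>
#include <string>

// Expected UTF-8 sequence length from the lead byte; 0 for an invalid
// lead or a stray continuation byte.
inline int utf8SeqLen(unsigned char b)
{
    if (b < 0x80)          return 1; // ASCII
    if ((b & 0xE0) == 0xC0) return 2; // 2-byte lead
    if ((b & 0xF0) == 0xE0) return 3; // 3-byte lead
    if ((b & 0xF8) == 0xF0) return 4; // 4-byte lead
    return 0;
}

// Streaming sanitize: prepend held bytes, emit well-formed UTF-8,
// replace invalid bytes with U+FFFD (\xEF\xBF\xBD), and hold a trailing
// incomplete codepoint in `pending` for the next call.
std::string sanitizeStreaming(std::string const& chunk, std::string& pending)
{
    std::string buf = pending + chunk;
    pending.clear();
    std::string out;
    size_t i = 0;
    while (i < buf.size())
    {
        int len = utf8SeqLen(static_cast<unsigned char>(buf[i]));
        if (len == 0) { out += "\xEF\xBF\xBD"; ++i; continue; } // invalid byte
        if (i + len > buf.size()) { pending = buf.substr(i); break; } // hold tail
        bool ok = true;
        for (int k = 1; k < len; ++k)
            if ((static_cast<unsigned char>(buf[i + k]) & 0xC0) != 0x80)
                ok = false;
        if (ok) { out.append(buf, i, static_cast<size_t>(len)); i += len; }
        else    { out += "\xEF\xBF\xBD"; ++i; } // bad continuation byte
    }
    return out;
}

// Final-iteration flush: one U+FFFD per held byte, mirroring emitDeltaFlush().
std::string flushPending(std::string& pending)
{
    std::string out;
    for (size_t k = 0; k < pending.size(); ++k)
        out += "\xEF\xBF\xBD";
    pending.clear();
    return out;
}
```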