Tokenizer#

class Tokenizer#

Tokenizer class for encoding and decoding text.

Provides tokenization functionality including pretokenization, encoding, and decoding. Supports loading from HuggingFace model directories.

Public Functions

Tokenizer() noexcept#
~Tokenizer() noexcept = default#
std::vector<Rank> encode(
std::string const &text,
bool addBos = false,
bool addEos = false
) const#

Encode text to token IDs.

Parameters:
  • text – Input text to encode

  • addBos – Whether to add beginning-of-sequence token

  • addEos – Whether to add end-of-sequence token

Throws:

std::runtime_error – if tokenization encounters an error

Returns:

Vector of token IDs

std::string decode(
std::vector<Rank> const &tokens,
bool skipSpecialTokens = false
) const#

Decode token IDs back to text.

Parameters:
  • tokens – Vector of token IDs

  • skipSpecialTokens – Whether to skip special tokens in output

Returns:

Decoded text string (well-formed UTF-8; invalid byte sequences are replaced with U+FFFD via sanitizeUtf8Streaming/Flush)
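A minimal round-trip sketch using encode() and decode() as documented above. The header name and the enclosing namespace are assumptions (the namespace is inferred from the free functions trt_edgellm::tokenizer::emitDelta below); adjust to the actual SDK layout.

```cpp
// Hypothetical header name; adjust to the actual SDK layout.
#include "tokenizer.h"

#include <string>
#include <vector>

void roundTrip(trt_edgellm::tokenizer::Tokenizer const& tok)
{
    // Encode with a BOS token prepended; throws std::runtime_error on failure.
    std::vector<Rank> ids = tok.encode("Hello, world!", /*addBos=*/true);

    // Decode back, dropping BOS/EOS/etc. from the output.
    // The result is guaranteed to be well-formed UTF-8.
    std::string text = tok.decode(ids, /*skipSpecialTokens=*/true);
}
```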

std::string idToPiece(Rank token, bool skipSpecialTokens = true) const#

Single-token piece lookup for the streaming hot path.

Forwards to TokenEncoder::getRankToken with the skip-special policy applied. Returns an empty string when the token is a special token and skipSpecialTokens is true, or when the rank is unknown (matching the silent-skip semantics of decode()).

Parameters:
  • token – Token ID (Rank).

  • skipSpecialTokens – Skip special tokens (BOS/EOS/etc).

Returns:

Raw piece bytes (possibly not independently valid UTF-8), or an empty string.

bool loadFromHF(std::filesystem::path const &modelDir)#

Load tokenizer from HuggingFace model directory.

Parameters:

modelDir – Path to the model directory containing tokenizer files

Returns:

true if the directory exists, tokenizer.json is found and parsed successfully, and the pretokenizer and encoder are created; false if the directory doesn’t exist, tokenizer.json is missing or corrupt, or initialization fails
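A typical initialization flow, sketched from the contract above. The model directory path is a placeholder, and the error handling is illustrative.

```cpp
#include <filesystem>
#include <iostream>

bool initTokenizer(Tokenizer& tok, std::filesystem::path const& modelDir)
{
    // Fails (returns false) if the directory is missing,
    // tokenizer.json is absent/corrupt, or initialization fails.
    if (!tok.loadFromHF(modelDir))
    {
        std::cerr << "Failed to load tokenizer from " << modelDir << "\n";
        return false;
    }

    // After a successful load, the accessors below are valid.
    int vocabSize = tok.getNumVocab();
    Rank bos = tok.getBosId();
    Rank eos = tok.getEosId();
    return tok.isInitialized();
}
```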

inline int getNumVocab() const noexcept#

Get total vocabulary size.

Returns:

Number of tokens in vocabulary

inline Rank getBosId() const noexcept#

Get beginning-of-sequence token ID.

Returns:

BOS token ID

inline Rank getEosId() const noexcept#

Get end-of-sequence token ID.

Returns:

EOS token ID

inline Rank getPadId() const noexcept#

Get padding token ID.

Returns:

PAD token ID (returns EOS if PAD is not set)

inline Rank getUnkId() const noexcept#

Get unknown token ID.

Returns:

UNK token ID

bool isInitialized() const noexcept#

Check if tokenizer is properly initialized.

Returns:

true if initialized, false otherwise

bool loadChatTemplate(std::filesystem::path const &chatTemplateFile)#

Load chat template configuration from JSON file.

Parameters:

chatTemplateFile – Path to the processed_chat_template.json file

Returns:

true if chat template is loaded successfully; false if file doesn’t exist or parsing fails

bool applyChatTemplate(
rt::LLMGenerationRequest::Request const &request,
rt::LLMGenerationRequest::FormattedRequest &formattedRequest,
bool applyChatTemplate = true,
bool addGenerationPrompt = true,
bool enableThinking = false
) const#

Apply chat template to a request.

Parameters:
  • request – Request object containing messages

  • formattedRequest – Output formatted request object that will be populated

  • applyChatTemplate – Whether to apply full chat template formatting (with special tokens) or raw concatenation

  • addGenerationPrompt – Whether to add generation prompt at the end (only used when applyChatTemplate is true)

  • enableThinking – Whether to enable thinking mode for models that support it

Returns:

true if the chat template is applied successfully; false if an error is encountered
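A hedged sketch combining loadChatTemplate() and applyChatTemplate(). The request construction is illustrative (see rt::LLMGenerationRequest for the actual fields); loadChatTemplate() must have succeeded with a processed_chat_template.json before applying the template.

```cpp
bool formatPrompt(Tokenizer const& tok,
                  rt::LLMGenerationRequest::Request const& request)
{
    rt::LLMGenerationRequest::FormattedRequest formatted;

    // Full template formatting (role prefixes/suffixes, special tokens),
    // with a trailing generation prompt and thinking mode disabled.
    if (!tok.applyChatTemplate(request, formatted,
                               /*applyChatTemplate=*/true,
                               /*addGenerationPrompt=*/true,
                               /*enableThinking=*/false))
    {
        return false; // template not loaded or malformed messages
    }

    // `formatted` now holds the rendered prompt.
    return true;
}
```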

inline std::string getDefaultSystemPrompt() const noexcept#

Get default system prompt from chat template.

Returns:

Default system prompt string

struct ChatTemplateRole#

Chat template role configuration.

Public Members

std::string prefix#

Prefix for this role.

std::string suffix#

Suffix for this role.

struct ChatTemplateContentType#

Chat template content type configuration.

Public Members

std::string format#

Format string for this content type.

struct ChatTemplateConfig#

Chat template configuration.

Public Members

std::string modelPath#

Model path or identifier.

std::unordered_map<std::string, ChatTemplateRole> roles#

Role configurations (system, user, assistant)

std::unordered_map<std::string, ChatTemplateContentType> contentTypes#

Content type configurations (text, image, video)

std::string generationPrompt#

Standard generation prompt (thinking disabled)

std::string generationPromptThinking#

Generation prompt with thinking enabled (optional, model-specific)

std::string defaultSystemPrompt#

Default system prompt.

struct textPartition#

Text partition representation for tokenization.

Represents either a special token or a raw text segment to be tokenized.

Public Functions

inline textPartition(Rank _token) noexcept#

Constructor for special token partition.

Parameters:

_token – Token ID for the special token

inline textPartition(
std::string const &_rawText,
int _offset,
int _length
)#

Constructor for raw text partition.

Parameters:
  • _rawText – Reference to the raw text string

  • _offset – Offset into the raw text string

  • _length – Length of the text partition

Public Members

const TEXT_PART_TYPE type#

Type of partition (special token or raw text)

Rank const token#

Token ID (valid when type is TEXT_PART_SPECIAL_TOKEN)

std::string const _dummy#

Dummy string for special token partitions.

std::string const &rawText#

Reference to raw text (valid when type is TEXT_PART_RAW_TEXT)

int const offset#

Offset into rawText.

int const length#

Length of the partition.
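To illustrate how a prompt is split into textPartition entries, here is a simplified, self-contained sketch. The Part struct and the single-special-token splitter are illustrative stand-ins (the real partitioning lives inside Tokenizer and handles the full special-token set), but they mirror the documented layout: special-token parts carry a Rank, raw-text parts carry an (offset, length) into the original string.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative stand-ins mirroring textPartition's documented members.
using Rank = int32_t;
enum TextPartType { TEXT_PART_SPECIAL_TOKEN, TEXT_PART_RAW_TEXT };

struct Part
{
    TextPartType type;
    Rank token; // valid for TEXT_PART_SPECIAL_TOKEN
    int offset; // valid for TEXT_PART_RAW_TEXT
    int length;
};

// Split `text` around one special-token literal (e.g. "<|eot|>" -> id 2).
// Raw segments are stored as (offset, length) into `text`, matching the
// reference-plus-offset layout of textPartition.
std::vector<Part> partition(std::string const& text,
                            std::string const& special, Rank specialId)
{
    std::vector<Part> parts;
    size_t pos = 0;
    while (pos < text.size())
    {
        size_t hit = text.find(special, pos);
        if (hit == std::string::npos)
            hit = text.size();
        if (hit > pos) // raw text before the next special token
            parts.push_back({TEXT_PART_RAW_TEXT, -1,
                             static_cast<int>(pos), static_cast<int>(hit - pos)});
        if (hit < text.size()) // the special token itself
        {
            parts.push_back({TEXT_PART_SPECIAL_TOKEN, specialId, 0, 0});
            pos = hit + special.size();
        }
        else
            pos = hit;
    }
    return parts;
}
```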

std::string trt_edgellm::tokenizer::emitDelta(
rt::SlotStreamState &s,
Tokenizer const &tok,
std::vector<int32_t> const &allTokenIds,
bool skipSpecial
)#

Streaming emit: consume newly-arrived token ids and produce valid UTF-8 delta text.

Looks up the piece bytes for each token in allTokenIds[s.sentTokenCount..end), prepends s.pendingBytes, and passes the concatenated buffer through sanitizeUtf8Streaming. Invalid byte sequences become U+FFFD; trailing incomplete codepoints are held in s.pendingBytes for the next call. s.sentTokenCount is advanced unconditionally.

Contract:

  • Must be called once per iteration per slot, after the iteration has appended its tokens.

  • Output is always well-formed UTF-8.

  • Works for both vanilla (1 new token) and spec-decode (N new tokens) paths.

Parameters:
  • s – Slot state — modified in place.

  • tok – Tokenizer used for piece lookup (Tokenizer::idToPiece).

  • allTokenIds – Full token id sequence for this slot.

  • skipSpecial – Skip special tokens (true for consumer-facing streaming).

Returns:

Delta text (may be empty if all new bytes were held as incomplete).

std::string trt_edgellm::tokenizer::emitDeltaFlush(
rt::SlotStreamState &s
)#

Final-iteration flush: convert any held incomplete bytes to U+FFFD and clear.

Called once per slot at finish time when finishedStates[i] flips to 1. Output is well-formed UTF-8 (one U+FFFD per held byte).
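The sanitize/flush contract shared by emitDelta() and emitDeltaFlush() can be sketched self-contained. The function names and the `pending` buffer are illustrative (not the real trt_edgellm implementation): `pending` plays the role of SlotStreamState::pendingBytes, holding back the bytes of a trailing incomplete codepoint so that every emitted delta is well-formed UTF-8, with invalid sequences replaced by U+FFFD. (Overlong and surrogate encodings are ignored here for brevity.)

```cpp
#include <cstddef>
#include <string>

// Expected UTF-8 sequence length from the lead byte; 0 for an invalid
// lead or a stray continuation byte.
inline int utf8SeqLen(unsigned char b)
{
    if (b < 0x80)          return 1; // ASCII
    if ((b & 0xE0) == 0xC0) return 2; // 2-byte lead
    if ((b & 0xF0) == 0xE0) return 3; // 3-byte lead
    if ((b & 0xF8) == 0xF0) return 4; // 4-byte lead
    return 0;
}

// Streaming sanitize: prepend held bytes, emit well-formed UTF-8,
// replace invalid bytes with U+FFFD (\xEF\xBF\xBD), and hold a trailing
// incomplete codepoint in `pending` for the next call.
std::string sanitizeStreaming(std::string const& chunk, std::string& pending)
{
    std::string buf = pending + chunk;
    pending.clear();
    std::string out;
    size_t i = 0;
    while (i < buf.size())
    {
        int len = utf8SeqLen(static_cast<unsigned char>(buf[i]));
        if (len == 0) { out += "\xEF\xBF\xBD"; ++i; continue; } // invalid byte
        if (i + len > buf.size()) { pending = buf.substr(i); break; } // hold tail
        bool ok = true;
        for (int k = 1; k < len; ++k)
            if ((static_cast<unsigned char>(buf[i + k]) & 0xC0) != 0x80)
                ok = false;
        if (ok) { out.append(buf, i, static_cast<size_t>(len)); i += len; }
        else    { out += "\xEF\xBF\xBD"; ++i; } // bad continuation byte
    }
    return out;
}

// Final-iteration flush: one U+FFFD per held byte, mirroring emitDeltaFlush().
std::string flushPending(std::string& pending)
{
    std::string out;
    for (size_t k = 0; k < pending.size(); ++k)
        out += "\xEF\xBF\xBD";
    pending.clear();
    return out;
}
```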