Tokenizer#
-
class Tokenizer#
Tokenizer class for encoding and decoding text.
Provides pretokenization, encoding, and decoding, and supports loading from HuggingFace model directories.
Public Functions
-
Tokenizer()#
-
~Tokenizer() = default#
-
std::vector<Rank> encode(std::string const &text, bool addBos = false, bool addEos = false)#
Encode text to token IDs.
- Parameters:
text – Input text to encode
addBos – Whether to add beginning-of-sequence token
addEos – Whether to add end-of-sequence token
- Returns:
Vector of token IDs
-
std::string decode(std::vector<Rank> const &tokens, bool skipSpecialTokens = false)#
Decode token IDs back to text.
- Parameters:
tokens – Vector of token IDs
skipSpecialTokens – Whether to skip special tokens in output
- Returns:
Decoded text string
-
bool loadFromHF(std::filesystem::path const &modelDir)#
Load tokenizer from HuggingFace model directory.
- Parameters:
modelDir – Path to the model directory containing tokenizer files
- Returns:
true if the directory exists, tokenizer.json is found and parsed successfully, and the pretokenizer and encoder are created successfully; false if the directory doesn't exist, tokenizer.json is missing/corrupt, or initialization fails
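A minimal end-to-end sketch using loadFromHF, encode, and decode. The header name and model path are assumptions for illustration, not part of the documented API; Rank is assumed to be brought in by the same header:

    #include <filesystem>
    #include <iostream>
    #include <string>
    #include <vector>

    #include "Tokenizer.h"  // assumed header name

    int main() {
        Tokenizer tok;
        // Illustrative path; the directory must contain tokenizer.json (see loadFromHF above).
        if (!tok.loadFromHF(std::filesystem::path{"/models/my-model"})) {
            std::cerr << "tokenizer initialization failed\n";
            return 1;
        }

        // Encode with a BOS token prepended, then decode back, dropping special tokens.
        std::vector<Rank> ids = tok.encode("Hello, world!", /*addBos=*/true);
        std::string text = tok.decode(ids, /*skipSpecialTokens=*/true);

        std::cout << ids.size() << " tokens -> \"" << text << "\"\n";
        return 0;
    }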
-
inline int getNumVocab() const noexcept#
Get total vocabulary size.
- Returns:
Number of tokens in vocabulary
-
inline Rank getBosId() const noexcept#
Get beginning-of-sequence token ID.
- Returns:
BOS token ID
-
inline Rank getEosId() const noexcept#
Get end-of-sequence token ID.
- Returns:
EOS token ID
-
inline Rank getPadId() const noexcept#
Get padding token ID.
- Returns:
PAD token ID (returns EOS if PAD is not set)
-
inline Rank getUnkId() const noexcept#
Get unknown token ID.
- Returns:
UNK token ID
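Continuing the sketch above with a loaded `tok`, the accessors can be queried directly (this assumes Rank is an integral type that can be streamed):

    // Assumes `tok` was loaded successfully, as in the sketch above.
    std::cout << "vocab size: " << tok.getNumVocab() << "\n"
              << "BOS=" << tok.getBosId() << " EOS=" << tok.getEosId()
              << " PAD=" << tok.getPadId()   // falls back to EOS when PAD is unset
              << " UNK=" << tok.getUnkId() << "\n";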
-
bool isInitialized() const noexcept#
Check if tokenizer is properly initialized.
- Returns:
true if initialized, false otherwise
-
bool loadChatTemplate(std::filesystem::path const &chatTemplateFile)#
Load chat template configuration from JSON file.
- Parameters:
chatTemplateFile – Path to the processed_chat_template.json file
- Returns:
true if chat template is loaded successfully; false if file doesn’t exist or parsing fails
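A hedged sketch of loading the template alongside the tokenizer; the directory is illustrative, while the file name follows the convention documented above:

    // Assumes `tok` was loaded via loadFromHF, as in the earlier sketch.
    std::filesystem::path tmpl = "/models/my-model/processed_chat_template.json";
    if (!tok.loadChatTemplate(tmpl)) {
        // No template: requests can still be formatted with raw concatenation
        // by calling applyChatTemplate(..., /*applyChatTemplate=*/false, ...).
        std::cerr << "chat template not loaded\n";
    }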
-
bool applyChatTemplate(rt::LLMGenerationRequest::Request const &request, rt::LLMGenerationRequest::FormattedRequest &formattedRequest, bool applyChatTemplate = true, bool addGenerationPrompt = true, bool enableThinking = false)#
Apply chat template to a request.
- Parameters:
request – Request object containing messages
formattedRequest – Output formatted request object that will be populated
applyChatTemplate – Whether to apply full chat template formatting (with special tokens) or raw concatenation
addGenerationPrompt – Whether to add generation prompt at the end (only used when applyChatTemplate is true)
enableThinking – Whether to enable thinking mode for models that support it
- Returns:
true if the chat template is applied successfully; false if errors are encountered
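A sketch wrapping the common flag combination. Populating `Request` is left out because its fields are not documented in this section, and the helper name is hypothetical:

    // Hypothetical helper: apply full template formatting with a generation prompt.
    bool formatForGeneration(Tokenizer &tok,
                             rt::LLMGenerationRequest::Request const &request,
                             rt::LLMGenerationRequest::FormattedRequest &out,
                             bool thinking = false) {
        return tok.applyChatTemplate(request, out,
                                     /*applyChatTemplate=*/true,
                                     /*addGenerationPrompt=*/true,
                                     /*enableThinking=*/thinking);
    }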
-
inline std::string getDefaultSystemPrompt() const noexcept#
Get default system prompt from chat template.
- Returns:
Default system prompt string
-
struct ChatTemplateRole#
Chat template role configuration.
-
struct ChatTemplateContentType#
Chat template content type configuration.
Public Members
-
std::string format#
Format string for this content type.
-
struct ChatTemplateConfig#
Chat template configuration.
Public Members
-
std::string modelPath#
Model path or identifier.
-
std::unordered_map<std::string, ChatTemplateRole> roles#
Role configurations (system, user, assistant)
-
std::unordered_map<std::string, ChatTemplateContentType> contentTypes#
Content type configurations (text, image, video)
-
std::string generationPrompt#
Standard generation prompt (thinking disabled)
-
std::string generationPromptThinking#
Generation prompt with thinking enabled (optional, model-specific)
-
std::string defaultSystemPrompt#
Default system prompt.
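For illustration only, a hand-populated config using the documented members; the values and placeholder syntax are assumptions, since real configs are model-specific and normally loaded from processed_chat_template.json. The listing suggests these structs are nested in Tokenizer:

    Tokenizer::ChatTemplateConfig cfg;
    cfg.modelPath = "my-model";                               // illustrative identifier
    cfg.contentTypes["text"].format = "{content}";            // assumed placeholder syntax
    cfg.generationPrompt = "<|assistant|>\n";                 // illustrative, model-specific
    cfg.defaultSystemPrompt = "You are a helpful assistant."; // illustrative
    // ChatTemplateRole's members are not documented here, so roles are left empty.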
-
struct textPartition#
Text partition representation for tokenization.
Represents either a special token or a raw text segment to be tokenized.
Public Functions
-
inline textPartition(Rank _token)#
Constructor for special token partition.
- Parameters:
_token – Token ID for the special token
-
inline textPartition(std::string const &_rawText, int _offset, int _length)#
Constructor for raw text partition.
- Parameters:
_rawText – Reference to the raw text string
_offset – Offset into the raw text string
_length – Length of the text partition
Public Members
-
const TEXT_PART_TYPE type#
Type of partition (special token or raw text)
-
Rank const token#
Token ID (valid when type is TEXT_PART_SPECIAL_TOKEN)
-
std::string const _dummy#
Dummy string for special token partitions.
-
std::string const &rawText#
Reference to raw text (valid when type is TEXT_PART_RAW_TEXT)
-
int const offset#
Offset into rawText.
-
int const length#
Length of the partition.
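A construction sketch; the token ID and offsets are illustrative. Note that rawText is held by reference, so the source string must outlive the partition:

    std::string text = "Hello, world!";

    // Special-token partition: carries only a token ID (illustrative value).
    Tokenizer::textPartition bos(Rank{1});

    // Raw-text partition: references the "world" slice of `text` (offset 7, length 5).
    Tokenizer::textPartition word(text, /*_offset=*/7, /*_length=*/5);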