Tokenizer#

class Tokenizer#

Tokenizer class for encoding and decoding text.

Provides tokenization functionality including pretokenization, encoding, and decoding. Supports loading from HuggingFace model directories.

Public Functions

Tokenizer()#

Default constructor. Constructs an uninitialized tokenizer; call loadFromHF() before use.

~Tokenizer() = default#

Defaulted destructor.
std::vector<Rank> encode(
std::string const &text,
bool addBos = false,
bool addEos = false
) const#

Encode text to token IDs.

Parameters:
  • text – Input text to encode

  • addBos – Whether to add beginning-of-sequence token

  • addEos – Whether to add end-of-sequence token

Returns:

Vector of token IDs

std::string decode(
std::vector<Rank> const &tokens,
bool skipSpecialTokens = false
) const#

Decode token IDs back to text.

Parameters:
  • tokens – Vector of token IDs

  • skipSpecialTokens – Whether to skip special tokens in output

Returns:

Decoded text string
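A round-trip usage sketch for encode/decode. This assumes a Tokenizer that has already been initialized (e.g. via loadFromHF(), documented below); the input string is illustrative only.

```cpp
// Sketch: encode text to token IDs, then decode back.
// Assumes `tokenizer` is an initialized Tokenizer instance.
std::vector<Rank> ids = tokenizer.encode("Hello, world!",
                                         /*addBos=*/true,
                                         /*addEos=*/false);
std::string text = tokenizer.decode(ids, /*skipSpecialTokens=*/true);
// With skipSpecialTokens=true, the BOS token added during encoding is
// dropped, so `text` should match the original input.
```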

bool loadFromHF(std::filesystem::path const &modelDir)#

Load tokenizer from HuggingFace model directory.

Parameters:

modelDir – Path to the model directory containing tokenizer files

Returns:

true if the directory exists, tokenizer.json is found and parsed, and the pretokenizer and encoder are created successfully; false if the directory doesn't exist, tokenizer.json is missing or corrupt, or initialization fails
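A loading sketch. The model directory path is a placeholder; isInitialized() (documented below) can be used as an additional sanity check after loading.

```cpp
// Sketch: load tokenizer files from a HuggingFace-style model directory.
Tokenizer tokenizer;
if (!tokenizer.loadFromHF("/path/to/model")) {
    // Directory missing, tokenizer.json absent/corrupt, or the
    // pretokenizer/encoder could not be created.
    return;
}
assert(tokenizer.isInitialized());
```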

inline int getNumVocab() const noexcept#

Get total vocabulary size.

Returns:

Number of tokens in vocabulary

inline Rank getBosId() const noexcept#

Get beginning-of-sequence token ID.

Returns:

BOS token ID

inline Rank getEosId() const noexcept#

Get end-of-sequence token ID.

Returns:

EOS token ID

inline Rank getPadId() const noexcept#

Get padding token ID.

Returns:

PAD token ID (returns EOS if PAD is not set)

inline Rank getUnkId() const noexcept#

Get unknown token ID.

Returns:

UNK token ID

bool isInitialized() const noexcept#

Check if tokenizer is properly initialized.

Returns:

true if initialized, false otherwise

bool loadChatTemplate(std::filesystem::path const &chatTemplateFile)#

Load chat template configuration from JSON file.

Parameters:

chatTemplateFile – Path to the processed_chat_template.json file

Returns:

true if the chat template is loaded successfully; false if the file doesn't exist or parsing fails

bool applyChatTemplate(
rt::LLMGenerationRequest::Request const &request,
rt::LLMGenerationRequest::FormattedRequest &formattedRequest,
bool applyChatTemplate = true,
bool addGenerationPrompt = true,
bool enableThinking = false
) const#

Apply chat template to a request.

Parameters:
  • request – Request object containing messages

  • formattedRequest – Output formatted request object that will be populated

  • applyChatTemplate – Whether to apply full chat template formatting (with special tokens) or raw concatenation

  • addGenerationPrompt – Whether to add generation prompt at the end (only used when applyChatTemplate is true)

  • enableThinking – Whether to enable thinking mode for models that support it

Returns:

true if the chat template is applied successfully; false if an error is encountered
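A call-site sketch. The fields of rt::LLMGenerationRequest::Request are not shown in this document, so the request is assumed to have been populated elsewhere; only the call itself is illustrated.

```cpp
// Sketch: apply the loaded chat template to a populated request.
// Assumes `tokenizer` has loaded a template via loadChatTemplate() and
// `request` is a populated rt::LLMGenerationRequest::Request.
rt::LLMGenerationRequest::FormattedRequest formatted;
bool ok = tokenizer.applyChatTemplate(
    request,
    formatted,                     // output, populated on success
    /*applyChatTemplate=*/true,    // full template, with special tokens
    /*addGenerationPrompt=*/true,  // cue the model to respond
    /*enableThinking=*/false);
```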

inline std::string getDefaultSystemPrompt() const noexcept#

Get default system prompt from chat template.

Returns:

Default system prompt string

struct ChatTemplateRole#

Chat template role configuration.

Public Members

std::string prefix#

Prefix for this role.

std::string suffix#

Suffix for this role.

struct ChatTemplateContentType#

Chat template content type configuration.

Public Members

std::string format#

Format string for this content type.

struct ChatTemplateConfig#

Chat template configuration.

Public Members

std::string modelPath#

Model path or identifier.

std::unordered_map<std::string, ChatTemplateRole> roles#

Role configurations (system, user, assistant)

std::unordered_map<std::string, ChatTemplateContentType> contentTypes#

Content type configurations (text, image, video)

std::string generationPrompt#

Standard generation prompt (thinking disabled)

std::string generationPromptThinking#

Generation prompt with thinking enabled (optional, model-specific)

std::string defaultSystemPrompt#

Default system prompt.
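The role and generation-prompt fields above suggest a wrap-and-concatenate formatting rule: each message is surrounded by its role's prefix/suffix, and the generation prompt is appended last. The sketch below illustrates that rule with simplified standalone types; the exact algorithm and the token strings used are assumptions, not the library's implementation.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Simplified stand-in for ChatTemplateRole: prefix/suffix per role.
struct Role { std::string prefix, suffix; };

// A single chat message: role name ("system", "user", ...) plus content.
struct Message { std::string role, content; };

// Hypothetical formatting rule: wrap each message in its role's
// prefix/suffix, skip unknown roles, then append the generation prompt.
std::string formatChat(std::unordered_map<std::string, Role> const &roles,
                       std::vector<Message> const &messages,
                       std::string const &generationPrompt) {
    std::string out;
    for (auto const &m : messages) {
        auto it = roles.find(m.role);
        if (it == roles.end()) continue;  // unknown role: skip
        out += it->second.prefix + m.content + it->second.suffix;
    }
    out += generationPrompt;  // cue the model to produce the reply
    return out;
}
```

With placeholder control tokens, two messages and a generation prompt produce a single flat prompt string ready for encoding.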

struct textPartition#

Text partition representation for tokenization.

Represents either a special token or a raw text segment to be tokenized.

Public Functions

inline textPartition(Rank _token)#

Constructor for special token partition.

Parameters:

_token – Token ID for the special token

inline textPartition(
std::string const &_rawText,
int _offset,
int _length
)#

Constructor for raw text partition.

Parameters:
  • _rawText – Reference to the raw text string

  • _offset – Offset into the raw text string

  • _length – Length of the text partition

Public Members

const TEXT_PART_TYPE type#

Type of partition (special token or raw text)

Rank const token#

Token ID (valid when type is TEXT_PART_SPECIAL_TOKEN)

std::string const _dummy#

Empty placeholder string that the rawText reference binds to in special-token partitions (a const reference member must always be bound to something).

std::string const &rawText#

Reference to raw text (valid when type is TEXT_PART_RAW_TEXT)

int const offset#

Offset into rawText.

int const length#

Length of the partition.
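Because rawText is a reference member, a raw-text partition is a non-owning view: it must not outlive the string it refers to. The standalone sketch below mirrors that design with simplified names (a pointer stands in for the reference member, and Rank is assumed to be an integer token ID).

```cpp
#include <cassert>
#include <string>

using Rank = int;  // assumption: Rank is an integer token ID

// Minimal analog of textPartition: a tagged union of either a special
// token or an (offset, length) window into a caller-owned string.
struct Partition {
    enum Type { SpecialToken, RawText } type;
    Rank token;                  // valid when type == SpecialToken
    std::string const *rawText;  // non-owning; caller keeps the string alive
    int offset = 0, length = 0;

    explicit Partition(Rank t)
        : type(SpecialToken), token(t), rawText(nullptr) {}

    Partition(std::string const &s, int off, int len)
        : type(RawText), token(-1), rawText(&s), offset(off), length(len) {}

    // Materialize the text window (empty for special-token partitions).
    std::string view() const {
        return type == RawText ? rawText->substr(offset, length) : "";
    }
};
```

A pretokenizer would emit a sequence of such partitions: special tokens pass through as fixed IDs, while raw-text windows are handed to the encoder.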