Tokenizer#
-
class Tokenizer#
Tokenizer class for encoding and decoding text.
Provides pretokenization, encoding, and decoding, and supports loading from HuggingFace model directories.
Public Functions
-
Tokenizer()#
-
~Tokenizer() = default#
-
std::vector<Rank> encode(std::string const &text, bool addBos = false, bool addEos = false)#
Encode text to token IDs.
- Parameters:
text – Input text to encode
addBos – Whether to add beginning-of-sequence token
addEos – Whether to add end-of-sequence token
- Returns:
Vector of token IDs
-
std::string decode(std::vector<Rank> const &tokens, bool skipSpecialTokens = false)#
Decode token IDs back to text.
- Parameters:
tokens – Vector of token IDs
skipSpecialTokens – Whether to skip special tokens in output
- Returns:
Decoded text string
-
bool loadFromHF(std::filesystem::path const &modelDir)#
Load tokenizer from HuggingFace model directory.
- Parameters:
modelDir – Path to the model directory containing tokenizer files
- Returns:
true if the directory exists, tokenizer.json is found and parsed successfully, and the pretokenizer and encoder are created successfully; false if the directory doesn't exist, tokenizer.json is missing/corrupt, or initialization fails
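A minimal end-to-end sketch using loadFromHF, encode, and decode. The header name and model path are assumptions for illustration, not part of the documented API; Rank is assumed to be brought in by the same header:

    #include <filesystem>
    #include <iostream>
    #include <string>
    #include <vector>

    #include "Tokenizer.h"  // assumed header name

    int main() {
        Tokenizer tok;
        // Illustrative path; the directory must contain tokenizer.json (see loadFromHF above).
        if (!tok.loadFromHF(std::filesystem::path{"/models/my-model"})) {
            std::cerr << "tokenizer initialization failed\n";
            return 1;
        }

        // Encode with a BOS token prepended, then decode back, dropping special tokens.
        std::vector<Rank> ids = tok.encode("Hello, world!", /*addBos=*/true);
        std::string text = tok.decode(ids, /*skipSpecialTokens=*/true);

        std::cout << ids.size() << " tokens -> \"" << text << "\"\n";
        return 0;
    }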
-
inline int getNumVocab() const noexcept#
Get total vocabulary size.
- Returns:
Number of tokens in vocabulary
-
inline Rank getBosId() const noexcept#
Get beginning-of-sequence token ID.
- Returns:
BOS token ID
-
inline Rank getEosId() const noexcept#
Get end-of-sequence token ID.
- Returns:
EOS token ID
-
inline Rank getPadId() const noexcept#
Get padding token ID.
- Returns:
PAD token ID (returns EOS if PAD is not set)
-
inline Rank getUnkId() const noexcept#
Get unknown token ID.
- Returns:
UNK token ID
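Continuing the sketch above with a loaded `tok`, the accessors can be queried directly (this assumes Rank is an integral type that can be streamed):

    // Assumes `tok` was loaded successfully, as in the sketch above.
    std::cout << "vocab size: " << tok.getNumVocab() << "\n"
              << "BOS=" << tok.getBosId() << " EOS=" << tok.getEosId()
              << " PAD=" << tok.getPadId()   // falls back to EOS when PAD is unset
              << " UNK=" << tok.getUnkId() << "\n";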
-
bool isInitialized() const noexcept#
Check if tokenizer is properly initialized.
- Returns:
true if initialized, false otherwise
-
bool loadChatTemplate(std::filesystem::path const &chatTemplateFile)#
Load chat template configuration from JSON file.
- Parameters:
chatTemplateFile – Path to the processed_chat_template.json file
- Returns:
true if chat template is loaded successfully; false if file doesn’t exist or parsing fails
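A hedged sketch of loading the template alongside the tokenizer; the directory is illustrative, while the file name follows the convention documented above:

    // Assumes `tok` was loaded via loadFromHF, as in the earlier sketch.
    std::filesystem::path tmpl = "/models/my-model/processed_chat_template.json";
    if (!tok.loadChatTemplate(tmpl)) {
        // No template: requests can still be formatted with raw concatenation
        // by calling applyChatTemplate(..., /*applyChatTemplate=*/false, ...).
        std::cerr << "chat template not loaded\n";
    }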
-
bool applyChatTemplate(rt::LLMGenerationRequest::Request const &request, rt::LLMGenerationRequest::FormattedRequest &formattedRequest, bool applyChatTemplate = true, bool addGenerationPrompt = true, bool enableThinking = false)#
Apply chat template to a request.
- Parameters:
request – Request object containing messages
formattedRequest – Output formatted request object that will be populated
applyChatTemplate – Whether to apply full chat template formatting (with special tokens) or raw concatenation
addGenerationPrompt – Whether to add generation prompt at the end (only used when applyChatTemplate is true)
enableThinking – Whether to enable thinking mode for models that support it
- Returns:
true if the chat template is applied successfully; false if errors are encountered
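A sketch wrapping the common flag combination. Populating `Request` is left out because its fields are not documented in this section, and the helper name is hypothetical:

    // Hypothetical helper: apply full template formatting with a generation prompt.
    bool formatForGeneration(Tokenizer &tok,
                             rt::LLMGenerationRequest::Request const &request,
                             rt::LLMGenerationRequest::FormattedRequest &out,
                             bool thinking = false) {
        return tok.applyChatTemplate(request, out,
                                     /*applyChatTemplate=*/true,
                                     /*addGenerationPrompt=*/true,
                                     /*enableThinking=*/thinking);
    }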
-
inline std::string getDefaultSystemPrompt() const noexcept#
Get default system prompt from chat template.
- Returns:
Default system prompt string
-
struct ChatTemplateRole#
Chat template role configuration.
-
struct ChatTemplateContentType#
Chat template content type configuration.
Public Members
-
std::string format#
Format string for this content type.
-
struct ChatTemplateConfig#
Chat template configuration.
Public Members
-
std::string modelPath#
Model path or identifier.
-
std::unordered_map<std::string, ChatTemplateRole> roles#
Role configurations (system, user, assistant)
-
std::unordered_map<std::string, ChatTemplateContentType> contentTypes#
Content type configurations (text, image, video)
-
std::string generationPrompt#
Standard generation prompt (thinking disabled)
-
std::string generationPromptThinking#
Generation prompt with thinking enabled (optional, model-specific)
-
std::string defaultSystemPrompt#
Default system prompt.
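For illustration only, a hand-populated config using the documented members; the values and placeholder syntax are assumptions, since real configs are model-specific and normally loaded from processed_chat_template.json. The listing suggests these structs are nested in Tokenizer:

    Tokenizer::ChatTemplateConfig cfg;
    cfg.modelPath = "my-model";                               // illustrative identifier
    cfg.contentTypes["text"].format = "{content}";            // assumed placeholder syntax
    cfg.generationPrompt = "<|assistant|>\n";                 // illustrative, model-specific
    cfg.defaultSystemPrompt = "You are a helpful assistant."; // illustrative
    // ChatTemplateRole's members are not documented here, so roles are left empty.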
-
struct textPartition#
Text partition representation for tokenization.
Represents either a special token or a raw text segment to be tokenized.
Public Functions
-
inline textPartition(Rank _token)#
Constructor for special token partition.
- Parameters:
_token – Token ID for the special token
-
inline textPartition(std::string const &_rawText, int _offset, int _length)#
Constructor for raw text partition.
- Parameters:
_rawText – Reference to the raw text string
_offset – Offset into the raw text string
_length – Length of the text partition
Public Members
-
const TEXT_PART_TYPE type#
Type of partition (special token or raw text)
-
Rank const token#
Token ID (valid when type is TEXT_PART_SPECIAL_TOKEN)
-
std::string const _dummy#
Dummy string for special token partitions.
-
std::string const &rawText#
Reference to raw text (valid when type is TEXT_PART_RAW_TEXT)
-
int const offset#
Offset into rawText.
-
int const length#
Length of the partition.
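A construction sketch; the token ID and offsets are illustrative. Note that rawText is held by reference, so the source string must outlive the partition:

    std::string text = "Hello, world!";

    // Special-token partition: carries only a token ID (illustrative value).
    Tokenizer::textPartition bos(Rank{1});

    // Raw-text partition: references the "world" slice of `text` (offset 7, length 5).
    Tokenizer::textPartition word(text, /*_offset=*/7, /*_length=*/5);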