Utf8
- int trt_edgellm::utf8::leaderByteLen(unsigned char c) noexcept
Length (1–4) of a UTF-8 codepoint starting with this leader byte.
Returns 0 if c is not a valid leader: a continuation byte (10xxxxxx), a 5+ byte leader (11111xxx), or any other malformed leading bit pattern. Used by both the streaming sanitizer and the tokenizer codepoint decoder.
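The bit-pattern classification described above can be sketched as follows. This is a minimal illustration, not the library's implementation: leaderByteLenSketch and leaderByteLenExamples are hypothetical names, and the sketch assumes classification by leading bits only, leaving overlong detection to a later validity check.

```cpp
#include <cassert>

// Hypothetical stand-in for leaderByteLen; the real implementation may differ.
// Classifies a byte purely by its leading bit pattern.
inline int leaderByteLenSketch(unsigned char c) noexcept
{
    if ((c & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII
    if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((c & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                         // 10xxxxxx continuation or 11111xxx
}

// Usage examples expressed as assertions.
inline void leaderByteLenExamples()
{
    assert(leaderByteLenSketch('A') == 1);
    assert(leaderByteLenSketch(0xC3) == 2); // leader of U+00E9
    assert(leaderByteLenSketch(0xE2) == 3); // leader of U+20AC
    assert(leaderByteLenSketch(0xF0) == 4); // leader of U+1F600
    assert(leaderByteLenSketch(0x80) == 0); // continuation byte
    assert(leaderByteLenSketch(0xF8) == 0); // 5+ byte leader
}
```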
- uint32_t trt_edgellm::utf8::decodeCodepoint(unsigned char const *bytes, int need)
Decode a UTF-8 codepoint from bytes[0..need).
Preconditions (caller must verify): (1) need == leaderByteLen(bytes[0]) and need > 0; (2) every continuation byte matches (b & 0xC0) == 0x80.
Does NOT validate overlongs, UTF-16 surrogates, or codepoints > U+10FFFF; use isValidCodepointForLen for that.
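A decoder under these preconditions reduces to masking the leader and shifting in six payload bits per continuation byte. The sketch below assumes exactly that approach; decodeCodepointSketch and decodeCodepointExamples are hypothetical names, and the library's actual code may be organized differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for decodeCodepoint. Assumes the documented
// preconditions hold: need is in [1, 4], matches the leader byte, and
// every continuation byte has the form 10xxxxxx.
inline std::uint32_t decodeCodepointSketch(unsigned char const* bytes, int need)
{
    // Payload mask for the leader byte, indexed by sequence length.
    static constexpr unsigned char kLeaderMask[5] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};
    std::uint32_t cp = bytes[0] & kLeaderMask[need];
    for (int i = 1; i < need; ++i)
    {
        cp = (cp << 6) | (bytes[i] & 0x3Fu); // append 6 payload bits
    }
    return cp;
}

inline void decodeCodepointExamples()
{
    unsigned char const euro[] = {0xE2, 0x82, 0xAC};
    assert(decodeCodepointSketch(euro, 3) == 0x20AC);

    // An overlong encoding of U+0000 decodes without complaint,
    // which is why a separate validity check is needed afterwards.
    unsigned char const overlong[] = {0xC0, 0x80};
    assert(decodeCodepointSketch(overlong, 2) == 0x0);
}
```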
- bool trt_edgellm::utf8::isValidCodepointForLen(uint32_t cp, int need)
True iff cp is validly encoded as a need-byte UTF-8 codepoint.
Rejects overlongs for the given length, UTF-16 surrogates (U+D800..U+DFFF), and codepoints > U+10FFFF.
need must be in [1, 4].
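The three rejection rules can be sketched with a per-length minimum table. The names below are hypothetical, and returning false for need outside [1, 4] is an assumption of this sketch; the documentation only states that need must be in range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for isValidCodepointForLen; implements only the
// three checks listed in the documentation.
inline bool isValidCodepointForLenSketch(std::uint32_t cp, int need) noexcept
{
    // Smallest codepoint that genuinely requires `need` bytes.
    static constexpr std::uint32_t kMinForLen[5] = {0, 0x0, 0x80, 0x800, 0x10000};
    if (need < 1 || need > 4) return false;         // assumption: out-of-range need is rejected
    if (cp < kMinForLen[need]) return false;        // overlong for this length
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                // beyond the Unicode range
    return true;
}

inline void isValidCodepointForLenExamples()
{
    assert(isValidCodepointForLenSketch(0x20AC, 3));     // U+20AC in 3 bytes
    assert(!isValidCodepointForLenSketch(0x0, 2));       // overlong (C0 80)
    assert(!isValidCodepointForLenSketch(0xD800, 3));    // surrogate
    assert(!isValidCodepointForLenSketch(0x110000, 4));  // > U+10FFFF
}
```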
- std::string trt_edgellm::utf8::sanitizeUtf8Streaming(std::string const &buffer, std::string &pending)
Consume a raw byte buffer and produce a valid UTF-8 string.
Scans buffer and produces a well-formed UTF-8 output string. Invalid byte sequences (isolated continuation bytes, overlong encodings, surrogates, codepoints > U+10FFFF, bogus leaders) are replaced with the Unicode replacement character U+FFFD (“\xEF\xBF\xBD”).
If the buffer ends mid-codepoint (valid leader but insufficient continuation bytes), the trailing incomplete bytes are moved into pending for reuse on the next call and NOT emitted. This is the only case in which bytes are held.
pending is an in-out buffer: existing content is prepended to buffer at the start of each call, and replaced with the new trailing incomplete bytes (if any) on return.
Output always equals input in terms of Unicode codepoints modulo:
  - trailing incomplete bytes (moved to pending)
  - invalid byte sequences (replaced with U+FFFD)
- Parameters:
buffer – Input bytes to sanitize.
pending – In-out buffer: leftover incomplete bytes from a previous call are prepended to buffer; new trailing incomplete bytes (if any) are written back to pending on return.
- Returns:
Well-formed UTF-8 string with invalid byte sequences replaced.
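To make the pending hand-off concrete, here is a self-contained sketch of a streaming sanitizer with the same contract. Everything in namespace sketch is hypothetical and is not the library's code. In particular, how many U+FFFD characters a malformed sequence produces is an assumption here: the sketch emits one per rejected sequence, or one for a leader whose continuation bytes are wrong, then resumes at the next byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch
{
constexpr char kReplacement[] = "\xEF\xBF\xBD"; // U+FFFD encoded as UTF-8

inline int leaderLen(unsigned char c) noexcept
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0; // continuation byte or bogus leader
}

inline bool validForLen(std::uint32_t cp, std::size_t need) noexcept
{
    static constexpr std::uint32_t kMin[5] = {0, 0x0, 0x80, 0x800, 0x10000};
    return cp >= kMin[need] && !(cp >= 0xD800 && cp <= 0xDFFF) && cp <= 0x10FFFF;
}

inline std::string sanitizeStreaming(std::string const& buffer, std::string& pending)
{
    static constexpr unsigned char kMask[5] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};
    std::string const in = pending + buffer; // leftover bytes come first
    pending.clear();
    std::string out;
    out.reserve(in.size());
    auto const* bytes = reinterpret_cast<unsigned char const*>(in.data());

    std::size_t i = 0;
    while (i < in.size())
    {
        std::size_t const need = static_cast<std::size_t>(leaderLen(bytes[i]));
        if (need == 0) { out += kReplacement; ++i; continue; }

        std::size_t const left = in.size() - i;
        std::size_t const have = left < need ? left : need;

        // Check the continuation bytes that are present and decode as we go.
        bool contOk = true;
        std::uint32_t cp = bytes[i] & kMask[need];
        for (std::size_t k = 1; k < have; ++k)
        {
            if ((bytes[i + k] & 0xC0) != 0x80) { contOk = false; break; }
            cp = (cp << 6) | (bytes[i + k] & 0x3Fu);
        }
        if (!contOk) { out += kReplacement; ++i; continue; }

        // Valid so far but truncated: hold the tail for the next call.
        if (have < need) { pending.assign(in, i, have); break; }

        if (validForLen(cp, need)) out.append(in, i, need);
        else out += kReplacement; // overlong, surrogate, or > U+10FFFF
        i += need;
    }
    return out;
}
} // namespace sketch
```

With this sketch, feeding "caf\xC3" and then "\xA9!" through the same pending string yields "caf" from the first call and the two-byte é followed by "!" from the second, with pending empty afterwards.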
- std::string trt_edgellm::utf8::sanitizeUtf8Flush(std::string &pending)
Final-flush variant.
Emits all of pending as U+FFFD replacement characters (one per held byte) and clears pending. Used when the slot terminates with bytes still in-flight (e.g., model emitted EOS mid-codepoint) or for single-shot decode paths that have no further input to arrive.
- Parameters:
pending – In-out buffer: cleared on return.
- Returns:
String of U+FFFD codepoints, one per byte previously held in pending.
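The flush behavior is small enough to sketch directly: one replacement character per held byte, then clear. The names sanitizeFlushSketch and sanitizeFlushExamples are hypothetical and the code is illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for sanitizeUtf8Flush: emits one U+FFFD per byte
// still held in `pending`, then clears it.
inline std::string sanitizeFlushSketch(std::string& pending)
{
    std::string out;
    out.reserve(pending.size() * 3); // U+FFFD is 3 bytes in UTF-8
    for (std::size_t i = 0; i < pending.size(); ++i)
    {
        out += "\xEF\xBF\xBD";
    }
    pending.clear();
    return out;
}

inline void sanitizeFlushExamples()
{
    // Two bytes of a 3-byte sequence were held when the stream ended.
    std::string pending = "\xE2\x82";
    std::string const out = sanitizeFlushSketch(pending);
    assert(out == "\xEF\xBF\xBD\xEF\xBF\xBD");
    assert(pending.empty());
}
```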