Tokenizer Utils

struct codepointFlags

Unicode codepoint flags.

Bitfield structure for Unicode codepoint properties.

Public Types

enum CategoryFlags

Category flag constants.

Values:

enumerator UNDEFINED = 0x0001

Undefined category.

enumerator NUMBER = 0x0002

Number category (\p{N}).

enumerator LETTER = 0x0004

Letter category (\p{L}).

enumerator SEPARATOR = 0x0008

Separator category (\p{Z}).

enumerator ACCENT_MARK = 0x0010

Accent mark category (\p{M}).

enumerator PUNCTUATION = 0x0020

Punctuation category (\p{P}).

enumerator SYMBOL = 0x0040

Symbol category (\p{S}).

enumerator CONTROL = 0x0080

Control character category (\p{C}).

enumerator MASK_CATEGORIES = 0x00FF

Mask covering all eight category flag bits (the low byte of the flag value).
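
The relationship between the category constants and MASK_CATEGORIES can be checked with a short sketch. The namespace `sketch`, the enum name `CategoryFlagsSketch`, and the helper `categoryBits` are hypothetical stand-ins written for this illustration; only the enumerator names and values come from the list above.

```cpp
#include <cstdint>

namespace sketch {

// Hypothetical mirror of the documented enumerator values.
enum CategoryFlagsSketch : uint16_t {
    UNDEFINED       = 0x0001,
    NUMBER          = 0x0002,
    LETTER          = 0x0004,
    SEPARATOR       = 0x0008,
    ACCENT_MARK     = 0x0010,
    PUNCTUATION     = 0x0020,
    SYMBOL          = 0x0040,
    CONTROL         = 0x0080,
    MASK_CATEGORIES = 0x00FF
};

// Hypothetical helper: keep only the category bits of a raw 16-bit flag word.
constexpr uint16_t categoryBits(uint16_t raw) noexcept {
    return static_cast<uint16_t>(raw & MASK_CATEGORIES);
}

// The mask is exactly the union of the eight single-bit category constants.
static_assert((UNDEFINED | NUMBER | LETTER | SEPARATOR | ACCENT_MARK |
               PUNCTUATION | SYMBOL | CONTROL) == MASK_CATEGORIES,
              "mask should cover every category bit");

// Bits above the low byte are dropped by the mask.
static_assert(categoryBits(0x0204) == LETTER,
              "only the LETTER bit should survive masking");

} // namespace sketch
```

Because each category occupies its own bit, masking a raw value with MASK_CATEGORIES leaves only category information and discards any bits stored above the low byte.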

Public Functions

inline codepointFlags(uint16_t const flags = 0) noexcept

Construct from a raw uint16_t flag value.

Parameters:

flags – Raw 16-bit flag value (defaults to 0, meaning no flags set)

inline uint16_t asUint() const noexcept

Convert to uint16_t.

Returns:

All flags packed into a single uint16_t

inline uint16_t categoryFlag() const noexcept

Get category flag.

Returns:

Category flag value
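
A minimal sketch of how the three functions above might fit together. `CodepointFlagsSketch` and `isLetter` are hypothetical stand-ins, not the library's implementation: the sketch stores the raw word in a plain member rather than in bitfields, and it assumes that categoryFlag() returns the raw value masked with MASK_CATEGORIES, which the reference above does not state explicitly.

```cpp
#include <cstdint>

// Hypothetical stand-in for codepointFlags. It keeps the raw 16-bit word
// directly so the example does not depend on compiler bitfield layout.
struct CodepointFlagsSketch {
    // Subset of the documented category constants used by this example.
    enum : uint16_t {
        NUMBER          = 0x0002,
        LETTER          = 0x0004,
        MASK_CATEGORIES = 0x00FF
    };

    // Construct from raw flags; defaults to 0 like the documented constructor.
    constexpr CodepointFlagsSketch(uint16_t const flags = 0) noexcept
        : mBits(flags) {}

    // Return every flag bit as one 16-bit value.
    constexpr uint16_t asUint() const noexcept { return mBits; }

    // Assumption: the category flag is the raw value masked to the low byte.
    constexpr uint16_t categoryFlag() const noexcept {
        return static_cast<uint16_t>(mBits & MASK_CATEGORIES);
    }

private:
    uint16_t mBits;
};

// Hypothetical helper: true when the category bits equal LETTER.
inline bool isLetter(CodepointFlagsSketch const f) noexcept {
    return f.categoryFlag() == CodepointFlagsSketch::LETTER;
}
```

Under these assumptions, a value constructed from 0x0304 round-trips through asUint() unchanged, while categoryFlag() yields 0x0004 (LETTER) because the bits above the low byte are masked off.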

Public Members

uint16_t isUndefined

Is undefined.

uint16_t isNumber

Is number (\p{N}).

uint16_t isLetter

Is letter (\p{L}).

uint16_t isSeparator

Is separator (\p{Z}).

uint16_t isAccentMark

Is accent mark (\p{M}).

uint16_t isPunctuation

Is punctuation (\p{P}).

uint16_t isSymbol

Is symbol (\p{S}).

uint16_t isControl

Is control character (\p{C}).

uint16_t isWhitespace

Is whitespace (\s).

uint16_t isLowercase

Is lowercase.

uint16_t isUppercase

Is uppercase.

uint16_t isNfd

Has an NFD (canonical decomposition) form.
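
The members above can be combined into small predicates, for example when deciding how to group codepoints before tokenization. The sketch below is illustrative only: `FlagsSketch`, `isWordLike`, and `isCasedLetter` are hypothetical names, and the 1-bit width of each member is an assumption based on the struct being described as a bitfield.

```cpp
#include <cstdint>

// Hypothetical mirror of the documented members; the 1-bit widths are assumed.
struct FlagsSketch {
    uint16_t isUndefined   : 1;
    uint16_t isNumber      : 1;
    uint16_t isLetter      : 1;
    uint16_t isSeparator   : 1;
    uint16_t isAccentMark  : 1;
    uint16_t isPunctuation : 1;
    uint16_t isSymbol      : 1;
    uint16_t isControl     : 1;
    uint16_t isWhitespace  : 1;
    uint16_t isLowercase   : 1;
    uint16_t isUppercase   : 1;
    uint16_t isNfd         : 1;
};

// Hypothetical helper: letters, numbers, and accent marks usually stay
// together inside a word-like run.
inline bool isWordLike(FlagsSketch const& f) noexcept {
    return f.isLetter || f.isNumber || f.isAccentMark;
}

// Hypothetical helper: a letter that also carries a case flag.
inline bool isCasedLetter(FlagsSketch const& f) noexcept {
    return f.isLetter && (f.isLowercase || f.isUppercase);
}
```

The category members and the helper members (whitespace, case, NFD) are independent bits, so a single codepoint can carry one category together with several helper flags.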