nemo_utils

Utilities supporting NeMo models.

Classes

CustomSentencePieceTokenizer

Custom tokenizer based on the NeMo SentencePieceTokenizer.

Functions

get_nemo_tokenizer

Build tokenizer from Nemo tokenizer config.

get_tokenzier

Load the tokenizer from a decoded NeMo weights directory.

class CustomSentencePieceTokenizer

Bases: PreTrainedTokenizer

Custom tokenizer based on the NeMo SentencePieceTokenizer.

This extension of SentencePieceTokenizer makes its API consistent with HuggingFace tokenizers, so that the evaluation tools in the examples/tensorrt_llm/scripts/nemo_example.sh script can be run.
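The API-consistency pattern the class follows can be illustrated with a self-contained sketch. The class below is a toy stand-in (a whitespace tokenizer, not the real SentencePiece-backed implementation); it only shows how batch_decode and batch_encode_plus wrap the single-item encode/decode methods to mirror the HuggingFace tokenizer surface, including the return_tensors and max_length parameters accepted by encode.

```python
class ToyTokenizer:
    """Toy illustration of the HF-style tokenizer surface; not the real class."""

    def __init__(self, vocab):
        self.vocab = vocab                              # token -> id
        self.inv = {i: t for t, i in vocab.items()}     # id -> token

    def encode(self, text, return_tensors=None, max_length=None, **kwargs):
        # kwargs other than return_tensors and max_length are ignored,
        # matching the documented behavior of CustomSentencePieceTokenizer.encode.
        ids = [self.vocab[w] for w in text.split()]
        if max_length is not None:
            ids = ids[:max_length]
        return ids

    def decode(self, ids, **kwargs):
        # kwargs are ignored; present only for HF API consistency.
        return " ".join(self.inv[i] for i in ids)

    def batch_decode(self, ids, **kwargs):
        # Batch variant delegates to the single-item method.
        return [self.decode(x, **kwargs) for x in ids]

    def batch_encode_plus(self, texts, **kwargs):
        # HF-style dict result keyed by "input_ids"; kwargs are ignored.
        return {"input_ids": [self.encode(t) for t in texts]}
```

The real class instead delegates to a SentencePiece model loaded from a NeMo checkpoint; only the method names and calling conventions shown here carry over.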

__init__(*args, **kwargs)

Constructor with an extra check for the non-legacy SentencePieceTokenizer variant.

batch_decode(ids, **kwargs)

Method introduced for HF tokenizers API consistency for evaluation scripts.

batch_encode_plus(texts, **kwargs)

Method introduced for HF tokenizers API consistency for evaluation scripts.

Note: kwargs are ignored.

decode(ids, **kwargs)

Method introduced for HF tokenizers API consistency for evaluation scripts.

Note: kwargs are ignored.

encode(text, return_tensors=None, max_length=None, **kwargs)

Method introduced for HF tokenizers API consistency for evaluation scripts.

Note: kwargs other than return_tensors and max_length are ignored.

property eos_token

The end-of-sequence (EOS) token.

property eos_token_id

The ID of the end-of-sequence (EOS) token.

property pad_token

The padding token.

property pad_token_id

The ID of the padding token.

get_nemo_tokenizer(tokenizer_cfg_path)

Build tokenizer from Nemo tokenizer config.

Refer to the logic of the get_nmt_tokenizer function for how tokenizers are instantiated in NeMo: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/modules/common/tokenizer_utils.py.

Parameters:

tokenizer_cfg_path (str) –
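For orientation, a tokenizer config passed via tokenizer_cfg_path is a YAML file of the shape used by NeMo's tokenizer setup. The fragment below is a hedged sketch: the field names (library, type, model, vocab_file, merge_file) follow NeMo's tokenizer config conventions, and the file name tokenizer.model is only an example.

```yaml
# Sketch of a NeMo tokenizer config (field values are illustrative).
library: sentencepiece
type: null
model: tokenizer.model   # path to the SentencePiece model file
vocab_file: null
merge_file: null
```

get_nemo_tokenizer would then be called with the path to such a file.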

get_tokenzier(tokenizer_dir_or_path)

Load the tokenizer from a decoded NeMo weights directory.

Parameters:

tokenizer_dir_or_path (Path) –

Return type:

PreTrainedTokenizer