nemo_utils

Utilities supporting NeMo models.

Classes

CustomSentencePieceTokenizer

Custom tokenizer based on the NeMo SentencePieceTokenizer.

Functions

get_nemo_tokenizer

Build tokenizer from Nemo tokenizer config.

get_tokenzier

Load the tokenizer from a decoded NeMo weights directory.

class CustomSentencePieceTokenizer

Bases: PreTrainedTokenizer

Custom tokenizer based on the NeMo SentencePieceTokenizer.

This extension of SentencePieceTokenizer makes its API consistent with HuggingFace tokenizers, so that the evaluation tools in the examples/tensorrt_llm/scripts/nemo_example.sh script can be run.
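The API-consistency pattern the class follows can be illustrated with a self-contained sketch. The class below is a toy stand-in (a whitespace tokenizer, not the real SentencePiece-backed implementation); it only shows how batch_decode and batch_encode_plus wrap the single-item encode/decode methods to mirror the HuggingFace tokenizer surface, including the return_tensors and max_length parameters accepted by encode.

```python
class ToyTokenizer:
    """Toy illustration of the HF-style tokenizer surface; not the real class."""

    def __init__(self, vocab):
        self.vocab = vocab                              # token -> id
        self.inv = {i: t for t, i in vocab.items()}     # id -> token

    def encode(self, text, return_tensors=None, max_length=None, **kwargs):
        # kwargs other than return_tensors and max_length are ignored,
        # matching the documented behavior of CustomSentencePieceTokenizer.encode.
        ids = [self.vocab[w] for w in text.split()]
        if max_length is not None:
            ids = ids[:max_length]
        return ids

    def decode(self, ids, **kwargs):
        # kwargs are ignored; present only for HF API consistency.
        return " ".join(self.inv[i] for i in ids)

    def batch_decode(self, ids, **kwargs):
        # Batch variant delegates to the single-item method.
        return [self.decode(x, **kwargs) for x in ids]

    def batch_encode_plus(self, texts, **kwargs):
        # HF-style dict result keyed by "input_ids"; kwargs are ignored.
        return {"input_ids": [self.encode(t) for t in texts]}
```

The real class instead delegates to a SentencePiece model loaded from a NeMo checkpoint; only the method names and calling conventions shown here carry over.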

__init__(*args, **kwargs)

Constructor with an extra check for the non-legacy SentencePieceTokenizer variant.

batch_decode(ids, **kwargs)

Method introduced for HF tokenizers API consistency for evaluation scripts.

batch_encode_plus(texts, **kwargs)

Method introduced for HF tokenizers API consistency for evaluation scripts.

Note: kwargs are ignored.

decode(ids, **kwargs)

Method introduced for HF tokenizers API consistency for evaluation scripts.

Note: kwargs are ignored.

encode(text, return_tensors=None, max_length=None, **kwargs)

Method introduced for HF tokenizers API consistency for evaluation scripts.

Note: kwargs other than return_tensors and max_length are ignored.

property eos_token

The end-of-sequence (EOS) token.

property eos_token_id

The ID of the end-of-sequence (EOS) token.

property pad_token

The padding token.

property pad_token_id

The ID of the padding token.

get_nemo_tokenizer(tokenizer_cfg_path)

Build tokenizer from Nemo tokenizer config.

Refer to the logic of the get_nmt_tokenizer function for how tokenizers are instantiated in NeMo: https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/modules/common/tokenizer_utils.py.

Parameters:

tokenizer_cfg_path (str) –
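For orientation, a tokenizer config passed via tokenizer_cfg_path is a YAML file of the shape used by NeMo's tokenizer setup. The fragment below is a hedged sketch: the field names (library, type, model, vocab_file, merge_file) follow NeMo's tokenizer config conventions, and the file name tokenizer.model is only an example.

```yaml
# Sketch of a NeMo tokenizer config (field values are illustrative).
library: sentencepiece
type: null
model: tokenizer.model   # path to the SentencePiece model file
vocab_file: null
merge_file: null
```

get_nemo_tokenizer would then be called with the path to such a file.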

get_tokenzier(tokenizer_dir_or_path)

Load the tokenizer from a decoded NeMo weights directory.

Parameters:

tokenizer_dir_or_path (Path) –

Return type:

PreTrainedTokenizer