nemo_utils
Utilities supporting NeMo models.
Classes
CustomSentencePieceTokenizer: Custom tokenizer based on NeMo SentencePieceTokenizer.
Functions
get_nemo_tokenizer: Build tokenizer from NeMo tokenizer config.
get_tokenzier: Loads the tokenizer from the decoded NEMO weights dir.
- class CustomSentencePieceTokenizer
Bases:
PreTrainedTokenizer
Custom tokenizer based on Nemo SentencePieceTokenizer.
This extension of SentencePieceTokenizer makes its API consistent with HuggingFace tokenizers, so that the evaluation tools in the examples/tensorrt_llm/scripts/nemo_example.sh script can run against it.
- __init__(*args, **kwargs)
Constructor method with extra check for non-legacy SentencePieceTokenizer variant.
- batch_decode(ids, **kwargs)
Method introduced for HF tokenizers API consistency for evaluation scripts.
- batch_encode_plus(texts, **kwargs)
Method introduced for HF tokenizers API consistency for evaluation scripts.
Note: kwargs are ignored.
- decode(ids, **kwargs)
Method introduced for HF tokenizers API consistency for evaluation scripts.
Note: kwargs are ignored.
- encode(text, return_tensors=None, max_length=None, **kwargs)
Method introduced for HF tokenizers API consistency for evaluation scripts.
Note: kwargs other than return_tensors and max_length are ignored.
- property eos_token
eos_token.
- property eos_token_id
eos_token_id.
- property pad_token
pad_token.
- property pad_token_id
pad_token_id.
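To make the "HF tokenizers API consistency" concrete, the sketch below implements the same method surface (encode, decode, batch_encode_plus, batch_decode, plus the eos/pad properties) over a toy whitespace vocabulary. This is an illustration of the API shape only, not the library's SentencePiece-backed implementation; the class name and vocabulary are invented for the example.

```python
class ToyHFConsistentTokenizer:
    """Hypothetical stand-in showing the HF-style surface described above."""

    def __init__(self, vocab):
        self.vocab = vocab                          # token -> id
        self.inv = {i: t for t, i in vocab.items()}  # id -> token
        self.eos_token, self.pad_token = "</s>", "<pad>"

    @property
    def eos_token_id(self):
        return self.vocab[self.eos_token]

    @property
    def pad_token_id(self):
        return self.vocab[self.pad_token]

    def encode(self, text, return_tensors=None, max_length=None, **kwargs):
        # Other kwargs are ignored, mirroring the documented behavior.
        ids = [self.vocab[t] for t in text.split()]
        return ids[:max_length] if max_length is not None else ids

    def decode(self, ids, **kwargs):
        # kwargs are ignored, as noted in the docs above.
        return " ".join(self.inv[i] for i in ids)

    def batch_encode_plus(self, texts, **kwargs):
        # HF convention: a dict carrying an "input_ids" list of lists.
        return {"input_ids": [self.encode(t) for t in texts]}

    def batch_decode(self, ids, **kwargs):
        return [self.decode(seq) for seq in ids]
```

Evaluation scripts written against a HuggingFace tokenizer can then call `encode`/`batch_decode` without caring which backend is underneath, which is the point of the custom class.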
- get_nemo_tokenizer(tokenizer_cfg_path)
Build tokenizer from Nemo tokenizer config.
Refer to the logic of the get_nmt_tokenizer function for how tokenizers are instantiated in NeMo; see https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/modules/common/tokenizer_utils.py.
- Parameters:
tokenizer_cfg_path (str) –
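For orientation, a NeMo tokenizer config is typically the tokenizer section of a model YAML. The fragment below is a hedged sketch only: the key names follow NeMo's get_nmt_tokenizer conventions and are assumptions about a typical config, not a contract guaranteed by this module.

```yaml
# Hypothetical tokenizer config sketch; exact keys depend on the NeMo version.
library: sentencepiece            # tokenizer backend selected by NeMo
type: null                        # model name for HF tokenizers; unused for sentencepiece
model: /path/to/tokenizer.model   # path to the SentencePiece model file
```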
- get_tokenzier(tokenizer_dir_or_path)
Loads the tokenizer from the decoded NEMO weights dir.
- Parameters:
tokenizer_dir_or_path (Path) –
- Return type:
PreTrainedTokenizer