generate
A wrapper over the TensorRT-LLM high-level API runner.
Classes
- LLM – A wrapper over the tensorrt_llm.hlapi.llm.LLM for LLM profiling and validation.
- class LLM
Bases: LLM
A wrapper over the tensorrt_llm.hlapi.llm.LLM for LLM profiling and validation.
- __init__(engine_dir, tokenizer=None, kv_cache_config={}, medusa_choices=None)
Initializes the LLM runner class.
- Parameters:
engine_dir (str | Path) – The directory path of the TensorRT-LLM engine.
tokenizer (str | Path | tensorrt_llm.hlapi.tokenizer.TokenizerBase | None) – The tokenizer, for example a Hugging Face tokenizer for the model.
kv_cache_config (Dict[str, int | float]) – The KV cache config as a dict; see https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-best-practices.md
medusa_choices (Any) – The Medusa choices tree for Medusa (speculative) decoding, if the engine was built with Medusa support.
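A minimal construction sketch follows; the import path, engine directory, and tokenizer name are assumptions, so adjust them to your installation:

```python
from pathlib import Path

# Assumed import path for this wrapper; adjust to your package layout.
from modelopt.deploy.llm import LLM

# Hypothetical directory containing a built TensorRT-LLM engine.
engine_dir = Path("/workspace/engines/llama-7b")

llm = LLM(
    engine_dir,
    tokenizer="meta-llama/Llama-2-7b-hf",  # a Hugging Face tokenizer name (assumption)
    kv_cache_config={"free_gpu_memory_fraction": 0.8},  # see the perf-best-practices guide
)
```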
- generate_context_logits(prompts, temperature=1.0, top_p=None)
Generates the context logits based on the input prompts.
- Parameters:
prompts (Iterable[str] | Iterable[List[int]]) – The input prompts, either a list of strings or a list of token lists.
temperature (float) – The sampling temperature.
top_p (float) – The nucleus sampling parameter.
- Returns:
A list of context_logits tensors, one per prompt.
- Return type:
List[Tensor]
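For example, continuing with the llm instance from the construction sketch above (the printed shape is an assumption about the logits layout):

```python
prompts = ["The capital of France is", "1 + 1 ="]

context_logits = llm.generate_context_logits(prompts, temperature=1.0)

# One logits tensor per prompt, e.g. shape [prompt_len, vocab_size] (assumed layout).
print(len(context_logits), context_logits[0].shape)
```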
- generate_text(prompts, max_new_tokens, temperature=1.0, top_p=None, keep_input_prompt=True, stop_words=None)
Generates the text based on the input prompts.
- Parameters:
prompts (Iterable[str] | Iterable[List[int]]) – The input prompts, either a list of strings or a list of token lists.
max_new_tokens (int) – The max output token length.
temperature (float) – The sampling temperature.
top_p (float) – The nucleus sampling parameter.
keep_input_prompt (bool) – Set to include the input prompts in the outputs.
stop_words (List[str]) – A list of words on which generation stops.
- Returns:
A list of output text strings if max_beam_width is 1, or a 2D list with shape [batch, beam] otherwise.
- Return type:
List[str] | List[List[str]]
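A usage sketch, again continuing with the llm instance from above (prompt contents and sampling values are illustrative):

```python
texts = llm.generate_text(
    prompts,
    max_new_tokens=32,
    temperature=0.7,
    keep_input_prompt=False,  # return only the newly generated text
    stop_words=["\n"],
)

# With max_beam_width == 1 this is a flat List[str], one string per prompt.
for text in texts:
    print(text)
```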
- generate_tokens(prompts, max_new_tokens, temperature=1.0, top_p=None, keep_input_prompt=True, stop_words=None)
Generates the tokens based on the input prompts.
- Parameters:
prompts (Iterable[str] | Iterable[List[int]]) – The input prompts, either a list of strings or a list of token lists.
max_new_tokens (int) – The max output token length.
temperature (float) – The sampling temperature.
top_p (float) – The nucleus sampling parameter.
keep_input_prompt (bool) – Set to include the input prompts in the outputs.
stop_words (List[str]) – A list of words on which generation stops.
- Returns:
A list of output token lists if max_beam_width is 1, or a 3D list with shape [batch, beam, sequence_len] otherwise.
- Return type:
List[List[int]] | List[List[List[int]]]
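The token-level variant can be exercised the same way; a short sketch using the llm instance from above:

```python
token_lists = llm.generate_tokens(
    prompts,
    max_new_tokens=32,
    keep_input_prompt=False,  # return only the newly generated token IDs
)

# With max_beam_width == 1: List[List[int]], one token-ID list per prompt.
first_output_ids = token_lists[0]
```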
- property max_beam_width
Get the max beam width from the LLM instance.
- property max_input_len
Get the max input length from the LLM instance.
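These properties expose engine limits that are useful for validating requests and interpreting output shapes; a small sketch under the same assumptions as above:

```python
# Engine limits exposed by the wrapper.
print("max beam width:", llm.max_beam_width)
print("max input length:", llm.max_input_len)

# Output nesting depends on the beam width: flat lists for 1, [batch, beam] otherwise.
beam_search_enabled = llm.max_beam_width > 1
```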