generate

A wrapper over the TensorRT-LLM high-level API runner.

Classes

LLM

A wrapper over the tensorrt_llm.llmapi.llm.LLM for LLM profiling and validation.

class LLM

Bases: LLM

A wrapper over the tensorrt_llm.llmapi.llm.LLM for LLM profiling and validation.

__init__(checkpoint_dir, tokenizer=None, kv_cache_config={}, medusa_choices=None, tp=0, trust_remote_code=False)

Initializes the LLM runner class.

Parameters:
  • checkpoint_dir (str | Path) – the directory path of the TensorRT-LLM engine or checkpoint.

  • tokenizer (str | Path | tensorrt_llm.llmapi.tokenizer.TokenizerBase | None) – the tokenizer, for example a Hugging Face tokenizer.

  • kv_cache_config (dict[str, int | float]) – the KV cache config as a dict. See https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/ for details.

  • medusa_choices (Any) – The medusa choices for the decoding config.

  • tp (int) – the tensor parallel size (for the torch backend). If 0, it will be set to the number of GPUs.

  • trust_remote_code (bool) – whether to trust the remote code (for the torch backend).

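Example (a minimal sketch; the import path and the kv_cache_config key below are assumptions not shown on this page):

    # Hypothetical usage: the module path and the kv_cache_config key are
    # illustrative assumptions; consult the linked tuning guide for real keys.
    from modelopt.deploy.llm import LLM

    llm = LLM(
        checkpoint_dir="/path/to/trtllm_engine",            # engine/checkpoint directory
        kv_cache_config={"free_gpu_memory_fraction": 0.8},  # assumed key
        tp=0,                                               # 0 -> use all available GPUs
    )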

property gather_context_logits

Whether context_logits can be returned from the LLM instance.

generate_context_logits(prompts, temperature=1.0, top_p=None)

Generates the context logits based on the input prompts.

Parameters:
  • prompts (Iterable[str] | Iterable[list[int]]) – The input prompts, either strings or lists of token IDs.

  • temperature (float) – The sampling temperature.

  • top_p (float) – The nucleus sampling parameter.

Returns:

a list of context_logits tensors, one per prompt.

Return type:

list[torch.Tensor]
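
A short usage sketch, guarded on the gather_context_logits property; the per-tensor shape noted in the comment is an assumption:

    prompts = ["The capital of France is", "1 + 1 ="]
    if llm.gather_context_logits:
        logits = llm.generate_context_logits(prompts, temperature=1.0)
        # One tensor per prompt; shape assumed to be [prompt_len, vocab_size].
        for t in logits:
            print(t.shape)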

generate_text(prompts, max_new_tokens, temperature=1.0, top_p=None, stop_words=None)

Generates the text based on the input prompts.

Parameters:
  • prompts (Iterable[str] | Iterable[list[int]]) – The input prompts, either strings or lists of token IDs.

  • max_new_tokens (int) – The maximum number of new tokens to generate.

  • temperature (float) – The sampling temperature.

  • top_p (float) – The nucleus sampling parameter.

  • stop_words (list[str]) – A list of words that generation stops on.

Returns:

a list of output text strings if max_beam_width is 1, otherwise a 2D list with shape [batch, beam].

Return type:

list[str] | list[list[str]]
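
For example, reusing the llm instance from the constructor sketch above:

    prompts = ["The capital of France is", "1 + 1 ="]
    texts = llm.generate_text(
        prompts,
        max_new_tokens=32,
        temperature=0.8,
        top_p=0.95,
        stop_words=["\n"],
    )
    # With max_beam_width == 1, texts[i] is the completion for prompts[i].
    for prompt, text in zip(prompts, texts):
        print(f"{prompt!r} -> {text!r}")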

generate_tokens(prompts, max_new_tokens, temperature=1.0, top_p=None, stop_words=None)

Generates the tokens based on the input prompts.

Parameters:
  • prompts (Iterable[str] | Iterable[list[int]]) – The input prompts, either strings or lists of token IDs.

  • max_new_tokens (int) – The maximum number of new tokens to generate.

  • temperature (float) – The sampling temperature.

  • top_p (float) – The nucleus sampling parameter.

  • stop_words (list[str]) – A list of words that generation stops on.

Returns:

a list of output token lists if max_beam_width is 1, otherwise a 3D list with shape [batch, beam, sequence_len].

Return type:

list[list[int]] | list[list[list[int]]]
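
A token-level sketch; decoding through llm.tokenizer assumes the tokenizer attribute inherited from tensorrt_llm.llmapi.llm.LLM is populated:

    prompts = ["Hello, my name is"]
    token_lists = llm.generate_tokens(prompts, max_new_tokens=16)
    # With max_beam_width == 1, each entry is a flat list of token IDs.
    for ids in token_lists:
        print(ids)
        # print(llm.tokenizer.decode(ids))  # assumes a populated tokenizer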

property max_beam_width

Get the max beam width from the LLM instance.

property max_seq_len

Get the max sequence length from the LLM instance.
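
These properties can be used, for instance, to validate a request against the engine limits before generating (a sketch with hypothetical token IDs):

    prompt_ids = [101, 2023, 2003]  # hypothetical token IDs
    max_new_tokens = 64
    assert len(prompt_ids) + max_new_tokens <= llm.max_seq_len, "prompt too long"
    if llm.max_beam_width == 1:
        # Outputs stay flat (no beam dimension); see generate_tokens above.
        outputs = llm.generate_tokens([prompt_ids], max_new_tokens=max_new_tokens)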