generate

A wrapper over the TensorRT-LLM high-level API runner.

Classes

LLM

A wrapper over the tensorrt_llm.llmapi.llm.LLM for LLM profiling and validation.

class LLM

Bases: LLM

A wrapper over the tensorrt_llm.llmapi.llm.LLM for LLM profiling and validation.

__init__(checkpoint_dir, tokenizer=None, kv_cache_config={}, medusa_choices=None, tp=0, trust_remote_code=False)

Initializes the LLM runner class.

Parameters:
  • checkpoint_dir (str | Path) – the directory path of the TensorRT-LLM engine or checkpoint.

  • tokenizer (str | Path | tensorrt_llm.llmapi.tokenizer.TokenizerBase | None) – the tokenizer, for example a Hugging Face tokenizer.

  • kv_cache_config (dict[str, int | float]) – the KV cache config as a dict. See https://nvidia.github.io/TensorRT-LLM/performance/performance-tuning-guide/ for details.

  • medusa_choices (Any) – The medusa choices for the decoding config.

  • tp (int) – the tensor parallel size (for the torch backend). If 0, it will be set to the number of GPUs.

  • trust_remote_code (bool) – whether to trust the remote code (for the torch backend).

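Example (a minimal sketch; the import path and the kv_cache_config key below are assumptions not shown on this page):

    # Hypothetical usage: the module path and the kv_cache_config key are
    # illustrative assumptions; consult the linked tuning guide for real keys.
    from modelopt.deploy.llm import LLM

    llm = LLM(
        checkpoint_dir="/path/to/trtllm_engine",            # engine/checkpoint directory
        kv_cache_config={"free_gpu_memory_fraction": 0.8},  # assumed key
        tp=0,                                               # 0 -> use all available GPUs
    )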

property gather_context_logits

Whether context_logits can be returned from the LLM instance.

generate_context_logits(prompts, temperature=1.0, top_p=None)

Generates the context logits based on the input prompts.

Parameters:
  • prompts (Iterable[str] | Iterable[list[int]]) – The input prompts, either strings or lists of token IDs.

  • temperature (float) – The sampling temperature.

  • top_p (float) – The nucleus sampling parameter.

Returns:

a list of context_logits tensors, one per prompt.

Return type:

list[torch.Tensor]
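
A short usage sketch, guarded on the gather_context_logits property; the per-tensor shape noted in the comment is an assumption:

    prompts = ["The capital of France is", "1 + 1 ="]
    if llm.gather_context_logits:
        logits = llm.generate_context_logits(prompts, temperature=1.0)
        # One tensor per prompt; shape assumed to be [prompt_len, vocab_size].
        for t in logits:
            print(t.shape)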

generate_text(prompts, max_new_tokens, temperature=1.0, top_p=None, stop_words=None)

Generates the text based on the input prompts.

Parameters:
  • prompts (Iterable[str] | Iterable[list[int]]) – The input prompts, either strings or lists of token IDs.

  • max_new_tokens (int) – The maximum number of new tokens to generate.

  • temperature (float) – The sampling temperature.

  • top_p (float) – The nucleus sampling parameter.

  • stop_words (list[str]) – A list of words that generation stops on.

Returns:

a list of output text strings if max_beam_width is 1, otherwise a 2D list with shape [batch, beam].

Return type:

list[str] | list[list[str]]
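
For example, reusing the llm instance from the constructor sketch above:

    prompts = ["The capital of France is", "1 + 1 ="]
    texts = llm.generate_text(
        prompts,
        max_new_tokens=32,
        temperature=0.8,
        top_p=0.95,
        stop_words=["\n"],
    )
    # With max_beam_width == 1, texts[i] is the completion for prompts[i].
    for prompt, text in zip(prompts, texts):
        print(f"{prompt!r} -> {text!r}")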

generate_tokens(prompts, max_new_tokens, temperature=1.0, top_p=None, stop_words=None)

Generates the tokens based on the input prompts.

Parameters:
  • prompts (Iterable[str] | Iterable[list[int]]) – The input prompts, either strings or lists of token IDs.

  • max_new_tokens (int) – The maximum number of new tokens to generate.

  • temperature (float) – The sampling temperature.

  • top_p (float) – The nucleus sampling parameter.

  • stop_words (list[str]) – A list of words that generation stops on.

Returns:

a list of output token lists if max_beam_width is 1, otherwise a 3D list with shape [batch, beam, sequence_len].

Return type:

list[list[int]] | list[list[list[int]]]
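
A token-level sketch; decoding through llm.tokenizer assumes the tokenizer attribute inherited from tensorrt_llm.llmapi.llm.LLM is populated:

    prompts = ["Hello, my name is"]
    token_lists = llm.generate_tokens(prompts, max_new_tokens=16)
    # With max_beam_width == 1, each entry is a flat list of token IDs.
    for ids in token_lists:
        print(ids)
        # print(llm.tokenizer.decode(ids))  # assumes a populated tokenizer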

property max_beam_width

Get the max beam width from the LLM instance.

property max_seq_len

Get the max sequence length from the LLM instance.
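
These properties can be used, for instance, to validate a request against the engine limits before generating (a sketch with hypothetical token IDs):

    prompt_ids = [101, 2023, 2003]  # hypothetical token IDs
    max_new_tokens = 64
    assert len(prompt_ids) + max_new_tokens <= llm.max_seq_len, "prompt too long"
    if llm.max_beam_width == 1:
        # Outputs stay flat (no beam dimension); see generate_tokens above.
        outputs = llm.generate_tokens([prompt_ids], max_new_tokens=max_new_tokens)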