generate

A wrapper over the TensorRT-LLM high-level API runner.

Classes

LLM

A wrapper over the tensorrt_llm.hlapi.llm.LLM for LLM profiling and validation.

class LLM

Bases: LLM

A wrapper over the tensorrt_llm.hlapi.llm.LLM for LLM profiling and validation.

__init__(engine_dir, tokenizer=None, kv_cache_config={}, medusa_choices=None)

Initializes the LLM runner class.

Parameters:
  • engine_dir – The directory containing the built TensorRT-LLM engine.

  • tokenizer – The tokenizer to use; optional.

  • kv_cache_config – The KV cache configuration; defaults to an empty dict.

  • medusa_choices – The Medusa choices to use when the engine was built with Medusa speculative decoding; defaults to None.
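
A minimal construction sketch in Python; the import path and the file paths below are placeholder assumptions for illustration, not part of this reference:

    # Sketch only: adjust the import to wherever this module lives in your install.
    from generate import LLM  # hypothetical import path

    llm = LLM(
        engine_dir="/path/to/trt_llm_engine",  # built TensorRT-LLM engine (placeholder)
        tokenizer="/path/to/hf_tokenizer",     # optional tokenizer (placeholder)
        kv_cache_config={},                    # default KV cache settings
    )
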
generate_context_logits(prompts, temperature=1.0, top_p=None)

Generates the context logits based on the input prompts.

Parameters:
  • prompts (Iterable[str] | Iterable[List[int]]) – The input prompts, given as a list of strings or a list of token-id lists.

  • temperature (float) – The sampling temperature.

  • top_p (float) – The nucleus sampling parameter.

Returns:

A list of context-logits tensors, one per prompt.

Return type:

List[torch.Tensor]
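
For example, assuming an llm instance constructed as in the sketch above, context logits can be retrieved like this; the prompt text is made up for illustration:

    prompts = ["The capital of France is"]
    logits = llm.generate_context_logits(prompts, temperature=1.0)
    # One logits tensor per prompt, covering the prompt's input token positions.
    print(len(logits), logits[0].shape)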

generate_text(prompts, max_new_tokens, temperature=1.0, top_p=None, keep_input_prompt=True, stop_words=None)

Generates the text based on the input prompts.

Parameters:
  • prompts (Iterable[str] | Iterable[List[int]]) – The input prompts, given as a list of strings or a list of token-id lists.

  • max_new_tokens (int) – The maximum number of new tokens to generate.

  • temperature (float) – The sampling temperature.

  • top_p (float) – The nucleus sampling parameter.

  • keep_input_prompt (bool) – Whether to include the input prompts in the outputs.

  • stop_words (List[str]) – A list of stop words; generation halts when any of them is produced.

Returns:

A list of output text strings if max_beam_width is 1, or a 2D list of strings with shape [batch, beam] otherwise.

Return type:

List[str] | List[List[str]]
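
A usage sketch under the same assumptions as the constructor example above; the prompt and stop word are illustrative:

    outputs = llm.generate_text(
        prompts=["Write a haiku about GPUs."],
        max_new_tokens=64,
        temperature=0.8,
        keep_input_prompt=False,
        stop_words=["\n\n"],
    )
    print(outputs[0])  # a plain string when max_beam_width is 1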

generate_tokens(prompts, max_new_tokens, temperature=1.0, top_p=None, keep_input_prompt=True, stop_words=None)

Generates the tokens based on the input prompts.

Parameters:
  • prompts (Iterable[str] | Iterable[List[int]]) – The input prompts, given as a list of strings or a list of token-id lists.

  • max_new_tokens (int) – The maximum number of new tokens to generate.

  • temperature (float) – The sampling temperature.

  • top_p (float) – The nucleus sampling parameter.

  • keep_input_prompt (bool) – Whether to include the input prompts in the outputs.

  • stop_words (List[str]) – A list of stop words; generation halts when any of them is produced.

Returns:

A list of output token lists if max_beam_width is 1, or a 3D list with shape [batch, beam, sequence_len] otherwise.

Return type:

List[List[int]] | List[List[List[int]]]
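
A token-level sketch; the token ids below are hypothetical and only illustrate the pre-tokenized input form:

    token_lists = llm.generate_tokens(
        prompts=[[1, 3923, 374, 279]],  # pre-tokenized prompt (hypothetical ids)
        max_new_tokens=32,
        temperature=1.0,
        keep_input_prompt=True,
    )
    # With max_beam_width == 1: one output token list per prompt.
    print(token_lists[0][:8])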

property max_beam_width

Get the max beam width from the LLM instance.

property max_input_len

Get the max input length from the LLM instance.
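
These properties can be used to validate requests before generation; a brief sketch, with hypothetical token ids:

    # Guard against over-long inputs using the engine's limits (sketch).
    prompt_ids = [1, 2, 3]  # hypothetical token ids
    assert len(prompt_ids) <= llm.max_input_len, "prompt exceeds engine max_input_len"
    print("beam width:", llm.max_beam_width)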