API Reference#
- class tensorrt_llm.llmapi.LLM(
- model: str | Path,
- tokenizer: str | Path | TokenizerBase | PreTrainedTokenizerBase | None = None,
- tokenizer_mode: Literal['auto', 'slow'] = 'auto',
- skip_tokenizer_init: bool = False,
- trust_remote_code: bool = False,
- tensor_parallel_size: int = 1,
- dtype: str = 'auto',
- revision: str | None = None,
- tokenizer_revision: str | None = None,
- **kwargs: Any,
Bases:
object
The LLM class is the main class for running an LLM model.
- Parameters:
model (Union[str, pathlib.Path]) – The path to the model checkpoint or the model name from the Hugging Face Hub.
tokenizer (Union[str, pathlib.Path, transformers.tokenization_utils_base.PreTrainedTokenizerBase, tensorrt_llm.llmapi.tokenizer.TokenizerBase, NoneType]) – The path to the tokenizer checkpoint or the tokenizer name from the Hugging Face Hub. Defaults to None.
tokenizer_mode (Literal['auto', 'slow']) – The mode to initialize the tokenizer. Defaults to auto.
skip_tokenizer_init (bool) – Whether to skip the tokenizer initialization. Defaults to False.
trust_remote_code (bool) – Whether to trust the remote code. Defaults to False.
tensor_parallel_size (int) – The tensor parallel size. Defaults to 1.
dtype (str) – The data type to use for the model. Defaults to auto.
revision (Optional[str]) – The revision to use for the model. Defaults to None.
tokenizer_revision (Optional[str]) – The revision to use for the tokenizer. Defaults to None.
pipeline_parallel_size (int) – The pipeline parallel size. Defaults to 1.
context_parallel_size (int) – The context parallel size. Defaults to 1.
gpus_per_node (Optional[int]) – The number of GPUs per node. Defaults to None.
moe_cluster_parallel_size (Optional[int]) – The cluster parallel size for MoE models' expert weights. Defaults to None.
moe_tensor_parallel_size (Optional[int]) – The tensor parallel size for MoE models' expert weights. Defaults to None.
moe_expert_parallel_size (Optional[int]) – The expert parallel size for MoE models' expert weights. Defaults to None.
enable_attention_dp (bool) – Enable attention data parallel. Defaults to False.
cp_config (Optional[dict]) – Context parallel config. Defaults to None.
load_format (Literal['auto', 'dummy']) – The format to load the model. Defaults to auto.
enable_lora (bool) – Enable LoRA. Defaults to False.
lora_config (Optional[tensorrt_llm.lora_manager.LoraConfig]) – LoRA configuration for the model. Defaults to None.
enable_prompt_adapter (bool) – Enable prompt adapter. Defaults to False.
max_prompt_adapter_token (int) – The maximum number of prompt adapter tokens. Defaults to 0.
quant_config (Optional[tensorrt_llm.models.modeling_utils.QuantConfig]) – Quantization config. Defaults to None.
kv_cache_config (tensorrt_llm.llmapi.llm_args.KvCacheConfig) – KV cache config. Defaults to None.
enable_chunked_prefill (bool) – Enable chunked prefill. Defaults to False.
guided_decoding_backend (Optional[str]) – Guided decoding backend. Defaults to None.
batched_logits_processor (Optional[tensorrt_llm.sampling_params.BatchedLogitsProcessor]) – Batched logits processor. Defaults to None.
iter_stats_max_iterations (Optional[int]) – The maximum number of iterations for iter stats. Defaults to None.
request_stats_max_iterations (Optional[int]) – The maximum number of iterations for request stats. Defaults to None.
peft_cache_config (Optional[tensorrt_llm.llmapi.llm_args.PeftCacheConfig]) – PEFT cache config. Defaults to None.
scheduler_config (tensorrt_llm.llmapi.llm_args.SchedulerConfig) – Scheduler config. Defaults to None.
cache_transceiver_config (Optional[tensorrt_llm.llmapi.llm_args.CacheTransceiverConfig]) – Cache transceiver config. Defaults to None.
speculative_config (Union[tensorrt_llm.llmapi.llm_args.LookaheadDecodingConfig, tensorrt_llm.llmapi.llm_args.MedusaDecodingConfig, tensorrt_llm.llmapi.llm_args.EagleDecodingConfig, tensorrt_llm.llmapi.llm_args.MTPDecodingConfig, tensorrt_llm.llmapi.llm_args.NGramDecodingConfig, NoneType]) – Speculative decoding config. Defaults to None.
batching_type (Optional[tensorrt_llm.llmapi.llm_args.BatchingType]) – Batching type. Defaults to None.
normalize_log_probs (bool) – Normalize log probabilities. Defaults to False.
max_batch_size (Optional[int]) – The maximum batch size. Defaults to None.
max_input_len (int) – The maximum input length. Defaults to 1024.
max_seq_len (Optional[int]) – The maximum sequence length. Defaults to None.
max_beam_width (int) – The maximum beam width. Defaults to 1.
max_num_tokens (Optional[int]) – The maximum number of tokens. Defaults to None.
backend (Optional[str]) – The backend to use. Defaults to None.
gather_generation_logits (bool) – Gather generation logits. Defaults to False.
enable_tqdm (bool) – Enable tqdm for progress bar. Defaults to False.
build_config (Optional[tensorrt_llm.builder.BuildConfig]) – Build config. Defaults to None.
workspace (Optional[str]) – The workspace for the model. Defaults to None.
enable_build_cache (Union[tensorrt_llm.llmapi.build_cache.BuildCacheConfig, bool]) – Enable build cache. Defaults to False.
extended_runtime_perf_knob_config (Optional[tensorrt_llm.llmapi.llm_args.ExtendedRuntimePerfKnobConfig]) – Extended runtime perf knob config. Defaults to None.
calib_config (Optional[tensorrt_llm.llmapi.llm_args.CalibConfig]) – Calibration config. Defaults to None.
embedding_parallel_mode (str) – The embedding parallel mode. Defaults to SHARDING_ALONG_VOCAB.
fast_build (bool) – Enable fast build. Defaults to False.
kwargs (Any) – Advanced arguments passed to LlmArgs.
- tokenizer#
The tokenizer loaded by LLM instance, if any.
- Type:
tensorrt_llm.llmapi.tokenizer.TokenizerBase, optional
- workspace#
The directory to store intermediate files.
- Type:
pathlib.Path
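Example (a minimal sketch; the model name and parallelism degree are placeholders):
```python
from tensorrt_llm.llmapi import LLM

# Load a Hugging Face checkpoint and shard it across two GPUs.
# The model name and tensor_parallel_size are illustrative.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    dtype="auto",
)
print(llm.tokenizer)   # tokenizer loaded by the LLM instance, if any
print(llm.workspace)   # directory used for intermediate files
```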
- __init__(
- model: str | Path,
- tokenizer: str | Path | TokenizerBase | PreTrainedTokenizerBase | None = None,
- tokenizer_mode: Literal['auto', 'slow'] = 'auto',
- skip_tokenizer_init: bool = False,
- trust_remote_code: bool = False,
- tensor_parallel_size: int = 1,
- dtype: str = 'auto',
- revision: str | None = None,
- tokenizer_revision: str | None = None,
- **kwargs: Any,
- generate(
- inputs: str | List[int] | TextPrompt | TokensPrompt | Sequence[str | List[int] | TextPrompt | TokensPrompt],
- sampling_params: SamplingParams | List[SamplingParams] | None = None,
- use_tqdm: bool = True,
- lora_request: LoRARequest | Sequence[LoRARequest] | None = None,
- prompt_adapter_request: PromptAdapterRequest | Sequence[PromptAdapterRequest] | None = None,
- kv_cache_retention_config: KvCacheRetentionConfig | Sequence[KvCacheRetentionConfig] | None = None,
- disaggregated_params: DisaggregatedParams | Sequence[DisaggregatedParams] | None = None,
Generate outputs for the given prompts in synchronous mode. Synchronous generation accepts either a single prompt or batched prompts.
- Parameters:
inputs (tensorrt_llm.inputs.data.PromptInputs, Sequence[tensorrt_llm.inputs.data.PromptInputs]) – The prompt text or token ids. It can be single prompt or batched prompts.
sampling_params (tensorrt_llm.sampling_params.SamplingParams, List[tensorrt_llm.sampling_params.SamplingParams], optional) – The sampling params for the generation. Defaults to None. A default one will be used if not provided.
use_tqdm (bool) – Whether to use tqdm to display the progress bar. Defaults to True.
lora_request (tensorrt_llm.executor.request.LoRARequest, Sequence[tensorrt_llm.executor.request.LoRARequest], optional) – LoRA request to use for generation, if any. Defaults to None.
prompt_adapter_request (tensorrt_llm.executor.request.PromptAdapterRequest, Sequence[tensorrt_llm.executor.request.PromptAdapterRequest], optional) – Prompt Adapter request to use for generation, if any. Defaults to None.
kv_cache_retention_config (tensorrt_llm.bindings.executor.KvCacheRetentionConfig, Sequence[tensorrt_llm.bindings.executor.KvCacheRetentionConfig], optional) – Configuration for the request’s retention in the KV Cache. Defaults to None.
disaggregated_params (tensorrt_llm.disaggregated_params.DisaggregatedParams, Sequence[tensorrt_llm.disaggregated_params.DisaggregatedParams], optional) – Disaggregated parameters. Defaults to None.
- Returns:
The output data of the completion request to the LLM.
- Return type:
Union[tensorrt_llm.llmapi.RequestOutput, List[tensorrt_llm.llmapi.RequestOutput]]
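Example (a hedged sketch of synchronous batched generation; the model and prompts are placeholders):
```python
from tensorrt_llm.llmapi import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(max_tokens=32, temperature=0.8)

# generate() blocks until all prompts are finished and returns one
# RequestOutput per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```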
- generate_async(
- inputs: str | List[int] | TextPrompt | TokensPrompt,
- sampling_params: SamplingParams | None = None,
- lora_request: LoRARequest | None = None,
- prompt_adapter_request: PromptAdapterRequest | None = None,
- streaming: bool = False,
- kv_cache_retention_config: KvCacheRetentionConfig | None = None,
- disaggregated_params: DisaggregatedParams | None = None,
- _postproc_params: PostprocParams | None = None,
Generate output for the given prompt in asynchronous mode. Asynchronous generation accepts a single prompt only.
- Parameters:
inputs (tensorrt_llm.inputs.data.PromptInputs) – The prompt text or token ids; it must be a single prompt.
sampling_params (tensorrt_llm.sampling_params.SamplingParams, optional) – The sampling params for the generation. Defaults to None. A default one will be used if not provided.
lora_request (tensorrt_llm.executor.request.LoRARequest, optional) – LoRA request to use for generation, if any. Defaults to None.
prompt_adapter_request (tensorrt_llm.executor.request.PromptAdapterRequest, optional) – Prompt Adapter request to use for generation, if any. Defaults to None.
streaming (bool) – Whether to use the streaming mode for the generation. Defaults to False.
kv_cache_retention_config (tensorrt_llm.bindings.executor.KvCacheRetentionConfig, optional) – Configuration for the request’s retention in the KV Cache. Defaults to None.
disaggregated_params (tensorrt_llm.disaggregated_params.DisaggregatedParams, optional) – Disaggregated parameters. Defaults to None.
- Returns:
The output data of the completion request to the LLM.
- Return type:
tensorrt_llm.llmapi.RequestOutput
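Example (a minimal sketch of asynchronous, streamed generation; the model and prompt are placeholders):
```python
import asyncio

from tensorrt_llm.llmapi import LLM, SamplingParams

async def main() -> None:
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    sampling_params = SamplingParams(max_tokens=64)
    # With streaming=True, the returned object can be iterated asynchronously
    # and yields partial results as new tokens are produced.
    async for output in llm.generate_async(
            "Deep learning is", sampling_params, streaming=True):
        print(output.outputs[0].text_diff, end="", flush=True)

asyncio.run(main())
```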
- get_kv_cache_events(
- timeout: float | None = 2,
Get iteration KV events from the runtime.
- KV events are used to track changes and operations within the KV Cache. Types of events:
KVCacheCreatedData: Indicates the creation of cache blocks.
KVCacheStoredData: Represents a sequence of stored blocks.
KVCacheRemovedData: Contains the hashes of blocks that are being removed from the cache.
KVCacheUpdatedData: Captures updates to existing cache blocks.
- To enable KV events:
set event_buffer_max_size to a positive integer in the KvCacheConfig.
set enable_block_reuse to True in the KvCacheConfig.
- Parameters:
timeout (float, optional) – Max wait time in seconds when retrieving events from queue. Defaults to 2.
- Returns:
A list of runtime events as dict.
- Return type:
List[dict]
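Example (a hedged sketch; both KvCacheConfig settings below are required for events to be recorded, and the buffer size is illustrative):
```python
from tensorrt_llm.llmapi import LLM, KvCacheConfig

kv_cache_config = KvCacheConfig(enable_block_reuse=True, event_buffer_max_size=1024)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", kv_cache_config=kv_cache_config)

llm.generate(["Hello, world!"])
# Each entry is a dict describing a created/stored/removed/updated event.
for event in llm.get_kv_cache_events(timeout=2):
    print(event)
```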
- get_kv_cache_events_async(
- timeout: float | None = 2,
Get iteration KV events from the runtime.
- KV events are used to track changes and operations within the KV Cache. Types of events:
KVCacheCreatedData: Indicates the creation of cache blocks.
KVCacheStoredData: Represents a sequence of stored blocks.
KVCacheRemovedData: Contains the hashes of blocks that are being removed from the cache.
KVCacheUpdatedData: Captures updates to existing cache blocks.
- To enable KV events:
set event_buffer_max_size to a positive integer in the KvCacheConfig.
set enable_block_reuse to True in the KvCacheConfig.
- Parameters:
timeout (float, optional) – Max wait time in seconds when retrieving events from queue. Defaults to 2.
- Returns:
An async iterable object containing runtime events.
- Return type:
tensorrt_llm.executor.result.IterationResult
- get_stats(timeout: float | None = 2) List[dict] [source]#
Get iteration statistics from the runtime. To collect statistics, call this function after prompts have been submitted with LLM().generate().
- Parameters:
timeout (float, optional) – Max wait time in seconds when retrieving stats from queue. Defaults to 2.
- Returns:
A list of runtime stats as dict, e.g., ['{"cpuMemUsage": …, "iter": 0, …}', '{"cpuMemUsage": …, "iter": 1, …}']
- Return type:
List[dict]
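Example (a minimal sketch, assuming an existing llm instance; depending on the backend, iteration statistics may also need to be enabled explicitly, e.g. via enable_iter_perf_stats):
```python
# Submit work first; stats are only recorded for iterations that actually ran.
llm.generate(["Hello, world!"])
for stats in llm.get_stats(timeout=2):
    print(stats)
```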
- get_stats_async(
- timeout: float | None = 2,
Get iteration statistics from the runtime. To collect statistics, you can call this function in an async coroutine or the /metrics endpoint (if you’re using trtllm-serve) after prompts have been submitted.
- Parameters:
timeout (float, optional) – Max wait time in seconds when retrieving stats from queue. Defaults to 2.
- Returns:
An async iterable object containing runtime stats.
- Return type:
tensorrt_llm.executor.result.IterationResult
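Example (a hedged sketch of consuming stats from a coroutine while requests are in flight):
```python
async def watch_stats(llm) -> None:
    # The returned IterationResult is an async iterable of runtime stats.
    async for stats in llm.get_stats_async(timeout=2):
        print(stats)
```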
- save(engine_dir: str) None [source]#
Save the built engine to the given path.
- Parameters:
engine_dir (str) – The path to save the engine.
- property tokenizer: TokenizerBase | None#
- property workspace: Path#
- class tensorrt_llm.llmapi.CompletionOutput(
- index: int,
- text: str = '',
- token_ids: ~typing.List[int] | None = <factory>,
- cumulative_logprob: float | None = None,
- logprobs: list[dict[int,
- ~tensorrt_llm.executor.result.Logprob]] | None = <factory>,
- prompt_logprobs: list[dict[int,
- ~tensorrt_llm.executor.result.Logprob]] | None = <factory>,
- finish_reason: ~typing.Literal['stop',
- 'length',
- 'timeout',
- 'cancelled'] | None = None,
- stop_reason: int | str | None = None,
- generation_logits: ~torch.Tensor | None = None,
- disaggregated_params: ~tensorrt_llm.disaggregated_params.DisaggregatedParams | None = None,
- _postprocess_result: ~typing.Any = None,
Bases:
object
The output data of one completion of a request.
- Parameters:
index (int) – The index of the output in the request.
text (str) – The generated output text. Defaults to “”.
token_ids (List[int], optional) – The token ids of the generated output text. Defaults to [].
cumulative_logprob (float, optional) – The cumulative log probability of the generated output text. Defaults to None.
logprobs (TokenLogprobs, optional) – The log probabilities of the top probability words at each position if the logprobs are requested. Defaults to None.
prompt_logprobs (TokenLogprobs, optional) – The log probabilities per prompt token. Defaults to None.
finish_reason (Literal['stop', 'length', 'timeout', 'cancelled'], optional) – The reason why the sequence is finished. Defaults to None.
stop_reason (int, str, optional) – The stop string or token id that caused the completion to stop, None if the completion finished for some other reason. Defaults to None.
generation_logits (torch.Tensor, optional) – The logits on the generated output token ids. Defaults to None.
disaggregated_params (tensorrt_llm.disaggregated_params.DisaggregatedParams, optional) – Parameters needed for disaggregated serving. Includes the type of request, the first generated tokens, the context request id, and any additional state that needs to be transferred between context and generation instances. Defaults to None.
- length#
The number of generated tokens.
- Type:
int
- token_ids_diff#
Newly generated token ids.
- Type:
List[int]
- logprobs_diff#
Logprobs of newly generated tokens.
- Type:
List[float]
- text_diff#
Newly generated text.
- Type:
str
- __init__(
- index: int,
- text: str = '',
- token_ids: ~typing.List[int] | None = <factory>,
- cumulative_logprob: float | None = None,
- logprobs: list[dict[int,
- ~tensorrt_llm.executor.result.Logprob]] | None = <factory>,
- prompt_logprobs: list[dict[int,
- ~tensorrt_llm.executor.result.Logprob]] | None = <factory>,
- finish_reason: ~typing.Literal['stop',
- 'length',
- 'timeout',
- 'cancelled'] | None = None,
- stop_reason: int | str | None = None,
- generation_logits: ~torch.Tensor | None = None,
- disaggregated_params: ~tensorrt_llm.disaggregated_params.DisaggregatedParams | None = None,
- _postprocess_result: ~typing.Any = None,
- cumulative_logprob: float | None#
- disaggregated_params: DisaggregatedParams | None#
- finish_reason: Literal['stop', 'length', 'timeout', 'cancelled'] | None#
- generation_logits: Tensor | None#
- index: int#
- property length: int#
- logprobs: list[dict[int, Logprob]] | None#
- property logprobs_diff: List[float]#
- prompt_logprobs: list[dict[int, Logprob]] | None#
- stop_reason: int | str | None#
- text: str#
- property text_diff: str#
- token_ids: List[int] | None#
- property token_ids_diff: List[int]#
- class tensorrt_llm.llmapi.RequestOutput[source]#
Bases:
DetokenizedGenerationResultBase
,GenerationResult
The output data of a completion request to the LLM.
- request_id#
The unique ID of the request.
- Type:
int
- prompt#
The prompt string of the request.
- Type:
str, optional
- prompt_token_ids#
The token ids of the prompt.
- Type:
List[int]
- outputs#
The output sequences of the request.
- Type:
List[CompletionOutput]
- context_logits#
The logits on the prompt token ids.
- Type:
torch.Tensor, optional
- finished#
Whether the whole request is finished.
- Type:
bool
- property prompt: str | None#
- class tensorrt_llm.llmapi.GuidedDecodingParams(
- *,
- json: str | BaseModel | dict | None = None,
- regex: str | None = None,
- grammar: str | None = None,
- json_object: bool = False,
- structural_tag: str | None = None,
Bases:
object
Guided decoding parameters for text generation. Only one of the fields can be effective at a time.
- Parameters:
json (str, pydantic.main.BaseModel, dict, optional) – The generated text will conform to JSON format, constrained by the user-specified schema. Defaults to None.
regex (str, optional) – The generated text will match the user-specified regular expression. Defaults to None.
grammar (str, optional) – The generated text will follow the user-specified extended Backus-Naur form (EBNF) grammar. Defaults to None.
json_object (bool) – If True, the generated text will conform to JSON format (without a specific schema). Defaults to False.
structural_tag (str, optional) – The generated text will follow the user-specified structural tag. Defaults to None.
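Example (a hedged sketch of JSON-schema guided decoding; the schema is illustrative, and the LLM must also be constructed with a guided_decoding_backend for these parameters to take effect):
```python
from pydantic import BaseModel

from tensorrt_llm.llmapi import GuidedDecodingParams, SamplingParams

class Answer(BaseModel):
    city: str
    population: int

# Constrain generation to JSON matching the schema above. Only one of the
# GuidedDecodingParams fields should be set at a time.
guided = GuidedDecodingParams(json=Answer.model_json_schema())
sampling_params = SamplingParams(max_tokens=64, guided_decoding=guided)
```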
- __init__(
- *,
- json: str | BaseModel | dict | None = None,
- regex: str | None = None,
- grammar: str | None = None,
- json_object: bool = False,
- structural_tag: str | None = None,
- grammar: str | None#
- json: str | BaseModel | dict | None#
- json_object: bool#
- regex: str | None#
- structural_tag: str | None#
- class tensorrt_llm.llmapi.SamplingParams(
- *,
- end_id: int | None = None,
- pad_id: int | None = None,
- max_tokens: int = 32,
- bad: str | List[str] | None = None,
- bad_token_ids: List[int] | None = None,
- stop: str | List[str] | None = None,
- stop_token_ids: List[int] | None = None,
- include_stop_str_in_output: bool = False,
- embedding_bias: Tensor | None = None,
- logits_processor: LogitsProcessor | List[LogitsProcessor] | None = None,
- apply_batched_logits_processor: bool = False,
- n: int = 1,
- best_of: int | None = None,
- use_beam_search: bool = False,
- top_k: int | None = None,
- top_p: float | None = None,
- top_p_min: float | None = None,
- top_p_reset_ids: int | None = None,
- top_p_decay: float | None = None,
- seed: int | None = None,
- temperature: float | None = None,
- min_tokens: int | None = None,
- beam_search_diversity_rate: float | None = None,
- repetition_penalty: float | None = None,
- presence_penalty: float | None = None,
- frequency_penalty: float | None = None,
- length_penalty: float | None = None,
- early_stopping: int | None = None,
- no_repeat_ngram_size: int | None = None,
- min_p: float | None = None,
- beam_width_array: List[int] | None = None,
- logprobs: int | None = None,
- prompt_logprobs: int | None = None,
- return_context_logits: bool = False,
- return_generation_logits: bool = False,
- exclude_input_from_output: bool = True,
- return_encoder_output: bool = False,
- return_perf_metrics: bool = False,
- additional_model_outputs: List[AdditionalModelOutput] | None = None,
- _context_logits_auto_enabled: bool = False,
- _generation_logits_auto_enabled: bool = False,
- _return_log_probs: bool = False,
- lookahead_config: LookaheadDecodingConfig | None = None,
- guided_decoding: GuidedDecodingParams | None = None,
- ignore_eos: bool = False,
- detokenize: bool = True,
- add_special_tokens: bool = True,
- truncate_prompt_tokens: int | None = None,
- skip_special_tokens: bool = True,
- spaces_between_special_tokens: bool = True,
Bases:
object
Sampling parameters for text generation.
Usage Examples:
- use_beam_search is False:
best_of is None: perform (top-p/top-k) sampling with n responses and return all n generations
best_of is not None: perform (top-p/top-k) sampling with best_of responses and return n generations (best_of >= n must hold)
- use_beam_search is True:
best_of is None: perform beam search with a beam width of n and return n generations
best_of is not None: perform beam search with a beam width of best_of and return n generations (best_of >= n must hold)
- Parameters:
end_id (int, optional) – The end token id. Defaults to None.
pad_id (int, optional) – The pad token id. Defaults to None.
max_tokens (int) – The maximum number of tokens to generate. Defaults to 32.
bad (str, List[str], optional) – A string or a list of strings that redirect the generation when they are generated, so that the bad strings are excluded from the returned output. Defaults to None.
bad_token_ids (List[int], optional) – A list of token ids that redirect the generation when they are generated, so that the bad ids are excluded from the returned output. Defaults to None.
stop (str, List[str], optional) – A string or a list of strings that stop the generation when they are generated. The returned output will not contain the stop strings unless include_stop_str_in_output is True. Defaults to None.
stop_token_ids (List[int], optional) – A list of token ids that stop the generation when they are generated. Defaults to None.
include_stop_str_in_output (bool) – Whether to include the stop strings in output text. Defaults to False.
embedding_bias (torch.Tensor, optional) – The embedding bias tensor. Expected type is kFP32 and shape is [vocab_size]. Defaults to None.
logits_processor (tensorrt_llm.sampling_params.LogitsProcessor, List[tensorrt_llm.sampling_params.LogitsProcessor], optional) – The logits postprocessor callback(s). Defaults to None. If a list, each processor is applied in order during generation (supported in PyTorch backend only).
apply_batched_logits_processor (bool) – Whether to apply batched logits postprocessor callback. Defaults to False. The BatchedLogitsProcessor class is recommended for callback creation. The callback must be provided when initializing LLM.
n (int) – Number of sequences to generate. Defaults to 1.
best_of (int, optional) – Number of sequences to consider for best output. Defaults to None.
use_beam_search (bool) – Whether to use beam search. Defaults to False.
top_k (int, optional) – Controls number of logits to sample from. None means using C++ runtime default 0, i.e., all logits. Defaults to None.
top_p (float, optional) – Controls the top-P probability to sample from. None means using C++ runtime default 0.f. Defaults to None.
top_p_min (float, optional) – Controls decay in the top-P algorithm. topPMin is lower-bound. None means using C++ runtime default 1.e-6. Defaults to None.
top_p_reset_ids (int, optional) – Controls decay in the top-P algorithm. Indicates where to reset the decay. None means using C++ runtime default 1. Defaults to None.
top_p_decay (float, optional) – Controls decay in the top-P algorithm. The decay value. None means using C++ runtime default 1.f. Defaults to None.
seed (int, optional) – Controls the random seed used by the random number generator in sampling. None means using C++ runtime default 0. Defaults to None.
temperature (float, optional) – Controls the modulation of logits when sampling new tokens. It can have values > 0.f. None means using C++ runtime default 1.0f. Defaults to None.
min_tokens (int, optional) – Lower bound on the number of tokens to generate. Values < 1 have no effect. None means using C++ runtime default 1. Defaults to None.
beam_search_diversity_rate (float, optional) – Controls the diversity of the candidates in beam search. None means using the C++ runtime default. Defaults to None.
repetition_penalty (float, optional) – Used to penalize tokens based on how often they appear in the sequence. It can have any value > 0.f. Values < 1.f encourages repetition, values > 1.f discourages it. None means using C++ runtime default 1.f. Defaults to None.
presence_penalty (float, optional) – Used to penalize tokens already present in the sequence (irrespective of the number of appearances). It can have any values. Values < 0.f encourage repetition, values > 0.f discourage it. None means using C++ runtime default 0.f. Defaults to None.
frequency_penalty (float, optional) – Used to penalize tokens already present in the sequence (dependent on the number of appearances). It can have any values. Values < 0.f encourage repetition, values > 0.f discourage it. None means using C++ runtime default 0.f. Defaults to None.
length_penalty (float, optional) – Controls how to penalize longer sequences in beam search. None means using C++ runtime default 0.f. Defaults to None.
early_stopping (int, optional) – Controls whether the generation process finishes once beamWidth sentences are generated (ends with end_token). None means using C++ runtime default 1. Defaults to None.
no_repeat_ngram_size (int, optional) – The size of n-grams that are not allowed to repeat in the output. None means using C++ runtime default 1 << 30. Defaults to None.
min_p (float, optional) – Scales the probability of the most likely token to determine the minimum token probability. None means using C++ runtime default 0.0. Defaults to None.
beam_width_array (List[int], optional) – The array of beam widths used in Variable-Beam-Width-Search. Defaults to None.
logprobs (int, optional) – Number of log probabilities to return per output token. Defaults to None.
prompt_logprobs (int, optional) – Number of log probabilities to return per prompt token. Defaults to None.
return_context_logits (bool) – Controls if Result should contain the context logits. Defaults to False.
return_generation_logits (bool) – Controls if Result should contain the generation logits. Defaults to False.
exclude_input_from_output (bool) – Controls if output tokens in Result should include the input tokens. Defaults to True.
return_encoder_output (bool) – Controls if Result should contain encoder output hidden states (for encoder-only and encoder-decoder models). Defaults to False.
return_perf_metrics (bool) – Controls if Result should contain the performance metrics for this request. Defaults to False.
additional_model_outputs (List[tensorrt_llm.sampling_params.AdditionalModelOutput], optional) – The additional outputs to gather from the model. Defaults to None.
lookahead_config (tensorrt_llm.bindings.executor.LookaheadDecodingConfig , optional) – Lookahead decoding config. Defaults to None.
guided_decoding (tensorrt_llm.sampling_params.GuidedDecodingParams, optional) – Guided decoding params. Defaults to None.
ignore_eos (bool) – Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. Defaults to False.
detokenize (bool) – Whether to detokenize the output. Defaults to True.
add_special_tokens (bool) – Whether to add special tokens to the prompt. Defaults to True.
truncate_prompt_tokens (int, optional) – If set to an integer k, will use only the last k tokens from the prompt (i.e., left truncation). Defaults to None.
skip_special_tokens (bool) – Whether to skip special tokens in the output. Defaults to True.
spaces_between_special_tokens (bool) – Whether to add spaces between special tokens in the output. Defaults to True.
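Example (a minimal sketch of the n/best_of semantics described above; all values are illustrative):
```python
from tensorrt_llm.llmapi import SamplingParams

# Top-p/top-k sampling: draw best_of=4 candidate responses and return n=2 generations.
sampling_params = SamplingParams(
    max_tokens=64,
    temperature=0.8,
    top_p=0.95,
    n=2,
    best_of=4,
)

# Beam search: the beam width is best_of (or n when best_of is None).
beam_params = SamplingParams(max_tokens=64, use_beam_search=True, n=2, best_of=4)
```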
- __init__(
- *,
- end_id: int | None = None,
- pad_id: int | None = None,
- max_tokens: int = 32,
- bad: str | List[str] | None = None,
- bad_token_ids: List[int] | None = None,
- stop: str | List[str] | None = None,
- stop_token_ids: List[int] | None = None,
- include_stop_str_in_output: bool = False,
- embedding_bias: Tensor | None = None,
- logits_processor: LogitsProcessor | List[LogitsProcessor] | None = None,
- apply_batched_logits_processor: bool = False,
- n: int = 1,
- best_of: int | None = None,
- use_beam_search: bool = False,
- top_k: int | None = None,
- top_p: float | None = None,
- top_p_min: float | None = None,
- top_p_reset_ids: int | None = None,
- top_p_decay: float | None = None,
- seed: int | None = None,
- temperature: float | None = None,
- min_tokens: int | None = None,
- beam_search_diversity_rate: float | None = None,
- repetition_penalty: float | None = None,
- presence_penalty: float | None = None,
- frequency_penalty: float | None = None,
- length_penalty: float | None = None,
- early_stopping: int | None = None,
- no_repeat_ngram_size: int | None = None,
- min_p: float | None = None,
- beam_width_array: List[int] | None = None,
- logprobs: int | None = None,
- prompt_logprobs: int | None = None,
- return_context_logits: bool = False,
- return_generation_logits: bool = False,
- exclude_input_from_output: bool = True,
- return_encoder_output: bool = False,
- return_perf_metrics: bool = False,
- additional_model_outputs: List[AdditionalModelOutput] | None = None,
- _context_logits_auto_enabled: bool = False,
- _generation_logits_auto_enabled: bool = False,
- _return_log_probs: bool = False,
- lookahead_config: LookaheadDecodingConfig | None = None,
- guided_decoding: GuidedDecodingParams | None = None,
- ignore_eos: bool = False,
- detokenize: bool = True,
- add_special_tokens: bool = True,
- truncate_prompt_tokens: int | None = None,
- skip_special_tokens: bool = True,
- spaces_between_special_tokens: bool = True,
- add_special_tokens: bool#
- additional_model_outputs: List[AdditionalModelOutput] | None#
- apply_batched_logits_processor: bool#
- bad: str | List[str] | None#
- bad_token_ids: List[int] | None#
- beam_search_diversity_rate: float | None#
- beam_width_array: List[int] | None#
- best_of: int | None#
- detokenize: bool#
- early_stopping: int | None#
- embedding_bias: Tensor | None#
- end_id: int | None#
- exclude_input_from_output: bool#
- frequency_penalty: float | None#
- guided_decoding: GuidedDecodingParams | None#
- ignore_eos: bool#
- include_stop_str_in_output: bool#
- length_penalty: float | None#
- logits_processor: LogitsProcessor | List[LogitsProcessor] | None#
- logprobs: int | None#
- lookahead_config: LookaheadDecodingConfig | None#
- max_tokens: int#
- min_p: float | None#
- min_tokens: int | None#
- n: int#
- no_repeat_ngram_size: int | None#
- pad_id: int | None#
- presence_penalty: float | None#
- prompt_logprobs: int | None#
- repetition_penalty: float | None#
- return_context_logits: bool#
- return_encoder_output: bool#
- return_generation_logits: bool#
- return_perf_metrics: bool#
- seed: int | None#
- skip_special_tokens: bool#
- spaces_between_special_tokens: bool#
- stop: str | List[str] | None#
- stop_token_ids: List[int] | None#
- temperature: float | None#
- top_k: int | None#
- top_p: float | None#
- top_p_decay: float | None#
- top_p_min: float | None#
- top_p_reset_ids: int | None#
- truncate_prompt_tokens: int | None#
- use_beam_search: bool#
- class tensorrt_llm.llmapi.DisaggregatedParams(
- *,
- request_type: str | None = None,
- first_gen_tokens: List[int] | None = None,
- ctx_request_id: int | None = None,
- opaque_state: bytes | None = None,
- draft_tokens: List[int] | None = None,
Bases:
object
Disaggregated serving parameters.
- Parameters:
request_type (str) – The type of request (“context_only” or “generation_only”)
first_gen_tokens (List[int]) – The first tokens of the generation request
ctx_request_id (int) – The context request id
opaque_state (bytes) – Any additional state that needs to be exchanged between context and generation instances
- __init__(
- *,
- request_type: str | None = None,
- first_gen_tokens: List[int] | None = None,
- ctx_request_id: int | None = None,
- opaque_state: bytes | None = None,
- draft_tokens: List[int] | None = None,
- ctx_request_id: int | None#
- draft_tokens: List[int] | None#
- first_gen_tokens: List[int] | None#
- opaque_state: bytes | None#
- request_type: str | None#
- class tensorrt_llm.llmapi.KvCacheConfig(
- *,
- enable_block_reuse: bool = True,
- max_tokens: int | None = None,
- max_attention_window: List[int] | None = None,
- sink_token_length: int | None = None,
- free_gpu_memory_fraction: float | None = None,
- host_cache_size: int | None = None,
- onboard_blocks: bool = True,
- cross_kv_cache_fraction: float | None = None,
- secondary_offload_min_priority: int | None = None,
- event_buffer_max_size: int = 0,
- enable_partial_reuse: bool = True,
- copy_on_partial_reuse: bool = True,
Bases:
BaseModel
,PybindMirror
Configuration for the KV cache.
- field copy_on_partial_reuse: bool = True#
Whether partially matched blocks that are in use can be reused after copying them.
- field cross_kv_cache_fraction: float | None = None#
The fraction of KV cache memory that should be reserved for cross attention. If set to p, self attention will use 1-p of the KV cache memory and cross attention will use p. Default is 50%. Should only be set when using an encoder-decoder model.
- field enable_block_reuse: bool = True#
Controls if KV cache blocks can be reused for different requests.
- field enable_partial_reuse: bool = True#
Whether blocks that are only partially matched can be reused.
- field event_buffer_max_size: int = 0#
Maximum size of the event buffer. If set to 0, the event buffer will not be used.
- field free_gpu_memory_fraction: float | None = None#
The fraction of GPU memory that should be allocated for the KV cache. Default is 90%. If both max_tokens and free_gpu_memory_fraction are specified, memory corresponding to the minimum will be used.
- field host_cache_size: int | None = None#
Size of the host cache in bytes. If both max_tokens and host_cache_size are specified, memory corresponding to the minimum will be used.
- field max_attention_window: List[int] | None = None#
Size of the attention window for each sequence. Only the last tokens will be stored in the KV cache. If the number of elements in max_attention_window is less than the number of layers, the values are repeated cyclically to cover all layers.
- field max_tokens: int | None = None#
The maximum number of tokens that should be stored in the KV cache. If both max_tokens and free_gpu_memory_fraction are specified, memory corresponding to the minimum will be used.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field onboard_blocks: bool = True#
Controls if blocks are onboarded.
- field secondary_offload_min_priority: int | None = None#
Only blocks with priority > secondary_offload_min_priority can be offloaded to secondary memory.
- field sink_token_length: int | None = None#
Number of sink tokens (tokens to always keep in attention window).
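Example (a hedged sketch of a common KV cache configuration; the memory fraction is illustrative):
```python
from tensorrt_llm.llmapi import LLM, KvCacheConfig

# Reserve up to 80% of free GPU memory for the KV cache and reuse blocks
# across requests that share a common prefix.
kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.8,
    enable_block_reuse=True,
)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", kv_cache_config=kv_cache_config)
```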
- class tensorrt_llm.llmapi.KvCacheRetentionConfig#
Bases:
pybind11_object
- class TokenRangeRetentionConfig#
Bases:
pybind11_object
- __init__(
- self: tensorrt_llm.bindings.executor.KvCacheRetentionConfig.TokenRangeRetentionConfig,
- token_start: int,
- token_end: int | None,
- priority: int,
- duration_ms: datetime.timedelta | None = None,
- property duration_ms#
- property priority#
- property token_end#
- property token_start#
- __init__(
- self: tensorrt_llm.bindings.executor.KvCacheRetentionConfig,
- token_range_retention_configs: list[tensorrt_llm.bindings.executor.KvCacheRetentionConfig.TokenRangeRetentionConfig],
- decode_retention_priority: int = 35,
- decode_duration_ms: datetime.timedelta | None = None,
- transfer_mode: tensorrt_llm.bindings.executor.KvCacheTransferMode = DRAM,
- directory: str | None = None,
- property decode_duration_ms#
- property decode_retention_priority#
- property directory#
- property token_range_retention_configs#
- property transfer_mode#
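Example (a hedged sketch, assuming an existing llm and prompts; the token range, priority, and duration values are illustrative):
```python
from datetime import timedelta

from tensorrt_llm.llmapi import KvCacheRetentionConfig

# Keep the first 64 prompt tokens at elevated priority for 30 seconds;
# decode blocks use the default retention priority.
retention = KvCacheRetentionConfig(
    [KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 64, 80, timedelta(seconds=30))],
    decode_retention_priority=35,
)
outputs = llm.generate(prompts, kv_cache_retention_config=retention)
```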
- class tensorrt_llm.llmapi.LookaheadDecodingConfig(
- *,
- max_draft_len: int | None = None,
- speculative_model: str | Path | None = None,
- max_window_size: int = 4,
- max_ngram_size: int = 3,
- max_verification_set_size: int = 4,
Bases:
DecodingBaseConfig
,PybindMirror
Configuration for lookahead speculative decoding.
- __init__(**data)[source]#
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- decoding_type: ClassVar[str] = 'Lookahead'#
- field max_ngram_size: int = 3#
Number of tokens per NGram.
- Validated by:
validate_positive_values
- field max_verification_set_size: int = 4#
Number of NGrams in verification branch per step.
- Validated by:
validate_positive_values
- field max_window_size: int = 4#
Number of NGrams in lookahead branch per step.
- Validated by:
validate_positive_values
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
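Example (a hedged sketch of enabling lookahead speculative decoding; the model name and window/ngram sizes are illustrative):
```python
from tensorrt_llm.llmapi import LLM, LookaheadDecodingConfig

speculative_config = LookaheadDecodingConfig(
    max_window_size=4,
    max_ngram_size=3,
    max_verification_set_size=4,
)
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    speculative_config=speculative_config,
)
```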
- class tensorrt_llm.llmapi.MedusaDecodingConfig(
- *,
- max_draft_len: int | None = None,
- speculative_model: str | Path | None = None,
- medusa_choices: List[List[int]] | None = None,
- num_medusa_heads: int | None = None,
Bases:
DecodingBaseConfig
- decoding_type: ClassVar[str] = 'Medusa'#
- field medusa_choices: List[List[int]] | None = None#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field num_medusa_heads: int | None = None#
- class tensorrt_llm.llmapi.EagleDecodingConfig(
- *,
- max_draft_len: int | None = None,
- speculative_model: str | Path | None = None,
- eagle_choices: List[List[int]] | None = None,
- greedy_sampling: bool | None = True,
- posterior_threshold: float | None = None,
- use_dynamic_tree: bool | None = False,
- dynamic_tree_max_topK: int | None = None,
- num_eagle_layers: int | None = None,
- max_non_leaves_per_layer: int | None = None,
- pytorch_eagle_weights_path: str | None = None,
- eagle3_one_model: bool | None = True,
Bases:
DecodingBaseConfig
- decoding_type: ClassVar[str] = 'Eagle'#
- field dynamic_tree_max_topK: int | None = None#
- field eagle3_one_model: bool | None = True#
- field eagle_choices: List[List[int]] | None = None#
- field greedy_sampling: bool | None = True#
- field max_non_leaves_per_layer: int | None = None#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field num_eagle_layers: int | None = None#
- field posterior_threshold: float | None = None#
- field pytorch_eagle_weights_path: str | None = None#
- field use_dynamic_tree: bool | None = False#
- class tensorrt_llm.llmapi.MTPDecodingConfig(
- *,
- max_draft_len: int | None = None,
- speculative_model: str | Path | None = None,
- num_nextn_predict_layers: int | None = 1,
- use_relaxed_acceptance_for_thinking: bool | None = False,
- relaxed_topk: int | None = 1,
- relaxed_delta: float | None = 0.0,
Bases:
DecodingBaseConfig
- decoding_type: ClassVar[str] = 'MTP'#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field num_nextn_predict_layers: int | None = 1#
- field relaxed_delta: float | None = 0.0#
- field relaxed_topk: int | None = 1#
- field use_relaxed_acceptance_for_thinking: bool | None = False#
- class tensorrt_llm.llmapi.SchedulerConfig(
- *,
- capacity_scheduler_policy: CapacitySchedulerPolicy = CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
- context_chunking_policy: ContextChunkingPolicy | None = None,
- dynamic_batch_config: DynamicBatchConfig | None = None,
Bases:
BaseModel
,PybindMirror
- field capacity_scheduler_policy: CapacitySchedulerPolicy = CapacitySchedulerPolicy.GUARANTEED_NO_EVICT#
The capacity scheduler policy to use
- field context_chunking_policy: ContextChunkingPolicy | None = None#
The context chunking policy to use
- field dynamic_batch_config: DynamicBatchConfig | None = None#
The dynamic batch config to use
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class tensorrt_llm.llmapi.CapacitySchedulerPolicy(
- value,
- names=<not given>,
- *values,
- module=None,
- qualname=None,
- type=None,
- start=1,
- boundary=None,
Bases:
StrEnum
- GUARANTEED_NO_EVICT = 'GUARANTEED_NO_EVICT'#
- MAX_UTILIZATION = 'MAX_UTILIZATION'#
- STATIC_BATCH = 'STATIC_BATCH'#
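Example (a hedged sketch that switches the capacity scheduler away from the default GUARANTEED_NO_EVICT policy; the model name is a placeholder):
```python
from tensorrt_llm.llmapi import LLM, CapacitySchedulerPolicy, SchedulerConfig

# MAX_UTILIZATION packs as many requests as possible into each iteration
# instead of reserving KV cache capacity for full sequences up front.
scheduler_config = SchedulerConfig(
    capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION,
)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", scheduler_config=scheduler_config)
```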
- class tensorrt_llm.llmapi.BuildConfig(
- max_input_len: int = 1024,
- max_seq_len: int = None,
- opt_batch_size: int = 8,
- max_batch_size: int = 2048,
- max_beam_width: int = 1,
- max_num_tokens: int = 8192,
- opt_num_tokens: Optional[int] = None,
- max_prompt_embedding_table_size: int = 0,
- kv_cache_type: tensorrt_llm.bindings.KVCacheType = None,
- gather_context_logits: int = False,
- gather_generation_logits: int = False,
- strongly_typed: bool = True,
- force_num_profiles: Optional[int] = None,
- profiling_verbosity: str = 'layer_names_only',
- enable_debug_output: bool = False,
- max_draft_len: int = 0,
- speculative_decoding_mode: tensorrt_llm.models.modeling_utils.SpeculativeDecodingMode = <SpeculativeDecodingMode.NONE: 1>,
- use_refit: bool = False,
- input_timing_cache: str = None,
- output_timing_cache: str = 'model.cache',
- lora_config: tensorrt_llm.lora_manager.LoraConfig = <factory>,
- auto_parallel_config: tensorrt_llm.auto_parallel.config.AutoParallelConfig = <factory>,
- weight_sparsity: bool = False,
- weight_streaming: bool = False,
- plugin_config: tensorrt_llm.plugin.plugin.PluginConfig = <factory>,
- use_strip_plan: bool = False,
- max_encoder_input_len: int = 1024,
- dry_run: bool = False,
- visualize_network: str = None,
- monitor_memory: bool = False,
- use_mrope: bool = False,
Bases:
object
- __init__(
- max_input_len: int = 1024,
- max_seq_len: int = None,
- opt_batch_size: int = 8,
- max_batch_size: int = 2048,
- max_beam_width: int = 1,
- max_num_tokens: int = 8192,
- opt_num_tokens: int | None = None,
- max_prompt_embedding_table_size: int = 0,
- kv_cache_type: ~tensorrt_llm.bindings.KVCacheType = None,
- gather_context_logits: int = False,
- gather_generation_logits: int = False,
- strongly_typed: bool = True,
- force_num_profiles: int | None = None,
- profiling_verbosity: str = 'layer_names_only',
- enable_debug_output: bool = False,
- max_draft_len: int = 0,
- speculative_decoding_mode: ~tensorrt_llm.models.modeling_utils.SpeculativeDecodingMode = <SpeculativeDecodingMode.NONE: 1>,
- use_refit: bool = False,
- input_timing_cache: str = None,
- output_timing_cache: str = 'model.cache',
- lora_config: ~tensorrt_llm.lora_manager.LoraConfig = <factory>,
- auto_parallel_config: ~tensorrt_llm.auto_parallel.config.AutoParallelConfig = <factory>,
- weight_sparsity: bool = False,
- weight_streaming: bool = False,
- plugin_config: ~tensorrt_llm.plugin.plugin.PluginConfig = <factory>,
- use_strip_plan: bool = False,
- max_encoder_input_len: int = 1024,
- dry_run: bool = False,
- visualize_network: str = None,
- monitor_memory: bool = False,
- use_mrope: bool = False,
- auto_parallel_config: AutoParallelConfig#
- dry_run: bool = False#
- enable_debug_output: bool = False#
- force_num_profiles: int | None = None#
- gather_context_logits: int = False#
- gather_generation_logits: int = False#
- input_timing_cache: str = None#
- kv_cache_type: KVCacheType = None#
- lora_config: LoraConfig#
- max_batch_size: int = 2048#
- max_beam_width: int = 1#
- max_draft_len: int = 0#
- max_encoder_input_len: int = 1024#
- max_input_len: int = 1024#
- max_num_tokens: int = 8192#
- max_prompt_embedding_table_size: int = 0#
- max_seq_len: int = None#
- monitor_memory: bool = False#
- opt_batch_size: int = 8#
- opt_num_tokens: int | None = None#
- output_timing_cache: str = 'model.cache'#
- plugin_config: PluginConfig#
- profiling_verbosity: str = 'layer_names_only'#
- speculative_decoding_mode: SpeculativeDecodingMode = 1#
- strongly_typed: bool = True#
- use_mrope: bool = False#
- use_refit: bool = False#
- use_strip_plan: bool = False#
- visualize_network: str = None#
- weight_sparsity: bool = False#
- weight_streaming: bool = False#
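Example (a hedged sketch of overriding engine build limits; all values are illustrative):
```python
from tensorrt_llm.llmapi import LLM, BuildConfig

build_config = BuildConfig(
    max_input_len=2048,
    max_seq_len=4096,
    max_batch_size=64,
    max_num_tokens=8192,
)
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", build_config=build_config)
```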
- class tensorrt_llm.llmapi.QuantConfig(
- quant_algo: QuantAlgo | None = None,
- kv_cache_quant_algo: QuantAlgo | None = None,
- group_size: int = 128,
- smoothquant_val: float = 0.5,
- clamp_val: List[float] | None = None,
- use_meta_recipe: bool = False,
- has_zero_point: bool = False,
- pre_quant_scale: bool = False,
- exclude_modules: List[str] | None = None,
Bases:
object
Serializable quantization configuration class, part of the PretrainedConfig.
- Parameters:
quant_algo (tensorrt_llm.quantization.mode.QuantAlgo, optional) – Quantization algorithm. Defaults to None.
kv_cache_quant_algo (tensorrt_llm.quantization.mode.QuantAlgo, optional) – KV cache quantization algorithm. Defaults to None.
group_size (int) – The group size for group-wise quantization. Defaults to 128.
smoothquant_val (float) – The smoothing parameter alpha used in smooth quant. Defaults to 0.5.
clamp_val (List[float], optional) – The clamp values used in FP8 rowwise quantization. Defaults to None.
use_meta_recipe (bool) – Whether to use Meta’s recipe for FP8 rowwise quantization. Defaults to False.
has_zero_point (bool) – Whether to use zero point for quantization. Defaults to False.
pre_quant_scale (bool) – Whether to use pre-quant scale for quantization. Defaults to False.
exclude_modules (List[str], optional) – The module name patterns that are skipped in quantization. Defaults to None.
- __init__(
- quant_algo: QuantAlgo | None = None,
- kv_cache_quant_algo: QuantAlgo | None = None,
- group_size: int = 128,
- smoothquant_val: float = 0.5,
- clamp_val: List[float] | None = None,
- use_meta_recipe: bool = False,
- has_zero_point: bool = False,
- pre_quant_scale: bool = False,
- exclude_modules: List[str] | None = None,
- clamp_val: List[float] | None = None#
- exclude_modules: List[str] | None = None#
- classmethod from_dict(
- config: dict,
Create a QuantConfig instance from a dict.
- Parameters:
config (dict) – The dict used to create QuantConfig.
- Returns:
The QuantConfig created from dict.
- Return type:
- group_size: int = 128#
- has_zero_point: bool = False#
- is_module_excluded_from_quantization(name: str) bool [source]#
Check if the module is excluded from quantization.
- Parameters:
name (str) – The name of the module.
- Returns:
True if the module is excluded from quantization, False otherwise.
- Return type:
bool
- pre_quant_scale: bool = False#
- property quant_mode: QuantModeWrapper#
- smoothquant_val: float = 0.5#
- to_dict() dict [source]#
Dump a QuantConfig instance to a dict.
- Returns:
The dict dumped from QuantConfig.
- Return type:
dict
- use_meta_recipe: bool = False#
- class tensorrt_llm.llmapi.QuantAlgo(
- value,
- names=<not given>,
- *values,
- module=None,
- qualname=None,
- type=None,
- start=1,
- boundary=None,
Bases:
StrEnum
- FP8 = 'FP8'#
- FP8_BLOCK_SCALES = 'FP8_BLOCK_SCALES'#
- FP8_PER_CHANNEL_PER_TOKEN = 'FP8_PER_CHANNEL_PER_TOKEN'#
- INT8 = 'INT8'#
- MIXED_PRECISION = 'MIXED_PRECISION'#
- NO_QUANT = 'NO_QUANT'#
- NVFP4 = 'NVFP4'#
- W4A16 = 'W4A16'#
- W4A16_AWQ = 'W4A16_AWQ'#
- W4A16_GPTQ = 'W4A16_GPTQ'#
- W4A8_AWQ = 'W4A8_AWQ'#
- W4A8_QSERVE_PER_CHANNEL = 'W4A8_QSERVE_PER_CHANNEL'#
- W4A8_QSERVE_PER_GROUP = 'W4A8_QSERVE_PER_GROUP'#
- W8A16 = 'W8A16'#
- W8A16_GPTQ = 'W8A16_GPTQ'#
- W8A8_SQ_PER_CHANNEL = 'W8A8_SQ_PER_CHANNEL'#
- W8A8_SQ_PER_CHANNEL_PER_TENSOR_PLUGIN = 'W8A8_SQ_PER_CHANNEL_PER_TENSOR_PLUGIN'#
- W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN = 'W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN'#
- W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN = 'W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN'#
- W8A8_SQ_PER_TENSOR_PLUGIN = 'W8A8_SQ_PER_TENSOR_PLUGIN'#
- class tensorrt_llm.llmapi.CalibConfig(
- *,
- device: Literal['cuda', 'cpu'] = 'cuda',
- calib_dataset: str = 'cnn_dailymail',
- calib_batches: int = 512,
- calib_batch_size: int = 1,
- calib_max_seq_length: int = 512,
- random_seed: int = 1234,
- tokenizer_max_seq_length: int = 2048,
Bases:
BaseModel
Calibration configuration.
- field calib_batch_size: int = 1#
The batch size used for calibration.
- field calib_batches: int = 512#
The number of batches to run for calibration.
- field calib_dataset: str = 'cnn_dailymail'#
The name or local path of calibration dataset.
- field calib_max_seq_length: int = 512#
The maximum sequence length used for calibration.
- field device: Literal['cuda', 'cpu'] = 'cuda'#
The device to run calibration.
- classmethod from_dict(
- config: dict,
Create a CalibConfig instance from a dict.
- Parameters:
config (dict) – The dict used to create CalibConfig.
- Returns:
The CalibConfig created from dict.
- Return type:
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field random_seed: int = 1234#
The random seed used for calibration.
- to_dict() dict [source]#
Dump a CalibConfig instance to a dict.
- Returns:
The dict dumped from CalibConfig.
- Return type:
dict
- field tokenizer_max_seq_length: int = 2048#
The maximum sequence length to initialize tokenizer for calibration.
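Example (a hedged sketch of FP8 post-training quantization with a custom calibration run; the model, algorithms, and calibration sizes are illustrative):
```python
from tensorrt_llm.llmapi import LLM, CalibConfig, QuantAlgo, QuantConfig

quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)
calib_config = CalibConfig(calib_batches=256, calib_max_seq_length=512)
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    quant_config=quant_config,
    calib_config=calib_config,
)
```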
- class tensorrt_llm.llmapi.BuildCacheConfig(
- cache_root: Path | None = None,
- max_records: int = 10,
- max_cache_storage_gb: float = 256,
Bases:
object
Configuration for the build cache.
- cache_root#
The root directory for the build cache.
- Type:
str
- max_records#
The maximum number of records to store in the cache.
- Type:
int
- max_cache_storage_gb#
The maximum amount of storage (in GB) to use for the cache.
- Type:
float
Note
The build-cache assumes the weights of the model are not changed during the execution. If the weights are changed, you should remove the caches manually.
- __init__(
- cache_root: Path | None = None,
- max_records: int = 10,
- max_cache_storage_gb: float = 256,
- property cache_root: Path#
- property max_cache_storage_gb: float#
- property max_records: int#
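Example (a hedged sketch; the cache limits are illustrative, and the cache assumes model weights do not change between runs):
```python
from tensorrt_llm.llmapi import LLM, BuildCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_build_cache=BuildCacheConfig(max_records=10, max_cache_storage_gb=256),
)
```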
- class tensorrt_llm.llmapi.RequestError[source]#
Bases:
RuntimeError
The error raised when a request fails.
- class tensorrt_llm.llmapi.MpiCommSession(comm=None, n_workers: int = 1)[source]#
Bases:
MpiSession
- class tensorrt_llm.llmapi.ExtendedRuntimePerfKnobConfig(
- *,
- multi_block_mode: bool = True,
- enable_context_fmha_fp32_acc: bool = False,
- cuda_graph_mode: bool = False,
- cuda_graph_cache_size: int = 0,
Bases:
BaseModel
,PybindMirror
Configuration for extended runtime performance knobs.
- field cuda_graph_cache_size: int = 0#
Number of CUDA graphs to be cached in the runtime. The larger the cache, the better the performance, but the more GPU memory is consumed.
- field cuda_graph_mode: bool = False#
Whether to use CUDA graph mode.
- field enable_context_fmha_fp32_acc: bool = False#
Whether to enable context FMHA FP32 accumulation.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field multi_block_mode: bool = True#
Whether to use multi-block mode.
- class tensorrt_llm.llmapi.BatchingType(
- value,
- names=<not given>,
- *values,
- module=None,
- qualname=None,
- type=None,
- start=1,
- boundary=None,
Bases:
StrEnum
- INFLIGHT = 'INFLIGHT'#
- STATIC = 'STATIC'#
- class tensorrt_llm.llmapi.ContextChunkingPolicy(
- value,
- names=<not given>,
- *values,
- module=None,
- qualname=None,
- type=None,
- start=1,
- boundary=None,
Bases:
StrEnum
Context chunking policy.
- EQUAL_PROGRESS = 'EQUAL_PROGRESS'#
- FIRST_COME_FIRST_SERVED = 'FIRST_COME_FIRST_SERVED'#
- class tensorrt_llm.llmapi.DynamicBatchConfig(
- *,
- enable_batch_size_tuning: bool,
- enable_max_num_tokens_tuning: bool,
- dynamic_batch_moving_average_window: int,
Bases:
BaseModel
,PybindMirror
Dynamic batch configuration.
Controls how batch size and token limits are dynamically adjusted at runtime.
- field dynamic_batch_moving_average_window: int [Required]#
The window size of the moving average of input and output lengths that is used to calculate the dynamic batch size and max num tokens.
- field enable_batch_size_tuning: bool [Required]#
Controls if the batch size should be tuned dynamically
- field enable_max_num_tokens_tuning: bool [Required]#
Controls if the max num tokens should be tuned dynamically
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class tensorrt_llm.llmapi.CacheTransceiverConfig(*, max_num_tokens: int | None = None)[source]#
Bases:
BaseModel
,PybindMirror
Configuration for the cache transceiver.
- field max_num_tokens: int | None = None#
The max number of tokens the transfer buffer can fit.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class tensorrt_llm.llmapi.NGramDecodingConfig(
- *,
- max_draft_len: int | None = None,
- speculative_model: str | Path | None = None,
- prompt_lookup_num_tokens: int = 2,
- max_matching_ngram_size: int = 4,
- is_keep_all: bool = True,
- is_use_oldest: bool = True,
- is_public_pool: bool = True,
Bases:
DecodingBaseConfig
Configuration for NGram drafter speculative decoding.
- Parameters:
prompt_lookup_num_tokens (int) – The maximum length of draft tokens (i.e., the maximum number of output draft tokens).
max_matching_ngram_size (int) – The maximum length of the token pattern searched for in the prompt (i.e., the maximum number of input tokens used for matching).
is_keep_all (bool) – Whether to keep all candidate pattern-match pairs; if False, only one match is kept for each pattern. Defaults to True.
is_use_oldest (bool) – Whether to return the oldest match when a pattern is hit; if False, the newest one is returned. Defaults to True.
is_public_pool (bool) – Whether to use a common pool for all requests; if False, each request has a private pool. Defaults to True.
- decoding_type: ClassVar[str] = 'NGram'#
- field is_keep_all: bool = True#
- field is_public_pool: bool = True#
- field is_use_oldest: bool = True#
- field max_matching_ngram_size: int = 4#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field prompt_lookup_num_tokens: int = 2#
- tensorrt_llm.llmapi.LlmArgs#
alias of
TrtLlmArgs
- class tensorrt_llm.llmapi.TorchLlmArgs(
- *,
- model: str | ~pathlib.Path,
- tokenizer: str | ~pathlib.Path | ~transformers.tokenization_utils_base.PreTrainedTokenizerBase | ~tensorrt_llm.llmapi.tokenizer.TokenizerBase | None = None,
- tokenizer_mode: ~typing.Literal['auto',
- 'slow'] = 'auto',
- skip_tokenizer_init: bool = False,
- trust_remote_code: bool = False,
- tensor_parallel_size: int = 1,
- dtype: str = 'auto',
- revision: str | None = None,
- tokenizer_revision: str | None = None,
- pipeline_parallel_size: int = 1,
- context_parallel_size: int = 1,
- gpus_per_node: int | None = None,
- moe_cluster_parallel_size: int | None = None,
- moe_tensor_parallel_size: int | None = None,
- moe_expert_parallel_size: int | None = None,
- enable_attention_dp: bool = False,
- cp_config: dict | None = <factory>,
- load_format: str | ~tensorrt_llm.llmapi.llm_args.LoadFormat = LoadFormat.AUTO,
- enable_lora: bool = False,
- max_lora_rank: int | None = None,
- max_loras: int = 4,
- max_cpu_loras: int = 4,
- lora_config: ~tensorrt_llm.lora_manager.LoraConfig | None = None,
- enable_prompt_adapter: bool = False,
- max_prompt_adapter_token: int = 0,
- quant_config: ~tensorrt_llm.models.modeling_utils.QuantConfig | None = None,
- kv_cache_config: ~tensorrt_llm.llmapi.llm_args.KvCacheConfig = <factory>,
- enable_chunked_prefill: bool = False,
- guided_decoding_backend: str | None = None,
- batched_logits_processor: object | None = None,
- iter_stats_max_iterations: int | None = None,
- request_stats_max_iterations: int | None = None,
- peft_cache_config: ~tensorrt_llm.llmapi.llm_args.PeftCacheConfig | None = None,
- scheduler_config: ~tensorrt_llm.llmapi.llm_args.SchedulerConfig = <factory>,
- cache_transceiver_config: ~tensorrt_llm.llmapi.llm_args.CacheTransceiverConfig | None = None,
- speculative_config: ~tensorrt_llm.llmapi.llm_args.LookaheadDecodingConfig | ~tensorrt_llm.llmapi.llm_args.MedusaDecodingConfig | ~tensorrt_llm.llmapi.llm_args.EagleDecodingConfig | ~tensorrt_llm.llmapi.llm_args.MTPDecodingConfig | ~tensorrt_llm.llmapi.llm_args.NGramDecodingConfig | None = None,
- batching_type: ~tensorrt_llm.llmapi.llm_args.BatchingType | None = None,
- normalize_log_probs: bool = False,
- max_batch_size: int | None = None,
- max_input_len: int = 1024,
- max_seq_len: int | None = None,
- max_beam_width: int = 1,
- max_num_tokens: int | None = None,
- backend: str | None = None,
- gather_generation_logits: bool = False,
- _num_postprocess_workers: int = 0,
- _postprocess_tokenizer_dir: str | None = None,
- _reasoning_parser: str | None = None,
- decoding_config: object | None = None,
- _mpi_session: object | None = None,
- build_config: object | None = None,
- use_cuda_graph: bool = False,
- cuda_graph_batch_sizes: ~typing.List[int] | None = None,
- cuda_graph_max_batch_size: int = 0,
- cuda_graph_padding_enabled: bool = False,
- disable_overlap_scheduler: bool = False,
- moe_max_num_tokens: int | None = None,
- moe_load_balancer: object | str | None = None,
- attn_backend: str = 'TRTLLM',
- moe_backend: str = 'CUTLASS',
- mixed_sampler: bool = False,
- enable_trtllm_sampler: bool = False,
- kv_cache_dtype: str = 'auto',
- use_kv_cache: bool = True,
- enable_iter_perf_stats: bool = False,
- enable_iter_req_stats: bool = False,
- print_iter_log: bool = False,
- torch_compile_enabled: bool = False,
- torch_compile_fullgraph: bool = True,
- torch_compile_inductor_enabled: bool = False,
- torch_compile_piecewise_cuda_graph: bool = False,
- torch_compile_enable_userbuffers: bool = True,
- autotuner_enabled: bool = True,
- enable_layerwise_nvtx_marker: bool = False,
- auto_deploy_config: object | None = None,
- enable_min_latency: bool = False,
- **extra_data: ~typing.Any,
Bases:
BaseLlmArgs
- field attn_backend: str = 'TRTLLM'#
Attention backend to use.
- Validated by:
validate_cuda_graph_config
- field auto_deploy_config: object | None = None#
Auto deploy config.
- Validated by:
validate_cuda_graph_config
- field autotuner_enabled: bool = True#
Enable the autotuner. The autotuner only takes effect when torch.compile is enabled.
- Validated by:
validate_cuda_graph_config
- field build_config: object | None = None#
Build config.
- Validated by:
validate_cuda_graph_config
- field cuda_graph_batch_sizes: List[int] | None = None#
List of batch sizes to create CUDA graphs for.
- Validated by:
validate_cuda_graph_config
- field cuda_graph_max_batch_size: int = 0#
Maximum batch size for CUDA graphs.
- Validated by:
validate_cuda_graph_config
validate_cuda_graph_max_batch_size
- field cuda_graph_padding_enabled: bool = False#
If true, batch sizes are rounded up to the nearest value in cuda_graph_batch_sizes. This is usually a net win for performance.
- Validated by:
validate_cuda_graph_config
- decoding_config: object | None#
Deprecated field; accessing it emits a runtime deprecation warning.
- field disable_overlap_scheduler: bool = False#
Disable the overlap scheduler.
- Validated by:
validate_cuda_graph_config
- field enable_iter_perf_stats: bool = False#
Enable iteration performance statistics.
- Validated by:
validate_cuda_graph_config
- field enable_iter_req_stats: bool = False#
If true, enables per request stats per iteration. Must also set enable_iter_perf_stats to true to get request stats.
- Validated by:
validate_cuda_graph_config
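As a sketch of the two statistics flags above (these are TorchLlmArgs fields, so the PyTorch backend is assumed; the checkpoint name is a placeholder and the exact stats-retrieval accessor varies by release):

    from tensorrt_llm.llmapi import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
        enable_iter_perf_stats=True,
        enable_iter_req_stats=True,  # only reported when enable_iter_perf_stats is also True
    )

    outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
    # Per-iteration (and per-request) records can then be pulled from the LLM's
    # stats accessor, e.g. llm.get_stats(timeout=2) in recent releases.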
- field enable_layerwise_nvtx_marker: bool = False#
If true, enable layerwise nvtx marker.
- Validated by:
validate_cuda_graph_config
- field enable_min_latency: bool = False#
If true, enable min-latency mode. Currently only used for Llama4.
- Validated by:
validate_cuda_graph_config
- field enable_trtllm_sampler: bool = False#
If true, use the TRTLLM sampler instead of the PyTorch sampler. The TRTLLM sampler covers a wider range of sampling strategies.
- Validated by:
validate_cuda_graph_config
- property extra_resource_managers: Dict[str, object]#
- field kv_cache_dtype: str = 'auto'#
Data type for KV cache.
- Validated by:
validate_cuda_graph_config
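For example, the KV cache data type can be overridden at construction. "fp8" below is an assumed-supported value and may not be available for every model or backend; "auto" keeps the checkpoint default:

    from tensorrt_llm.llmapi import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
        kv_cache_dtype="fp8",  # assumed-supported value; use "auto" for the default
    )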
- field load_format: str | LoadFormat = LoadFormat.AUTO#
How to load the model weights. By default, detect the weight type from the model checkpoint.
- Validated by:
convert_load_format
validate_cuda_graph_config
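For example, dummy loading skips reading the checkpoint weights (the architecture is kept, the weights are randomly initialized), which can be useful for benchmarking startup time and memory; the checkpoint name is a placeholder:

    from tensorrt_llm.llmapi import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
        load_format="dummy",  # skip weight loading; generated text will be meaningless
    )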
- max_cpu_loras: int#
Deprecated field; accessing it emits a runtime deprecation warning.
- max_lora_rank: int | None#
Deprecated field; accessing it emits a runtime deprecation warning.
- max_loras: int#
Deprecated field; accessing it emits a runtime deprecation warning.
- field mixed_sampler: bool = False#
If true, the sampler iterates over each request's sampling_params and applies the corresponding sampling strategy (e.g. top-k, top-p).
- Validated by:
validate_cuda_graph_config
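A sketch of mixed sampling, where each request carries its own SamplingParams (PyTorch backend assumed; checkpoint name and parameter values are placeholders):

    from tensorrt_llm.llmapi import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
        mixed_sampler=True,  # honor each request's own sampling strategy
    )

    prompts = ["Write a haiku about GPUs.", "List three prime numbers."]
    params = [
        SamplingParams(max_tokens=32, temperature=0.8, top_p=0.9),  # top-p sampling
        SamplingParams(max_tokens=16, top_k=1),                     # effectively greedy
    ]
    outputs = llm.generate(prompts, params)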
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'allow'}#
Configuration for the model; it should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(_TorchLlmArgs__context)[source]#
Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.
- field moe_backend: str = 'CUTLASS'#
MoE backend to use.
- Validated by:
validate_cuda_graph_config
- field moe_load_balancer: object | str | None = None#
Configuration for MoE load balancing.
- Validated by:
validate_cuda_graph_config
- field moe_max_num_tokens: int | None = None#
If set, at most moe_max_num_tokens tokens are sent to torch.ops.trtllm.fused_moe at a time; if the number of tokens exceeds this limit, the input tensors are split into chunks and processed in a loop.
- Validated by:
validate_cuda_graph_config
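An illustrative combination of the MoE fields above (PyTorch backend assumed; the checkpoint name and the token cap are placeholders):

    from tensorrt_llm.llmapi import LLM

    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder MoE checkpoint
        moe_backend="CUTLASS",
        moe_max_num_tokens=8192,  # larger inputs are chunked and looped over
    )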
- field print_iter_log: bool = False#
Print iteration logs.
- Validated by:
validate_cuda_graph_config
- field torch_compile_enable_userbuffers: bool = True#
When torch compile is enabled, userbuffers is enabled by default.
- Validated by:
validate_cuda_graph_config
- field torch_compile_enabled: bool = False#
Enable torch.compile optimization.
- Validated by:
validate_cuda_graph_config
- field torch_compile_fullgraph: bool = True#
Enable full graph compilation in torch.compile.
- Validated by:
validate_cuda_graph_config
- field torch_compile_inductor_enabled: bool = False#
Enable inductor backend in torch.compile.
- Validated by:
validate_cuda_graph_config
- field torch_compile_piecewise_cuda_graph: bool = False#
Enable piecewise CUDA graph in torch.compile.
- Validated by:
validate_cuda_graph_config
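An illustrative combination of the torch.compile-related knobs above; whether each one pays off is model- and GPU-dependent, the PyTorch backend is assumed, and the checkpoint name is a placeholder:

    from tensorrt_llm.llmapi import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
        torch_compile_enabled=True,
        torch_compile_fullgraph=True,
        torch_compile_inductor_enabled=False,
        autotuner_enabled=True,  # only takes effect while torch.compile is enabled
    )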
- field use_cuda_graph: bool = False#
If true, use CUDA graphs for decoding. CUDA graphs are only created for the batch sizes in cuda_graph_batch_sizes, and are enabled for batches that consist of decoding requests only (the reason is that it’s hard to capture a single graph with prefill requests since the input shapes are a function of the sequence lengths). Note that each CUDA graph can use up to 200 MB of extra memory.
- Validated by:
validate_cuda_graph_config
- field use_kv_cache: bool = True#
Whether to use KV cache.
- Validated by:
validate_cuda_graph_config
- validator validate_cuda_graph_config » all fields[source]#
Validate CUDA graph configuration.
Ensures that:
1. If cuda_graph_batch_sizes is provided, cuda_graph_max_batch_size must be 0.
2. If cuda_graph_batch_sizes is not provided, it is generated from cuda_graph_max_batch_size.
3. If both are provided, cuda_graph_batch_sizes must match the generated values.
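A sketch of two configurations that satisfy these rules (PyTorch backend assumed; checkpoint name and batch sizes are placeholders):

    from tensorrt_llm.llmapi import LLM

    # Explicit batch sizes: leave cuda_graph_max_batch_size at its default of 0.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
        use_cuda_graph=True,
        cuda_graph_batch_sizes=[1, 2, 4, 8],
        cuda_graph_padding_enabled=True,
    )

    # Alternatively, give only a maximum and let the batch-size list be generated:
    # llm = LLM(model=..., use_cuda_graph=True, cuda_graph_max_batch_size=16)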
- class tensorrt_llm.llmapi.TrtLlmArgs(
- *,
- model: str | ~pathlib.Path,
- tokenizer: str | ~pathlib.Path | ~transformers.tokenization_utils_base.PreTrainedTokenizerBase | ~tensorrt_llm.llmapi.tokenizer.TokenizerBase | None = None,
- tokenizer_mode: ~typing.Literal['auto', 'slow'] = 'auto',
- skip_tokenizer_init: bool = False,
- trust_remote_code: bool = False,
- tensor_parallel_size: int = 1,
- dtype: str = 'auto',
- revision: str | None = None,
- tokenizer_revision: str | None = None,
- pipeline_parallel_size: int = 1,
- context_parallel_size: int = 1,
- gpus_per_node: int | None = None,
- moe_cluster_parallel_size: int | None = None,
- moe_tensor_parallel_size: int | None = None,
- moe_expert_parallel_size: int | None = None,
- enable_attention_dp: bool = False,
- cp_config: dict | None = <factory>,
- load_format: ~typing.Literal['auto', 'dummy'] = 'auto',
- enable_lora: bool = False,
- max_lora_rank: int | None = None,
- max_loras: int = 4,
- max_cpu_loras: int = 4,
- lora_config: ~tensorrt_llm.lora_manager.LoraConfig | None = None,
- enable_prompt_adapter: bool = False,
- max_prompt_adapter_token: int = 0,
- quant_config: ~tensorrt_llm.models.modeling_utils.QuantConfig | None = None,
- kv_cache_config: ~tensorrt_llm.llmapi.llm_args.KvCacheConfig = <factory>,
- enable_chunked_prefill: bool = False,
- guided_decoding_backend: str | None = None,
- batched_logits_processor: object | None = None,
- iter_stats_max_iterations: int | None = None,
- request_stats_max_iterations: int | None = None,
- peft_cache_config: ~tensorrt_llm.llmapi.llm_args.PeftCacheConfig | None = None,
- scheduler_config: ~tensorrt_llm.llmapi.llm_args.SchedulerConfig = <factory>,
- cache_transceiver_config: ~tensorrt_llm.llmapi.llm_args.CacheTransceiverConfig | None = None,
- speculative_config: ~tensorrt_llm.llmapi.llm_args.LookaheadDecodingConfig | ~tensorrt_llm.llmapi.llm_args.MedusaDecodingConfig | ~tensorrt_llm.llmapi.llm_args.EagleDecodingConfig | ~tensorrt_llm.llmapi.llm_args.MTPDecodingConfig | ~tensorrt_llm.llmapi.llm_args.NGramDecodingConfig | None = None,
- batching_type: ~tensorrt_llm.llmapi.llm_args.BatchingType | None = None,
- normalize_log_probs: bool = False,
- max_batch_size: int | None = None,
- max_input_len: int = 1024,
- max_seq_len: int | None = None,
- max_beam_width: int = 1,
- max_num_tokens: int | None = None,
- backend: str | None = None,
- gather_generation_logits: bool = False,
- _num_postprocess_workers: int = 0,
- _postprocess_tokenizer_dir: str | None = None,
- _reasoning_parser: str | None = None,
- decoding_config: object | None = None,
- _mpi_session: object | None = None,
- auto_parallel: bool = False,
- auto_parallel_world_size: int | None = None,
- enable_tqdm: bool = False,
- build_config: object | None = None,
- workspace: str | None = None,
- enable_build_cache: object = False,
- extended_runtime_perf_knob_config: ~tensorrt_llm.llmapi.llm_args.ExtendedRuntimePerfKnobConfig | None = None,
- calib_config: ~tensorrt_llm.llmapi.llm_args.CalibConfig | None = None,
- embedding_parallel_mode: str = 'SHARDING_ALONG_VOCAB',
- fast_build: bool = False,
- **extra_data: ~typing.Any,
Bases:
BaseLlmArgs
- auto_parallel: bool#
Deprecated field; accessing it emits a runtime deprecation warning.
- property auto_parallel_config: AutoParallelConfig#
- auto_parallel_world_size: int | None#
Deprecated field; accessing it emits a runtime deprecation warning.
- field build_config: object | None = None#
Build config.
- field calib_config: CalibConfig | None = None#
Calibration config.
- decoding_config: object | None#
Deprecated field; accessing it emits a runtime deprecation warning.
- field embedding_parallel_mode: str = 'SHARDING_ALONG_VOCAB'#
The embedding parallel mode.
- field enable_build_cache: object = False#
Enable build cache.
- field enable_tqdm: bool = False#
Enable tqdm for progress bar.
- field extended_runtime_perf_knob_config: ExtendedRuntimePerfKnobConfig | None = None#
Extended runtime perf knob config.
- field fast_build: bool = False#
Enable fast build.
- max_cpu_loras: int#
Deprecated field; accessing it emits a runtime deprecation warning.
- max_lora_rank: int | None#
Deprecated field; accessing it emits a runtime deprecation warning.
- max_loras: int#
Deprecated field; accessing it emits a runtime deprecation warning.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'allow'}#
Configuration for the model; it should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(_TrtLlmArgs__context)[source]#
Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.
- field workspace: str | None = None#
The workspace for the model.
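Since TrtLlmArgs is the default argument set (LlmArgs is an alias of it), its build-related knobs can be passed directly to the LLM constructor; the checkpoint name and values below are illustrative:

    from tensorrt_llm.llmapi import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
        enable_build_cache=True,  # reuse a previously built engine when possible
        fast_build=True,          # trade some runtime performance for build speed
        enable_tqdm=True,         # show progress bars during engine build
    )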