API Reference#
- class tensorrt_llm.llmapi.LLM(
- model: str | Path,
- tokenizer: str | Path | PreTrainedTokenizerBase | TokenizerBase | None = None,
- tokenizer_mode: Literal['auto', 'slow'] = 'auto',
- skip_tokenizer_init: bool = False,
- trust_remote_code: bool = False,
- tensor_parallel_size: int = 1,
- dtype: str = 'auto',
- revision: str | None = None,
- tokenizer_revision: str | None = None,
- **kwargs: Any,
Bases:
object
The LLM class is the main entry point for running a large language model.
- Parameters:
model (Union[str, pathlib.Path]) – The path to the model checkpoint or the model name from the Hugging Face Hub.
tokenizer (Union[str, pathlib.Path, transformers.tokenization_utils_base.PreTrainedTokenizerBase, tensorrt_llm.llmapi.tokenizer.TokenizerBase, NoneType]) – The path to the tokenizer checkpoint or the tokenizer name from the Hugging Face Hub. Defaults to None.
tokenizer_mode (Literal['auto', 'slow']) – The mode to initialize the tokenizer. Defaults to auto.
skip_tokenizer_init (bool) – Whether to skip the tokenizer initialization. Defaults to False.
trust_remote_code (bool) – Whether to trust the remote code. Defaults to False.
tensor_parallel_size (int) – The tensor parallel size. Defaults to 1.
dtype (str) – The data type to use for the model. Defaults to auto.
revision (Optional[str]) – The revision to use for the model. Defaults to None.
tokenizer_revision (Optional[str]) – The revision to use for the tokenizer. Defaults to None.
pipeline_parallel_size (int) – The pipeline parallel size. Defaults to 1.
context_parallel_size (int) – The context parallel size. Defaults to 1.
gpus_per_node (Optional[int]) – The number of GPUs per node. Defaults to None.
moe_tensor_parallel_size (Optional[int]) – The tensor parallel size for MoE models' expert weights. Defaults to None.
moe_expert_parallel_size (Optional[int]) – The expert parallel size for MoE models' expert weights. Defaults to None.
enable_attention_dp (bool) – Enable attention data parallel. Defaults to False.
cp_config (Optional[dict]) – Context parallel config. Defaults to None.
auto_parallel (bool) – Enable auto parallel mode. Defaults to False.
auto_parallel_world_size (Optional[int]) – The world size for auto parallel mode. Defaults to None.
load_format (Literal['auto', 'dummy']) – The format to load the model. Defaults to auto.
enable_tqdm (bool) – Enable tqdm for progress bar. Defaults to False.
enable_lora (bool) – Enable LoRA. Defaults to False.
max_lora_rank (Optional[int]) – The maximum LoRA rank. Defaults to None.
max_loras (int) – The maximum number of LoRA adapters. Defaults to 4.
max_cpu_loras (int) – The maximum number of LoRA adapters cached on CPU. Defaults to 4.
enable_prompt_adapter (bool) – Enable prompt adapter. Defaults to False.
max_prompt_adapter_token (int) – The maximum number of prompt adapter tokens. Defaults to 0.
quant_config (Optional[tensorrt_llm.models.modeling_utils.QuantConfig]) – Quantization config. Defaults to None.
calib_config (Optional[tensorrt_llm.llmapi.llm_args.CalibConfig]) – Calibration config. Defaults to None.
build_config (Optional[tensorrt_llm.builder.BuildConfig]) – Build config. Defaults to None.
kv_cache_config (Optional[tensorrt_llm.llmapi.llm_args.KvCacheConfig]) – KV cache config. Defaults to None.
enable_chunked_prefill (bool) – Enable chunked prefill. Defaults to False.
guided_decoding_backend (Optional[str]) – Guided decoding backend. Defaults to None.
batched_logits_processor (Optional[tensorrt_llm.sampling_params.BatchedLogitsProcessor]) – Batched logits processor. Defaults to None.
iter_stats_max_iterations (Optional[int]) – The maximum number of iterations for iter stats. Defaults to None.
request_stats_max_iterations (Optional[int]) – The maximum number of iterations for request stats. Defaults to None.
workspace (Optional[str]) – The workspace for the model. Defaults to None.
embedding_parallel_mode (str) – The embedding parallel mode. Defaults to SHARDING_ALONG_VOCAB.
fast_build (bool) – Enable fast build. Defaults to False.
enable_build_cache (Union[tensorrt_llm.llmapi.build_cache.BuildCacheConfig, bool]) – Enable build cache. Defaults to False.
peft_cache_config (Optional[tensorrt_llm.llmapi.llm_args.PeftCacheConfig]) – PEFT cache config. Defaults to None.
scheduler_config (Optional[tensorrt_llm.llmapi.llm_args.SchedulerConfig]) – Scheduler config. Defaults to None.
speculative_config (Union[tensorrt_llm.llmapi.llm_args.LookaheadDecodingConfig, tensorrt_llm.llmapi.llm_args.MedusaDecodingConfig, tensorrt_llm.llmapi.llm_args.EagleDecodingConfig, tensorrt_llm.llmapi.llm_args.MTPDecodingConfig, NoneType]) – Speculative decoding config. Defaults to None.
batching_type (Optional[tensorrt_llm.llmapi.llm_args.BatchingType]) – Batching type. Defaults to None.
normalize_log_probs (bool) – Normalize log probabilities. Defaults to False.
gather_generation_logits (bool) – Gather generation logits. Defaults to False.
extended_runtime_perf_knob_config (Optional[tensorrt_llm.llmapi.llm_args.ExtendedRuntimePerfKnobConfig]) – Extended runtime perf knob config. Defaults to None.
max_batch_size (Optional[int]) – The maximum batch size. Defaults to None.
max_input_len (int) – The maximum input length. Defaults to 1024.
max_seq_len (Optional[int]) – The maximum sequence length. Defaults to None.
max_beam_width (int) – The maximum beam width. Defaults to 1.
max_num_tokens (Optional[int]) – The maximum number of tokens. Defaults to None.
backend (Optional[str]) – The backend to use. Defaults to None.
kwargs (Any) – Advanced arguments passed to LlmArgs.
- tokenizer#
The tokenizer loaded by the LLM instance, if any.
- Type:
tensorrt_llm.llmapi.tokenizer.TokenizerBase, optional
- workspace#
The directory to store intermediate files.
- Type:
pathlib.Path
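A minimal construction sketch; the model name below is a placeholder for any local checkpoint path or Hugging Face Hub model id:

```python
from tensorrt_llm.llmapi import LLM

# Load (or build) the model from a Hugging Face checkpoint (placeholder name).
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tensor_parallel_size=1,
    dtype="auto",
)
```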
- __init__(
- model: str | Path,
- tokenizer: str | Path | PreTrainedTokenizerBase | TokenizerBase | None = None,
- tokenizer_mode: Literal['auto', 'slow'] = 'auto',
- skip_tokenizer_init: bool = False,
- trust_remote_code: bool = False,
- tensor_parallel_size: int = 1,
- dtype: str = 'auto',
- revision: str | None = None,
- tokenizer_revision: str | None = None,
- **kwargs: Any,
- generate(
- inputs: str | List[int] | TextPrompt | TokensPrompt | Sequence[str | List[int] | TextPrompt | TokensPrompt],
- sampling_params: SamplingParams | List[SamplingParams] | None = None,
- use_tqdm: bool = True,
- lora_request: LoRARequest | Sequence[LoRARequest] | None = None,
- prompt_adapter_request: PromptAdapterRequest | Sequence[PromptAdapterRequest] | None = None,
- queries: str | List[int] | TextPrompt | TokensPrompt | Sequence[str | List[int] | TextPrompt | TokensPrompt] | None = None,
- kv_cache_retention_config: KvCacheRetentionConfig | None = None,
- disaggregated_params: DisaggregatedParams | None = None,
Generate outputs for the given prompts in synchronous mode. Synchronous generation accepts either a single prompt or batched prompts.
- Parameters:
inputs (tensorrt_llm.inputs.data.PromptInputs, Sequence[tensorrt_llm.inputs.data.PromptInputs]) – The prompt text or token ids. It can be single prompt or batched prompts.
sampling_params (tensorrt_llm.sampling_params.SamplingParams, List[tensorrt_llm.sampling_params.SamplingParams], optional) – The sampling params for the generation. Defaults to None. A default one will be used if not provided.
use_tqdm (bool) – Whether to use tqdm to display the progress bar. Defaults to True.
lora_request (tensorrt_llm.executor.request.LoRARequest, Sequence[tensorrt_llm.executor.request.LoRARequest], optional) – LoRA request to use for generation, if any. Defaults to None.
prompt_adapter_request (tensorrt_llm.executor.request.PromptAdapterRequest, Sequence[tensorrt_llm.executor.request.PromptAdapterRequest], optional) – Prompt Adapter request to use for generation, if any. Defaults to None.
queries (tensorrt_llm.inputs.data.PromptInputs, Sequence[tensorrt_llm.inputs.data.PromptInputs], optional) – The query text or token ids. It can be a single prompt or batched prompts. It is used by star attention to run long-context tasks. Defaults to None.
kv_cache_retention_config (tensorrt_llm.bindings.executor.KvCacheRetentionConfig, optional) – Configuration for the request’s retention in the KV Cache. Defaults to None.
disaggregated_params (tensorrt_llm.disaggregated_params.DisaggregatedParams, optional) – Disaggregated parameters. Defaults to None.
- Returns:
The output data of the completion request to the LLM.
- Return type:
Union[tensorrt_llm.llmapi.RequestOutput, List[tensorrt_llm.llmapi.RequestOutput]]
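A sketch of synchronous batched generation, assuming llm is an LLM instance constructed as in the example above:

```python
from tensorrt_llm.llmapi import SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(max_tokens=32, temperature=0.8)

# A batch of prompts returns a list of RequestOutput, one per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```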
- generate_async(
- inputs: str | List[int] | TextPrompt | TokensPrompt,
- sampling_params: SamplingParams | None = None,
- lora_request: LoRARequest | None = None,
- prompt_adapter_request: PromptAdapterRequest | None = None,
- streaming: bool = False,
- queries: str | List[int] | TextPrompt | TokensPrompt | None = None,
- kv_cache_retention_config: KvCacheRetentionConfig | None = None,
- disaggregated_params: DisaggregatedParams | None = None,
- _postproc_params: PostprocParams | None = None,
Generate output for the given prompt in asynchronous mode. Asynchronous generation accepts a single prompt only.
- Parameters:
inputs (tensorrt_llm.inputs.data.PromptInputs) – The prompt text or token ids; it must be a single prompt.
sampling_params (tensorrt_llm.sampling_params.SamplingParams, optional) – The sampling params for the generation. Defaults to None. A default one will be used if not provided.
lora_request (tensorrt_llm.executor.request.LoRARequest, optional) – LoRA request to use for generation, if any. Defaults to None.
prompt_adapter_request (tensorrt_llm.executor.request.PromptAdapterRequest, optional) – Prompt Adapter request to use for generation, if any. Defaults to None.
streaming (bool) – Whether to use the streaming mode for the generation. Defaults to False.
queries (tensorrt_llm.inputs.data.PromptInputs, optional) – The query text or token ids. It can be a single prompt or batched prompts. It is used by star attention to run long-context tasks. Defaults to None.
kv_cache_retention_config (tensorrt_llm.bindings.executor.KvCacheRetentionConfig, optional) – Configuration for the request’s retention in the KV Cache. Defaults to None.
disaggregated_params (tensorrt_llm.disaggregated_params.DisaggregatedParams, optional) – Disaggregated parameters. Defaults to None.
- Returns:
The output data of the completion request to the LLM.
- Return type:
tensorrt_llm.llmapi.RequestOutput
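A streaming sketch using generate_async from an async coroutine, assuming llm is an existing LLM instance and that the returned result can be iterated asynchronously when streaming=True:

```python
import asyncio
from tensorrt_llm.llmapi import SamplingParams

async def stream_one(prompt: str):
    # streaming=True yields partial results as new tokens are produced.
    async for partial in llm.generate_async(
            prompt, SamplingParams(max_tokens=32), streaming=True):
        print(partial.outputs[0].text_diff, end="", flush=True)

asyncio.run(stream_one("Explain KV caching in one sentence."))
```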
- get_kv_cache_events(
- timeout: float | None = 2,
Get iteration KV events from the runtime.
- KV events are used to track changes and operations within the KV Cache. Types of events:
KVCacheCreatedData: Indicates the creation of cache blocks.
KVCacheStoredData: Represents a sequence of stored blocks.
KVCacheRemovedData: Contains the hashes of blocks that are being removed from the cache.
KVCacheUpdatedData: Captures updates to existing cache blocks.
- To enable KV events:
set event_buffer_max_size to a positive integer in the KvCacheConfig.
set enable_block_reuse to True in the KvCacheConfig.
- Parameters:
timeout (float, optional) – Max wait time in seconds when retrieving events from queue. Defaults to 2.
- Returns:
A list of runtime events as dict.
- Return type:
List[dict]
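A sketch of enabling and polling KV cache events; the model name is a placeholder, and event_buffer_max_size / enable_block_reuse are the two KvCacheConfig fields mentioned above:

```python
from tensorrt_llm.llmapi import LLM, KvCacheConfig, SamplingParams

kv_cache_config = KvCacheConfig(enable_block_reuse=True, event_buffer_max_size=1024)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", kv_cache_config=kv_cache_config)

llm.generate(["Hello, world"], SamplingParams(max_tokens=16))

# Each entry is a dict describing a created/stored/removed/updated cache block event.
for event in llm.get_kv_cache_events(timeout=2):
    print(event)
```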
- get_kv_cache_events_async(
- timeout: float | None = 2,
Get iteration KV events from the runtime.
- KV events are used to track changes and operations within the KV Cache. Types of events:
KVCacheCreatedData: Indicates the creation of cache blocks.
KVCacheStoredData: Represents a sequence of stored blocks.
KVCacheRemovedData: Contains the hashes of blocks that are being removed from the cache.
KVCacheUpdatedData: Captures updates to existing cache blocks.
- To enable KV events:
set event_buffer_max_size to a positive integer in the KvCacheConfig.
set enable_block_reuse to True in the KvCacheConfig.
- Parameters:
timeout (float, optional) – Max wait time in seconds when retrieving events from queue. Defaults to 2.
- Returns:
An async iterable object containing runtime events.
- Return type:
tensorrt_llm.executor.result.IterationResult
- get_stats(timeout: float | None = 2) List[dict] [source]#
Get iteration statistics from the runtime. To collect statistics, call this function after prompts have been submitted with LLM().generate().
- Parameters:
timeout (float, optional) – Max wait time in seconds when retrieving stats from queue. Defaults to 2.
- Returns:
- A list of runtime stats as dict.
e.g., ['{"cpuMemUsage": …, "iter": 0, …}', '{"cpuMemUsage": …, "iter": 1, …}']
- Return type:
List[dict]
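A sketch of collecting iteration statistics after a synchronous generate() call, assuming llm is an existing LLM instance:

```python
from tensorrt_llm.llmapi import SamplingParams

llm.generate(["Hello, world"], SamplingParams(max_tokens=16))

# Each entry describes one runtime iteration (e.g. "iter", "cpuMemUsage").
for stats in llm.get_stats(timeout=2):
    print(stats)
```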
- get_stats_async(
- timeout: float | None = 2,
Get iteration statistics from the runtime. To collect statistics, call this function from an async coroutine, or use the /metrics endpoint (if you're using trtllm-serve), after prompts have been submitted.
- Parameters:
timeout (float, optional) – Max wait time in seconds when retrieving stats from queue. Defaults to 2.
- Returns:
An async iterable object containing runtime stats.
- Return type:
tensorrt_llm.executor.result.IterationResult
- save(engine_dir: str) None [source]#
Save the built engine to the given path.
- Parameters:
engine_dir (str) – The path to save the engine.
- property tokenizer: TokenizerBase | None#
- property workspace: Path#
- class tensorrt_llm.llmapi.CompletionOutput(
- index: int,
- text: str = '',
- token_ids: List[int] | None = None,
- cumulative_logprob: float | None = None,
- logprobs: List[float] | None = None,
- finish_reason: Literal['stop', 'length', 'timeout', 'cancelled'] | None = None,
- stop_reason: int | str | None = None,
- generation_logits: Tensor | None = None,
- disaggregated_params: DisaggregatedParams | None = None,
- _postprocess_result: Any = None,
Bases:
object
The output data of one completion of a request.
- Parameters:
index (int) – The index of the output in the request.
text (str) – The generated output text. Defaults to “”.
token_ids (List[int], optional) – The token ids of the generated output text. Defaults to None.
cumulative_logprob (float, optional) – The cumulative log probability of the generated output text. Defaults to None.
logprobs (List[float], optional) – The log probabilities of the top probability words at each position if the logprobs are requested. Defaults to None.
finish_reason (Literal['stop', 'length', 'timeout', 'cancelled'], optional) – The reason why the sequence is finished. Defaults to None.
stop_reason (int, str, optional) – The stop string or token id that caused the completion to stop, None if the completion finished for some other reason. Defaults to None.
generation_logits (torch.Tensor, optional) – The logits on the generated output token ids. Defaults to None.
disaggregated_params (tensorrt_llm.disaggregated_params.DisaggregatedParams, optional) – Parameters needed for disaggregated serving. Includes the type of request, the first generated tokens, the context request id, and any additional state that needs to be transferred between context and generation instances. Defaults to None.
- length#
The number of generated tokens.
- Type:
int
- token_ids_diff#
Newly generated token ids.
- Type:
List[int]
- logprobs_diff#
Logprobs of newly generated tokens.
- Type:
List[float]
- text_diff#
Newly generated text.
- Type:
str
- __init__(
- index: int,
- text: str = '',
- token_ids: List[int] | None = None,
- cumulative_logprob: float | None = None,
- logprobs: List[float] | None = None,
- finish_reason: Literal['stop', 'length', 'timeout', 'cancelled'] | None = None,
- stop_reason: int | str | None = None,
- generation_logits: Tensor | None = None,
- disaggregated_params: DisaggregatedParams | None = None,
- _postprocess_result: Any = None,
- cumulative_logprob: float | None#
- disaggregated_params: DisaggregatedParams | None#
- finish_reason: Literal['stop', 'length', 'timeout', 'cancelled'] | None#
- generation_logits: Tensor | None#
- index: int#
- property length: int#
- logprobs: List[float] | None#
- property logprobs_diff: List[float]#
- stop_reason: int | str | None#
- text: str#
- property text_diff: str#
- token_ids: List[int] | None#
- property token_ids_diff: List[int]#
- class tensorrt_llm.llmapi.RequestOutput[source]#
Bases:
DetokenizedGenerationResultBase
,GenerationResult
The output data of a completion request to the LLM.
- request_id#
The unique ID of the request.
- Type:
int
- prompt#
The prompt string of the request.
- Type:
str, optional
- prompt_token_ids#
The token ids of the prompt.
- Type:
List[int]
- outputs#
The output sequences of the request.
- Type:
List[CompletionOutput]
- context_logits#
The logits on the prompt token ids.
- Type:
torch.Tensor, optional
- finished#
Whether the whole request is finished.
- Type:
bool
- property prompt: str | None#
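A sketch of reading RequestOutput and CompletionOutput fields from a completed request, assuming llm is an existing LLM instance:

```python
from tensorrt_llm.llmapi import SamplingParams

result = llm.generate("The capital of France is", SamplingParams(max_tokens=8))

print(result.request_id, result.finished)
print(result.prompt, result.prompt_token_ids)
for completion in result.outputs:  # each entry is a CompletionOutput
    print(completion.index, completion.text, completion.finish_reason)
```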
- class tensorrt_llm.llmapi.GuidedDecodingParams(
- *,
- json: str | BaseModel | dict | None = None,
- regex: str | None = None,
- grammar: str | None = None,
- json_object: bool = False,
Bases:
object
Guided decoding parameters for text generation. At most one of the fields can be effective.
- Parameters:
json (str, pydantic.main.BaseModel, dict, optional) – The generated text conforms to JSON format, with additional user-specified restrictions given by a JSON schema. Defaults to None.
regex (str, optional) – The generated text matches the user-specified regular expression. Defaults to None.
grammar (str, optional) – The generated text conforms to the user-specified extended Backus-Naur form (EBNF) grammar. Defaults to None.
json_object (bool) – If True, the generated text conforms to JSON format. Defaults to False.
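A sketch of JSON-schema guided decoding; the schema is a placeholder, and a guided decoding backend is assumed to be configured on the LLM via the guided_decoding_backend argument:

```python
from tensorrt_llm.llmapi import GuidedDecodingParams, SamplingParams

person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

# Only one of json / regex / grammar / json_object should be set at a time.
guided = GuidedDecodingParams(json=person_schema)
sampling_params = SamplingParams(max_tokens=64, guided_decoding=guided)
```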
- __init__(
- *,
- json: str | BaseModel | dict | None = None,
- regex: str | None = None,
- grammar: str | None = None,
- json_object: bool = False,
- grammar: str | None#
- json: str | BaseModel | dict | None#
- json_object: bool#
- regex: str | None#
- class tensorrt_llm.llmapi.SamplingParams(
- *,
- end_id: int | None = None,
- pad_id: int | None = None,
- max_tokens: int = 32,
- max_new_tokens: int | None = None,
- bad: str | List[str] | None = None,
- bad_token_ids: List[int] | None = None,
- stop: str | List[str] | None = None,
- stop_token_ids: List[int] | None = None,
- include_stop_str_in_output: bool = False,
- embedding_bias: Tensor | None = None,
- logits_processor: LogitsProcessor | None = None,
- apply_batched_logits_processor: bool = False,
- n: int = 1,
- best_of: int | None = None,
- use_beam_search: bool = False,
- beam_width: int = 1,
- num_return_sequences: int | None = None,
- top_k: int | None = None,
- top_p: float | None = None,
- top_p_min: float | None = None,
- top_p_reset_ids: int | None = None,
- top_p_decay: float | None = None,
- seed: int | None = None,
- random_seed: int | None = None,
- temperature: float | None = None,
- min_tokens: int | None = None,
- min_length: int | None = None,
- beam_search_diversity_rate: float | None = None,
- repetition_penalty: float | None = None,
- presence_penalty: float | None = None,
- frequency_penalty: float | None = None,
- length_penalty: float | None = None,
- early_stopping: int | None = None,
- no_repeat_ngram_size: int | None = None,
- min_p: float | None = None,
- beam_width_array: List[int] | None = None,
- return_log_probs: bool = False,
- return_context_logits: bool = False,
- return_generation_logits: bool = False,
- exclude_input_from_output: bool = True,
- return_encoder_output: bool = False,
- return_perf_metrics: bool = False,
- additional_model_outputs: List[AdditionalModelOutput] | None = None,
- lookahead_config: LookaheadDecodingConfig | None = None,
- guided_decoding: GuidedDecodingParams | None = None,
- ignore_eos: bool = False,
- detokenize: bool = True,
- add_special_tokens: bool = True,
- truncate_prompt_tokens: int | None = None,
- skip_special_tokens: bool = True,
- spaces_between_special_tokens: bool = True,
Bases:
object
Sampling parameters for text generation.
- Parameters:
end_id (int, optional) – The end token id. Defaults to None.
pad_id (int, optional) – The pad token id. Defaults to None.
max_tokens (int) – The maximum number of tokens to generate. Defaults to 32.
max_new_tokens (int, optional) – The maximum number of tokens to generate. This argument is being deprecated; please use max_tokens instead. Defaults to None.
bad (str, List[str], optional) – A string or a list of strings that redirect the generation when they are generated, so that the bad strings are excluded from the returned output. Defaults to None.
bad_token_ids (List[int], optional) – A list of token ids that redirect the generation when they are generated, so that the bad ids are excluded from the returned output. Defaults to None.
stop (str, List[str], optional) – A string or a list of strings that stop the generation when they are generated. The returned output will not contain the stop strings unless include_stop_str_in_output is True. Defaults to None.
stop_token_ids (List[int], optional) – A list of token ids that stop the generation when they are generated. Defaults to None.
include_stop_str_in_output (bool) – Whether to include the stop strings in output text. Defaults to False.
embedding_bias (torch.Tensor, optional) – The embedding bias tensor. Expected type is kFP32 and shape is [vocab_size]. Defaults to None.
logits_processor (tensorrt_llm.sampling_params.LogitsProcessor, optional) – The logits postprocessor callback. Defaults to None. The LogitsProcessor class is recommended for callback creation.
apply_batched_logits_processor (bool) – Whether to apply batched logits postprocessor callback. Defaults to False. The BatchedLogitsProcessor class is recommended for callback creation. The callback must be provided when initializing LLM.
n (int) – Number of sequences to generate. Defaults to 1.
best_of (int, optional) – Number of sequences to consider for best output. Defaults to None.
use_beam_search (bool) – Whether to use beam search. Defaults to False.
beam_width (int) – The beam width. Setting it to 1 disables beam search. This parameter will be deprecated from the LLM API in a future release. Please use n/best_of/use_beam_search instead. Defaults to 1.
num_return_sequences (int, optional) – The number of sequences to return. If set to None, it defaults to the value of beam_width. This parameter will be deprecated from the LLM API in a future release. Please use n/best_of/use_beam_search instead. Defaults to None.
top_k (int, optional) – Controls the number of top logits to sample from. None means using C++ runtime default 0, i.e., all logits. Defaults to None.
top_p (float, optional) – Controls the top-P probability to sample from. None means using C++ runtime default 0.f. Defaults to None.
top_p_min (float, optional) – Controls decay in the top-P algorithm; top_p_min is the lower bound. None means using C++ runtime default 1.e-6. Defaults to None.
top_p_reset_ids (int, optional) – Controls decay in the top-P algorithm. Indicates where to reset the decay. None means using C++ runtime default 1. Defaults to None.
top_p_decay (float, optional) – Controls decay in the top-P algorithm. The decay value. None means using C++ runtime default 1.f. Defaults to None.
seed (int, optional) – Controls the random seed used by the random number generator in sampling. None means using C++ runtime default 0. Defaults to None.
random_seed (int, optional) – This argument is being deprecated; please use seed instead. Defaults to None.
temperature (float, optional) – Controls the modulation of logits when sampling new tokens. It can have values > 0.f. None means using C++ runtime default 1.0f. Defaults to None.
min_tokens (int, optional) – Lower bound on the number of tokens to generate. Values < 1 have no effect. None means using C++ runtime default 1. Defaults to None.
min_length (int, optional) – This argument is being deprecated; please use min_tokens instead. Defaults to None.
beam_search_diversity_rate (float, optional) – Controls the diversity of candidate beams in beam search. None means using the C++ runtime default. Defaults to None.
repetition_penalty (float, optional) – Used to penalize tokens based on how often they appear in the sequence. It can have any value > 0.f. Values < 1.f encourages repetition, values > 1.f discourages it. None means using C++ runtime default 1.f. Defaults to None.
presence_penalty (float, optional) – Used to penalize tokens already present in the sequence (irrespective of the number of appearances). It can have any values. Values < 0.f encourage repetition, values > 0.f discourage it. None means using C++ runtime default 0.f. Defaults to None.
frequency_penalty (float, optional) – Used to penalize tokens already present in the sequence (dependent on the number of appearances). It can have any values. Values < 0.f encourage repetition, values > 0.f discourage it. None means using C++ runtime default 0.f. Defaults to None.
length_penalty (float, optional) – Controls how to penalize longer sequences in beam search. None means using C++ runtime default 0.f. Defaults to None.
early_stopping (int, optional) – Controls whether the generation process finishes once beamWidth sentences are generated (ends with end_token). None means using C++ runtime default 1. Defaults to None.
no_repeat_ngram_size (int, optional) – Controls the size of n-grams that are not allowed to repeat in the output. None means using C++ runtime default 1 << 30. Defaults to None.
min_p (float, optional) – Scales the probability of the most likely token to determine the minimum token probability. None means using C++ runtime default 0.0. Defaults to None.
beam_width_array (List[int], optional) – The array of beam widths used in variable-beam-width search. Defaults to None.
return_log_probs (bool) – Controls if Result should contain log probabilities. Defaults to False.
return_context_logits (bool) – Controls if Result should contain the context logits. Defaults to False.
return_generation_logits (bool) – Controls if Result should contain the generation logits. Defaults to False.
exclude_input_from_output (bool) – Controls if output tokens in Result should include the input tokens. Defaults to True.
return_encoder_output (bool) – Controls if Result should contain encoder output hidden states (for encoder-only and encoder-decoder models). Defaults to False.
return_perf_metrics (bool) – Controls if Result should contain the performance metrics for this request. Defaults to False.
additional_model_outputs (List[tensorrt_llm.sampling_params.AdditionalModelOutput], optional) – The additional outputs to gather from the model. Defaults to None.
lookahead_config (tensorrt_llm.bindings.executor.LookaheadDecodingConfig , optional) – Lookahead decoding config. Defaults to None.
guided_decoding (tensorrt_llm.sampling_params.GuidedDecodingParams, optional) – Guided decoding params. Defaults to None.
ignore_eos (bool) – Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. Defaults to False.
detokenize (bool) – Whether to detokenize the output. Defaults to True.
add_special_tokens (bool) – Whether to add special tokens to the prompt. Defaults to True.
truncate_prompt_tokens (int, optional) – If set to an integer k, will use only the last k tokens from the prompt (i.e., left truncation). Defaults to None.
skip_special_tokens (bool) – Whether to skip special tokens in the output. Defaults to True.
spaces_between_special_tokens (bool) – Whether to add spaces between special tokens in the output. Defaults to True.
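A sketch of commonly used sampling settings; the values below are illustrative only:

```python
from tensorrt_llm.llmapi import SamplingParams

sampling_params = SamplingParams(
    max_tokens=64,      # upper bound on generated tokens
    temperature=0.7,    # > 0; higher values increase randomness
    top_k=40,           # sample from the 40 most likely tokens
    top_p=0.95,         # nucleus sampling threshold
    stop=["\n\n"],      # stop generation at a blank line
    seed=1234,          # make sampling reproducible
)
```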
- __init__(
- *,
- end_id: int | None = None,
- pad_id: int | None = None,
- max_tokens: int = 32,
- max_new_tokens: int | None = None,
- bad: str | List[str] | None = None,
- bad_token_ids: List[int] | None = None,
- stop: str | List[str] | None = None,
- stop_token_ids: List[int] | None = None,
- include_stop_str_in_output: bool = False,
- embedding_bias: Tensor | None = None,
- logits_processor: LogitsProcessor | None = None,
- apply_batched_logits_processor: bool = False,
- n: int = 1,
- best_of: int | None = None,
- use_beam_search: bool = False,
- beam_width: int = 1,
- num_return_sequences: int | None = None,
- top_k: int | None = None,
- top_p: float | None = None,
- top_p_min: float | None = None,
- top_p_reset_ids: int | None = None,
- top_p_decay: float | None = None,
- seed: int | None = None,
- random_seed: int | None = None,
- temperature: float | None = None,
- min_tokens: int | None = None,
- min_length: int | None = None,
- beam_search_diversity_rate: float | None = None,
- repetition_penalty: float | None = None,
- presence_penalty: float | None = None,
- frequency_penalty: float | None = None,
- length_penalty: float | None = None,
- early_stopping: int | None = None,
- no_repeat_ngram_size: int | None = None,
- min_p: float | None = None,
- beam_width_array: List[int] | None = None,
- return_log_probs: bool = False,
- return_context_logits: bool = False,
- return_generation_logits: bool = False,
- exclude_input_from_output: bool = True,
- return_encoder_output: bool = False,
- return_perf_metrics: bool = False,
- additional_model_outputs: List[AdditionalModelOutput] | None = None,
- lookahead_config: LookaheadDecodingConfig | None = None,
- guided_decoding: GuidedDecodingParams | None = None,
- ignore_eos: bool = False,
- detokenize: bool = True,
- add_special_tokens: bool = True,
- truncate_prompt_tokens: int | None = None,
- skip_special_tokens: bool = True,
- spaces_between_special_tokens: bool = True,
- add_special_tokens: bool#
- additional_model_outputs: List[AdditionalModelOutput] | None#
- apply_batched_logits_processor: bool#
- bad: str | List[str] | None#
- bad_token_ids: List[int] | None#
- beam_search_diversity_rate: float | None#
- beam_width: int#
- beam_width_array: List[int] | None#
- best_of: int | None#
- detokenize: bool#
- early_stopping: int | None#
- embedding_bias: Tensor | None#
- end_id: int | None#
- exclude_input_from_output: bool#
- frequency_penalty: float | None#
- guided_decoding: GuidedDecodingParams | None#
- ignore_eos: bool#
- include_stop_str_in_output: bool#
- length_penalty: float | None#
- logits_processor: LogitsProcessor | None#
- lookahead_config: LookaheadDecodingConfig | None#
- max_new_tokens: int | None#
- max_tokens: int#
- min_length: int | None#
- min_p: float | None#
- min_tokens: int | None#
- n: int#
- no_repeat_ngram_size: int | None#
- num_return_sequences: int | None#
- pad_id: int | None#
- presence_penalty: float | None#
- random_seed: int | None#
- repetition_penalty: float | None#
- return_context_logits: bool#
- return_encoder_output: bool#
- return_generation_logits: bool#
- return_log_probs: bool#
- return_perf_metrics: bool#
- seed: int | None#
- skip_special_tokens: bool#
- spaces_between_special_tokens: bool#
- stop: str | List[str] | None#
- stop_token_ids: List[int] | None#
- temperature: float | None#
- top_k: int | None#
- top_p: float | None#
- top_p_decay: float | None#
- top_p_min: float | None#
- top_p_reset_ids: int | None#
- truncate_prompt_tokens: int | None#
- use_beam_search: bool#
- class tensorrt_llm.llmapi.DisaggregatedParams(
- *,
- request_type: str | None = None,
- first_gen_tokens: List[int] | None = None,
- ctx_request_id: int | None = None,
- opaque_state: bytes | None = None,
- draft_tokens: List[int] | None = None,
Bases:
object
Disaggregated serving parameters.
- Parameters:
request_type (str) – The type of request (“context_only” or “generation_only”)
first_gen_tokens (List[int]) – The first tokens of the generation request
ctx_request_id (int) – The context request id
opaque_state (bytes) – Any additional state needing to be exchanged between context and gen instances
- __init__(
- *,
- request_type: str | None = None,
- first_gen_tokens: List[int] | None = None,
- ctx_request_id: int | None = None,
- opaque_state: bytes | None = None,
- draft_tokens: List[int] | None = None,
- ctx_request_id: int | None#
- draft_tokens: List[int] | None#
- first_gen_tokens: List[int] | None#
- opaque_state: bytes | None#
- request_type: str | None#
- class tensorrt_llm.llmapi.KvCacheConfig(
- *,
- enable_block_reuse: bool = True,
- max_tokens: int | None = None,
- max_attention_window: List[int] | None = None,
- sink_token_length: int | None = None,
- free_gpu_memory_fraction: float | None = None,
- host_cache_size: int | None = None,
- onboard_blocks: bool = True,
- cross_kv_cache_fraction: float | None = None,
- secondary_offload_min_priority: int | None = None,
- event_buffer_max_size: int = 0,
- enable_partial_reuse: bool = True,
- copy_on_partial_reuse: bool = True,
Bases:
BaseModel
,PybindMirror
Configuration for the KV cache.
- field copy_on_partial_reuse: bool = True#
Whether partially matched blocks that are in use can be reused after copying them.
- field cross_kv_cache_fraction: float | None = None#
The fraction of the KV cache memory that should be reserved for cross attention. If set to p, self attention will use 1-p of the KV cache memory and cross attention will use p of the KV cache memory. Default is 50%. Should only be set when using an encoder-decoder model.
- field enable_block_reuse: bool = True#
Controls if KV cache blocks can be reused for different requests.
- field enable_partial_reuse: bool = True#
Whether blocks that are only partially matched can be reused.
- field event_buffer_max_size: int = 0#
Maximum size of the event buffer. If set to 0, the event buffer will not be used.
- field free_gpu_memory_fraction: float | None = None#
The fraction of GPU memory that should be allocated to the KV cache. Default is 90%. If both max_tokens and free_gpu_memory_fraction are specified, memory corresponding to the minimum will be used.
- field host_cache_size: int | None = None#
Size of the host cache in bytes. If both max_tokens and host_cache_size are specified, memory corresponding to the minimum will be used.
- field max_attention_window: List[int] | None = None#
Size of the attention window for each sequence. Only the last tokens will be stored in the KV cache. If the number of elements in max_attention_window is less than the number of layers, max_attention_window will be repeated until it covers all layers.
- field max_tokens: int | None = None#
The maximum number of tokens that should be stored in the KV cache. If both max_tokens and free_gpu_memory_fraction are specified, memory corresponding to the minimum will be used.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field onboard_blocks: bool = True#
Controls if blocks are onboarded.
- field secondary_offload_min_priority: int | None = None#
Only blocks with priority greater than secondary_offload_min_priority can be offloaded to secondary memory.
- field sink_token_length: int | None = None#
Number of sink tokens (tokens to always keep in attention window).
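A sketch of a KV cache configuration passed to the LLM constructor; the model name and values are illustrative only:

```python
from tensorrt_llm.llmapi import LLM, KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,        # allow cache blocks to be reused across requests
    free_gpu_memory_fraction=0.85,  # reserve 85% of free GPU memory for the KV cache
)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", kv_cache_config=kv_cache_config)
```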
- class tensorrt_llm.llmapi.KvCacheRetentionConfig#
Bases:
pybind11_object
- class TokenRangeRetentionConfig#
Bases:
pybind11_object
- __init__(
- self: tensorrt_llm.bindings.executor.KvCacheRetentionConfig.TokenRangeRetentionConfig,
- token_start: int,
- token_end: int | None,
- priority: int,
- duration_ms: datetime.timedelta | None = None,
- property duration_ms#
- property priority#
- property token_end#
- property token_start#
- __init__(
- self: tensorrt_llm.bindings.executor.KvCacheRetentionConfig,
- token_range_retention_configs: list[tensorrt_llm.bindings.executor.KvCacheRetentionConfig.TokenRangeRetentionConfig],
- decode_retention_priority: int = 35,
- decode_duration_ms: datetime.timedelta | None = None,
- property decode_duration_ms#
- property decode_retention_priority#
- property token_range_retention_configs#
- class tensorrt_llm.llmapi.LookaheadDecodingConfig(
- *,
- max_draft_len: int | None = None,
- speculative_model: str | Path | None = None,
- max_window_size: int = 4,
- max_ngram_size: int = 3,
- max_verification_set_size: int = 4,
Bases:
DecodingBaseConfig
,PybindMirror
Configuration for lookahead speculative decoding.
- __init__(**data)[source]#
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- decoding_type: ClassVar[str] = 'Lookahead'#
- field max_ngram_size: int = 3#
Number of tokens per NGram.
- field max_verification_set_size: int = 4#
Number of NGrams in verification branch per step.
- field max_window_size: int = 4#
Number of NGrams in lookahead branch per step.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
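A sketch of enabling lookahead speculative decoding through the LLM's speculative_config argument; the model name and window/ngram sizes are illustrative only:

```python
from tensorrt_llm.llmapi import LLM, LookaheadDecodingConfig

lookahead_config = LookaheadDecodingConfig(
    max_window_size=4,
    max_ngram_size=3,
    max_verification_set_size=4,
)
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    speculative_config=lookahead_config,
)
```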
- class tensorrt_llm.llmapi.MedusaDecodingConfig(
- *,
- max_draft_len: int | None = None,
- speculative_model: str | Path | None = None,
- medusa_choices: List[List[int]] | None = None,
- num_medusa_heads: int | None = None,
Bases:
DecodingBaseConfig
- decoding_type: ClassVar[str] = 'Medusa'#
- field medusa_choices: List[List[int]] | None = None#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field num_medusa_heads: int | None = None#
- class tensorrt_llm.llmapi.EagleDecodingConfig(
- *,
- max_draft_len: int | None = None,
- speculative_model: str | Path | None = None,
- eagle_choices: List[List[int]] | None = None,
- greedy_sampling: bool | None = True,
- posterior_threshold: float | None = None,
- use_dynamic_tree: bool | None = False,
- dynamic_tree_max_topK: int | None = None,
- num_eagle_layers: int | None = None,
- max_non_leaves_per_layer: int | None = None,
- pytorch_eagle_weights_path: str | None = None,
Bases:
DecodingBaseConfig
- decoding_type: ClassVar[str] = 'Eagle'#
- field dynamic_tree_max_topK: int | None = None#
- field eagle_choices: List[List[int]] | None = None#
- field greedy_sampling: bool | None = True#
- field max_non_leaves_per_layer: int | None = None#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field num_eagle_layers: int | None = None#
- field posterior_threshold: float | None = None#
- field pytorch_eagle_weights_path: str | None = None#
- field use_dynamic_tree: bool | None = False#
- class tensorrt_llm.llmapi.MTPDecodingConfig(
- *,
- max_draft_len: int | None = None,
- speculative_model: str | Path | None = None,
- num_nextn_predict_layers: int | None = 1,
Bases:
DecodingBaseConfig
- decoding_type: ClassVar[str] = 'MTP'#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field num_nextn_predict_layers: int | None = 1#
- class tensorrt_llm.llmapi.SchedulerConfig(
- *,
- capacity_scheduler_policy: CapacitySchedulerPolicy = CapacitySchedulerPolicy.GUARANTEED_NO_EVICT,
- context_chunking_policy: ContextChunkingPolicy | None = None,
- dynamic_batch_config: DynamicBatchConfig | None = None,
Bases:
BaseModel
,PybindMirror
- field capacity_scheduler_policy: CapacitySchedulerPolicy = CapacitySchedulerPolicy.GUARANTEED_NO_EVICT#
The capacity scheduler policy to use
- field context_chunking_policy: ContextChunkingPolicy | None = None#
The context chunking policy to use
- field dynamic_batch_config: DynamicBatchConfig | None = None#
The dynamic batch config to use
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class tensorrt_llm.llmapi.CapacitySchedulerPolicy(
- value,
- names=<not given>,
- *values,
- module=None,
- qualname=None,
- type=None,
- start=1,
- boundary=None,
Bases:
StrEnum
- GUARANTEED_NO_EVICT = 'GUARANTEED_NO_EVICT'#
- MAX_UTILIZATION = 'MAX_UTILIZATION'#
- STATIC_BATCH = 'STATIC_BATCH'#
- class tensorrt_llm.llmapi.BuildConfig(
- max_input_len: int = 1024,
- max_seq_len: int = None,
- opt_batch_size: int = 8,
- max_batch_size: int = 2048,
- max_beam_width: int = 1,
- max_num_tokens: int = 8192,
- opt_num_tokens: Optional[int] = None,
- max_prompt_embedding_table_size: int = 0,
- kv_cache_type: tensorrt_llm.bindings.KVCacheType = None,
- gather_context_logits: int = False,
- gather_generation_logits: int = False,
- strongly_typed: bool = True,
- force_num_profiles: Optional[int] = None,
- profiling_verbosity: str = 'layer_names_only',
- enable_debug_output: bool = False,
- max_draft_len: int = 0,
- speculative_decoding_mode: tensorrt_llm.models.modeling_utils.SpeculativeDecodingMode = <SpeculativeDecodingMode.NONE: 1>,
- use_refit: bool = False,
- input_timing_cache: str = None,
- output_timing_cache: str = 'model.cache',
- lora_config: tensorrt_llm.lora_manager.LoraConfig = <factory>,
- auto_parallel_config: tensorrt_llm.auto_parallel.config.AutoParallelConfig = <factory>,
- weight_sparsity: bool = False,
- weight_streaming: bool = False,
- plugin_config: tensorrt_llm.plugin.plugin.PluginConfig = <factory>,
- use_strip_plan: bool = False,
- max_encoder_input_len: int = 1024,
- dry_run: bool = False,
- visualize_network: str = None,
- monitor_memory: bool = False,
- use_mrope: bool = False,
Bases:
object
- __init__(
- max_input_len: int = 1024,
- max_seq_len: int = None,
- opt_batch_size: int = 8,
- max_batch_size: int = 2048,
- max_beam_width: int = 1,
- max_num_tokens: int = 8192,
- opt_num_tokens: int | None = None,
- max_prompt_embedding_table_size: int = 0,
- kv_cache_type: ~tensorrt_llm.bindings.KVCacheType = None,
- gather_context_logits: int = False,
- gather_generation_logits: int = False,
- strongly_typed: bool = True,
- force_num_profiles: int | None = None,
- profiling_verbosity: str = 'layer_names_only',
- enable_debug_output: bool = False,
- max_draft_len: int = 0,
- speculative_decoding_mode: ~tensorrt_llm.models.modeling_utils.SpeculativeDecodingMode = <SpeculativeDecodingMode.NONE: 1>,
- use_refit: bool = False,
- input_timing_cache: str = None,
- output_timing_cache: str = 'model.cache',
- lora_config: ~tensorrt_llm.lora_manager.LoraConfig = <factory>,
- auto_parallel_config: ~tensorrt_llm.auto_parallel.config.AutoParallelConfig = <factory>,
- weight_sparsity: bool = False,
- weight_streaming: bool = False,
- plugin_config: ~tensorrt_llm.plugin.plugin.PluginConfig = <factory>,
- use_strip_plan: bool = False,
- max_encoder_input_len: int = 1024,
- dry_run: bool = False,
- visualize_network: str = None,
- monitor_memory: bool = False,
- use_mrope: bool = False,
- auto_parallel_config: AutoParallelConfig#
- dry_run: bool = False#
- enable_debug_output: bool = False#
- force_num_profiles: int | None = None#
- gather_context_logits: int = False#
- gather_generation_logits: int = False#
- input_timing_cache: str = None#
- kv_cache_type: KVCacheType = None#
- lora_config: LoraConfig#
- max_batch_size: int = 2048#
- max_beam_width: int = 1#
- max_draft_len: int = 0#
- max_encoder_input_len: int = 1024#
- max_input_len: int = 1024#
- max_num_tokens: int = 8192#
- max_prompt_embedding_table_size: int = 0#
- max_seq_len: int = None#
- monitor_memory: bool = False#
- opt_batch_size: int = 8#
- opt_num_tokens: int | None = None#
- output_timing_cache: str = 'model.cache'#
- plugin_config: PluginConfig#
- profiling_verbosity: str = 'layer_names_only'#
- speculative_decoding_mode: SpeculativeDecodingMode = 1#
- strongly_typed: bool = True#
- use_mrope: bool = False#
- use_refit: bool = False#
- use_strip_plan: bool = False#
- visualize_network: str = None#
- weight_sparsity: bool = False#
- weight_streaming: bool = False#
- class tensorrt_llm.llmapi.QuantConfig(
- quant_algo: QuantAlgo | None = None,
- kv_cache_quant_algo: QuantAlgo | None = None,
- group_size: int = 128,
- smoothquant_val: float = 0.5,
- clamp_val: List[float] | None = None,
- use_meta_recipe: bool = False,
- has_zero_point: bool = False,
- pre_quant_scale: bool = False,
- exclude_modules: List[str] | None = None,
Bases:
object
Serializable quantization configuration class, part of the PretrainedConfig.
- Parameters:
quant_algo (tensorrt_llm.quantization.mode.QuantAlgo, optional) – Quantization algorithm. Defaults to None.
kv_cache_quant_algo (tensorrt_llm.quantization.mode.QuantAlgo, optional) – KV cache quantization algorithm. Defaults to None.
group_size (int) – The group size for group-wise quantization. Defaults to 128.
smoothquant_val (float) – The smoothing parameter alpha used in smooth quant. Defaults to 0.5.
clamp_val (List[float], optional) – The clamp values used in FP8 rowwise quantization. Defaults to None.
use_meta_recipe (bool) – Whether to use Meta’s recipe for FP8 rowwise quantization. Defaults to False.
has_zero_point (bool) – Whether to use zero point for quantization. Defaults to False.
pre_quant_scale (bool) – Whether to use pre-quant scale for quantization. Defaults to False.
exclude_modules (List[str], optional) – The module name patterns that are skipped in quantization. Defaults to None.
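A sketch of requesting FP8 quantization for both weights and the KV cache; the model name is a placeholder:

```python
from tensorrt_llm.llmapi import LLM, QuantConfig, QuantAlgo

quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", quant_config=quant_config)
```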
- __init__(
- quant_algo: QuantAlgo | None = None,
- kv_cache_quant_algo: QuantAlgo | None = None,
- group_size: int = 128,
- smoothquant_val: float = 0.5,
- clamp_val: List[float] | None = None,
- use_meta_recipe: bool = False,
- has_zero_point: bool = False,
- pre_quant_scale: bool = False,
- exclude_modules: List[str] | None = None,
- clamp_val: List[float] | None = None#
- exclude_modules: List[str] | None = None#
- classmethod from_dict(
- config: dict,
Create a QuantConfig instance from a dict.
- Parameters:
config (dict) – The dict used to create QuantConfig.
- Returns:
The QuantConfig created from dict.
- Return type:
tensorrt_llm.llmapi.QuantConfig
- group_size: int = 128#
- has_zero_point: bool = False#
- is_module_excluded_from_quantization(name: str) bool [source]#
Check if the module is excluded from quantization.
- Parameters:
name (str) – The name of the module.
- Returns:
True if the module is excluded from quantization, False otherwise.
- Return type:
bool
- pre_quant_scale: bool = False#
- property quant_mode: QuantModeWrapper#
- smoothquant_val: float = 0.5#
- to_dict() dict [source]#
Dump a QuantConfig instance to a dict.
- Returns:
The dict dumped from QuantConfig.
- Return type:
dict
- use_meta_recipe: bool = False#
- class tensorrt_llm.llmapi.QuantAlgo(
- value,
- names=<not given>,
- *values,
- module=None,
- qualname=None,
- type=None,
- start=1,
- boundary=None,
Bases:
StrEnum
- FP8 = 'FP8'#
- FP8_BLOCK_SCALES = 'FP8_BLOCK_SCALES'#
- FP8_PER_CHANNEL_PER_TOKEN = 'FP8_PER_CHANNEL_PER_TOKEN'#
- INT8 = 'INT8'#
- MIXED_PRECISION = 'MIXED_PRECISION'#
- NO_QUANT = 'NO_QUANT'#
- NVFP4 = 'NVFP4'#
- W4A16 = 'W4A16'#
- W4A16_AWQ = 'W4A16_AWQ'#
- W4A16_GPTQ = 'W4A16_GPTQ'#
- W4A8_AWQ = 'W4A8_AWQ'#
- W4A8_QSERVE_PER_CHANNEL = 'W4A8_QSERVE_PER_CHANNEL'#
- W4A8_QSERVE_PER_GROUP = 'W4A8_QSERVE_PER_GROUP'#
- W8A16 = 'W8A16'#
- W8A16_GPTQ = 'W8A16_GPTQ'#
- W8A8_SQ_PER_CHANNEL = 'W8A8_SQ_PER_CHANNEL'#
- W8A8_SQ_PER_CHANNEL_PER_TENSOR_PLUGIN = 'W8A8_SQ_PER_CHANNEL_PER_TENSOR_PLUGIN'#
- W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN = 'W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN'#
- W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN = 'W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN'#
- W8A8_SQ_PER_TENSOR_PLUGIN = 'W8A8_SQ_PER_TENSOR_PLUGIN'#
- class tensorrt_llm.llmapi.CalibConfig(
- *,
- device: Literal['cuda', 'cpu'] = 'cuda',
- calib_dataset: str = 'cnn_dailymail',
- calib_batches: int = 512,
- calib_batch_size: int = 1,
- calib_max_seq_length: int = 512,
- random_seed: int = 1234,
- tokenizer_max_seq_length: int = 2048,
Bases:
BaseModel
Calibration configuration.
- field calib_batch_size: int = 1#
The batch size that the calibration runs.
- field calib_batches: int = 512#
The number of batches that the calibration runs.
- field calib_dataset: str = 'cnn_dailymail'#
The name or local path of calibration dataset.
- field calib_max_seq_length: int = 512#
The maximum sequence length that the calibration runs.
- field device: Literal['cuda', 'cpu'] = 'cuda'#
The device to run calibration.
- classmethod from_dict(
- config: dict,
Create a CalibConfig instance from a dict.
- Parameters:
config (dict) – The dict used to create CalibConfig.
- Returns:
The CalibConfig created from dict.
- Return type:
tensorrt_llm.llmapi.CalibConfig
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field random_seed: int = 1234#
The random seed used for calibration.
- to_dict() dict [source]#
Dump a CalibConfig instance to a dict.
- Returns:
The dict dumped from CalibConfig.
- Return type:
dict
- field tokenizer_max_seq_length: int = 2048#
The maximum sequence length to initialize tokenizer for calibration.
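A sketch of a calibration configuration, typically passed alongside a QuantConfig; the values are illustrative only:

```python
from tensorrt_llm.llmapi import CalibConfig

calib_config = CalibConfig(
    calib_dataset="cnn_dailymail",  # name or local path of the calibration dataset
    calib_batches=256,
    calib_max_seq_length=512,
)
# Round-trip through a plain dict, e.g. for serialization.
restored = CalibConfig.from_dict(calib_config.to_dict())
```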
- class tensorrt_llm.llmapi.BuildCacheConfig(
- cache_root: Path | None = None,
- max_records: int = 10,
- max_cache_storage_gb: float = 256,
Bases:
object
Configuration for the build cache.
- cache_root#
The root directory for the build cache.
- Type:
str
- max_records#
The maximum number of records to store in the cache.
- Type:
int
- max_cache_storage_gb#
The maximum amount of storage (in GB) to use for the cache.
- Type:
float
Note
The build cache assumes the model weights are not changed during execution. If the weights are changed, you should remove the cache manually.
- __init__(
- cache_root: Path | None = None,
- max_records: int = 10,
- max_cache_storage_gb: float = 256,
- property cache_root: Path#
- property max_cache_storage_gb: float#
- property max_records: int#
- class tensorrt_llm.llmapi.RequestError[source]#
Bases:
RuntimeError
The error raised when a request fails.
- class tensorrt_llm.llmapi.MpiCommSession(comm=None, n_workers: int = 1)[source]#
Bases:
MpiSession
- class tensorrt_llm.llmapi.ExtendedRuntimePerfKnobConfig(
- *,
- multi_block_mode: bool = True,
- enable_context_fmha_fp32_acc: bool = False,
- cuda_graph_mode: bool = False,
- cuda_graph_cache_size: int = 0,
Bases:
BaseModel
,PybindMirror
Configuration for extended runtime performance knobs.
- field cuda_graph_cache_size: int = 0#
Number of cuda graphs to be cached in the runtime. The larger the cache, the better the perf, but more GPU memory is consumed.
- field cuda_graph_mode: bool = False#
Whether to use CUDA graph mode.
- field enable_context_fmha_fp32_acc: bool = False#
Whether to enable context FMHA FP32 accumulation.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- field multi_block_mode: bool = True#
Whether to use multi-block mode.
- class tensorrt_llm.llmapi.BatchingType(
- value,
- names=<not given>,
- *values,
- module=None,
- qualname=None,
- type=None,
- start=1,
- boundary=None,
Bases:
StrEnum
- INFLIGHT = 'INFLIGHT'#
- STATIC = 'STATIC'#
- class tensorrt_llm.llmapi.ContextChunkingPolicy(
- value,
- names=<not given>,
- *values,
- module=None,
- qualname=None,
- type=None,
- start=1,
- boundary=None,
Bases:
StrEnum
Context chunking policy.
- EQUAL_PROGRESS = 'EQUAL_PROGRESS'#
- FIRST_COME_FIRST_SERVED = 'FIRST_COME_FIRST_SERVED'#
- class tensorrt_llm.llmapi.DynamicBatchConfig(
- *,
- enable_batch_size_tuning: bool,
- enable_max_num_tokens_tuning: bool,
- dynamic_batch_moving_average_window: int,
Bases:
BaseModel
,PybindMirror
Dynamic batch configuration.
Controls how batch size and token limits are dynamically adjusted at runtime.
- field dynamic_batch_moving_average_window: int [Required]#
The window size of the moving average of input and output lengths, used to calculate the dynamic batch size and max num tokens.
- field enable_batch_size_tuning: bool [Required]#
Controls if the batch size should be tuned dynamically
- field enable_max_num_tokens_tuning: bool [Required]#
Controls if the max num tokens should be tuned dynamically
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].