API Reference

class tensorrt_llm.llmapi.LLM(model: str, tokenizer: str | Path | PreTrainedTokenizerBase | TokenizerBase | None = None, tokenizer_mode: Literal['auto', 'slow'] = 'auto', skip_tokenizer_init: bool = False, trust_remote_code: bool = False, tensor_parallel_size: int = 1, dtype: str = 'auto', revision: str | None = None, tokenizer_revision: str | None = None, **kwargs: Any)[source]

Bases: object

The LLM class is the main class for running an LLM model.

Parameters:
  • model (str or Path) – The model name or a local model directory. Note that if the value could be either a model name or a local model directory, the local model directory will be prioritized.

  • tokenizer (str, Path, TokenizerBase, PreTrainedTokenizerBase, optional) – The name or path of a HuggingFace Transformers tokenizer, or the loaded tokenizer. Defaults to None.

  • tokenizer_mode (Literal['auto', 'slow']) – The tokenizer mode. ‘auto’ will use the fast tokenizer if available, and ‘slow’ will always use the slow tokenizer. The fast tokenizer is based on Hugging Face’s Rust-based tokenizers library and is significantly faster than its slow counterpart. Defaults to ‘auto’.

  • skip_tokenizer_init (bool) – If True, skip initialization of the tokenizer and detokenizer. LLM.generate and LLM.generate_async will then accept prompt token ids as input only. Defaults to False.

  • trust_remote_code (bool) – Whether to trust remote code when downloading model and tokenizer from Hugging Face. Defaults to False.

  • tensor_parallel_size (int) – The number of processes for tensor parallelism. Defaults to 1.

  • dtype (str) – The data type for the model weights and activations. Can be “float16”, “bfloat16”, “float32”, or “auto”. If “auto”, the data type will be automatically inferred from the source model. If the source data type is “float32”, it will be converted to “float16”. Defaults to “auto”.

  • revision (str, optional) – The revision of the model to use. Defaults to None.

  • tokenizer_revision (str, optional) – The revision of the tokenizer to use. Defaults to None.

  • pipeline_parallel_size (int) – The pipeline parallel size. Defaults to 1.

  • load_format (Literal['auto', 'dummy']) – The format of the model weights to load. ‘auto’ will try to load the weights from the provided checkpoint; ‘dummy’ will initialize the weights with random values, which is mainly for profiling. Defaults to ‘auto’.

  • enable_tqdm (bool) – Whether to display a progress bar during model building. Defaults to False.

  • enable_lora (bool) – Enable LoRA adapters. Defaults to False.

  • max_lora_rank (int, optional) – Maximum LoRA rank. If specified, it overrides build_config.lora_config.max_lora_rank. Defaults to None.

  • max_loras (int) – Maximum number of LoRA adapters to be stored in GPU memory. Defaults to 4.

  • max_cpu_loras (int) – Maximum number of LoRA adapters to be stored in CPU memory. Defaults to 4.

  • enable_prompt_adapter (bool) – Enable prompt adapters. Defaults to False.

  • max_prompt_adapter_token (int) – Maximum number of prompt adapter tokens. Defaults to 0.

  • quant_config (QuantConfig, optional) – The quantization configuration for the model. Defaults to None.

  • calib_config (CalibConfig, optional) – The calibration configuration for the model. Defaults to None.

  • build_config (BuildConfig, optional) – The build configuration for the model. Defaults to None.

  • kv_cache_config (KvCacheConfig, optional) – The key-value cache configuration for the model. Defaults to None.

  • enable_chunked_prefill (bool) – Whether to enable chunked prefill. Defaults to False.

  • decoding_config (DecodingConfig, optional) – The decoding configuration for the model. Defaults to None.

  • logits_post_processor_map (Dict[str, Callable], optional) – A map of logit post-processing functions. Defaults to None.

  • iter_stats_max_iterations (int, optional) – The maximum number of iterations for iteration statistics. Defaults to None.

  • request_stats_max_iterations (int, optional) – The maximum number of iterations for request statistics. Defaults to None.

  • workspace (str, optional) – The directory to store intermediate files. Defaults to None.

  • embedding_parallel_mode (str) – The parallel mode for embeddings. Defaults to ‘SHARDING_ALONG_VOCAB’.

  • share_embedding_table (bool) – Whether to share the embedding table. Defaults to False.

  • auto_parallel (bool) – Enable auto parallel mode. Defaults to False.

  • auto_parallel_world_size (int) – The MPI world size for auto parallel. Defaults to 1.

  • moe_tensor_parallel_size (int, optional) – The tensor parallel size for MoE models’ expert weights.

  • moe_expert_parallel_size (int, optional) – The expert parallel size for MoE models’ expert weights.

  • fast_build (bool) – Enable features for faster engine building. This may cause some performance degradation and is currently incompatible with int8/int4 quantization. Defaults to False.

  • enable_build_cache (bool, BuildCacheConfig, optional) – Whether to enable build caching for the model. Defaults to None.

  • peft_cache_config (PeftCacheConfig, optional) – The PEFT cache configuration for the model. Defaults to None.

  • scheduler_config (SchedulerConfig, optional) – The scheduler configuration for the model. Defaults to None.

  • batching_type (BatchingType, optional) – The batching type for the model. Defaults to None.

  • normalize_log_probs (bool) – Whether to normalize log probabilities for the model. Defaults to False.

  • enable_processes_for_single_gpu (bool) – Whether to use a separate worker process even for a single GPU. This helps to improve streaming generation performance. Defaults to False.
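
A minimal usage sketch, assuming the Hugging Face model id below is a placeholder for any supported checkpoint:

    from tensorrt_llm.llmapi import LLM

    # Build (or load) an engine from a Hugging Face checkpoint and run a quick test.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
              tensor_parallel_size=1)
    outputs = llm.generate(["Hello, my name is"])
    print(outputs[0].outputs[0].text)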

__init__(model: str, tokenizer: str | Path | PreTrainedTokenizerBase | TokenizerBase | None = None, tokenizer_mode: Literal['auto', 'slow'] = 'auto', skip_tokenizer_init: bool = False, trust_remote_code: bool = False, tensor_parallel_size: int = 1, dtype: str = 'auto', revision: str | None = None, tokenizer_revision: str | None = None, **kwargs: Any)[source]
generate(inputs: str | List[int] | Sequence[str | List[int]], sampling_params: SamplingParams | List[SamplingParams] | None = None, use_tqdm: bool = True, lora_request: LoRARequest | Sequence[LoRARequest] | None = None, prompt_adapter_request: PromptAdapterRequest | Sequence[PromptAdapterRequest] | None = None) RequestOutput | List[RequestOutput][source]

Generate output for the given prompts in synchronous mode. Synchronous generation accepts either a single prompt or batched prompts.

Parameters:
  • inputs (PromptInputs or Sequence[PromptInputs]) – The prompt text or token ids. It can be a single prompt or batched prompts.

  • sampling_params (SamplingParams, List[SamplingParams], optional) – The sampling params for the generation, a default one will be used if not provided. Defaults to None.

  • use_tqdm (bool) – Whether to use tqdm to display the progress bar. Defaults to True.

  • lora_request (LoRARequest, Sequence[LoRARequest], optional) – LoRA request to use for generation, if any. Defaults to None.

  • prompt_adapter_request (PromptAdapterRequest, Sequence[PromptAdapterRequest], optional) – Prompt Adapter request to use for generation, if any. Defaults to None.

Returns:

The output data of the completion request to the LLM.

Return type:

Union[RequestOutput, List[RequestOutput]]
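
A short sketch of batched synchronous generation, assuming llm is an already constructed LLM instance:

    from tensorrt_llm.llmapi import SamplingParams

    prompts = ["The capital of France is", "The largest planet is"]
    params = SamplingParams(max_tokens=16, temperature=0.8, top_p=0.95)

    # A single SamplingParams applies to every prompt; pass a list of the same
    # length as `prompts` to set per-prompt parameters.
    for output in llm.generate(prompts, sampling_params=params):
        print(output.prompt, "->", output.outputs[0].text)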

generate_async(inputs: str | List[int], sampling_params: SamplingParams | None = None, lora_request: LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None, streaming: bool = False) RequestOutput[source]

Generate output for the given prompt in asynchronous mode. Asynchronous generation accepts a single prompt only.

Parameters:
  • inputs (PromptInputs) – The prompt text or token ids; it must be a single prompt.

  • sampling_params (SamplingParams, optional) – The sampling params for the generation, a default one will be used if not provided. Defaults to None.

  • lora_request (LoRARequest, optional) – LoRA request to use for generation, if any. Defaults to None.

  • prompt_adapter_request (PromptAdapterRequest, optional) – Prompt Adapter request to use for generation, if any. Defaults to None.

  • streaming (bool) – Whether to use the streaming mode for the generation. Defaults to False.

Returns:

The output data of the completion request to the LLM.

Return type:

RequestOutput
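
A streaming sketch, assuming llm is an already constructed LLM instance and that the returned object can be iterated asynchronously when streaming=True:

    import asyncio

    from tensorrt_llm.llmapi import SamplingParams

    async def stream(prompt: str) -> None:
        params = SamplingParams(max_tokens=32)
        # Each iteration yields a partial RequestOutput snapshot; .text holds
        # the text generated so far.
        async for output in llm.generate_async(prompt, sampling_params=params,
                                               streaming=True):
            print(output.outputs[0].text)

    asyncio.run(stream("Explain tensor parallelism in one sentence."))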

save(engine_dir: str)[source]

Save the built engine to the given path.

Parameters:

engine_dir (str) – The path to save the engine.

Returns:

None
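
A sketch of persisting the engine and reusing it later, assuming llm is an already built LLM instance; passing the saved engine directory back as the model argument is assumed to skip the build step:

    from tensorrt_llm.llmapi import LLM

    # Save the built TensorRT engine to disk (path is illustrative).
    llm.save("./my_engine_dir")

    # Later runs can point the LLM constructor at the engine directory.
    llm2 = LLM(model="./my_engine_dir")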

property tokenizer: TokenizerBase | None
property workspace: Path
class tensorrt_llm.llmapi.RequestOutput(generation_result: GenerationResult, prompt: str | None = None, tokenizer: TokenizerBase | None = None)[source]

Bases: GenerationResult

The output data of a completion request to the LLM.

Fields:

  • request_id (int) – The unique ID of the request.

  • prompt (str, optional) – The prompt string of the request.

  • prompt_token_ids (List[int]) – The token ids of the prompt.

  • outputs (List[CompletionOutput]) – The output sequences of the request.

  • context_logits (torch.Tensor, optional) – The logits on the prompt token ids.

  • finished (bool) – Whether the whole request is finished.

__init__(generation_result: GenerationResult, prompt: str | None = None, tokenizer: TokenizerBase | None = None) None[source]
handle_response(response)[source]
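
A small sketch of inspecting these fields after a synchronous request; a single string prompt is assumed to yield a single RequestOutput:

    output = llm.generate("Hello, world")
    print(output.request_id, output.finished)            # request bookkeeping
    print(output.prompt, "->", output.outputs[0].text)   # prompt and completion
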
class tensorrt_llm.llmapi.SamplingParams(*, end_id: int | None = None, pad_id: int | None = None, max_tokens: int = 32, max_new_tokens: int | None = None, bad: List[str] | str | None = None, bad_token_ids: List[int] | None = None, stop: List[str] | str | None = None, stop_token_ids: List[int] | None = None, include_stop_str_in_output: bool = False, embedding_bias: Tensor | None = None, external_draft_tokens_config: ExternalDraftTokensConfig | None = None, logits_post_processor_name: str | None = None, n: int = 1, best_of: int | None = None, use_beam_search: bool = False, beam_width: int = 1, num_return_sequences: int | None = None, top_k: int | None = None, top_p: float | None = None, top_p_min: float | None = None, top_p_reset_ids: int | None = None, top_p_decay: float | None = None, seed: int | None = None, random_seed: int | None = None, temperature: float | None = None, min_tokens: int | None = None, min_length: int | None = None, beam_search_diversity_rate: float | None = None, repetition_penalty: float | None = None, presence_penalty: float | None = None, frequency_penalty: float | None = None, length_penalty: float | None = None, early_stopping: int | None = None, no_repeat_ngram_size: int | None = None, return_log_probs: bool = False, return_context_logits: bool = False, return_generation_logits: bool = False, exclude_input_from_output: bool = True, return_encoder_output: bool = False, ignore_eos: bool = False, detokenize: bool = True, add_special_tokens: bool = True, truncate_prompt_tokens: int | None = None, skip_special_tokens: bool = True, spaces_between_special_tokens: bool = True)[source]

Bases: object

Sampling parameters for text generation.

Parameters:
  • end_id (int, optional) – The end token id. Defaults to None.

  • pad_id (int, optional) – The pad token id. Defaults to None.

  • max_tokens (int) – The maximum number of tokens to generate. Defaults to 32.

  • max_new_tokens (int, optional) – The maximum number of tokens to generate. This argument is being deprecated; please use max_tokens instead. Defaults to None.

  • bad (str, List[str], optional) – A string or a list of strings that the model is not allowed to generate; when one would be produced, generation is redirected so that the bad strings are excluded from the returned output. Defaults to None.

  • bad_token_ids (List[int], optional) – A list of token ids that the model is not allowed to generate; when one would be produced, generation is redirected so that the bad ids are excluded from the returned output. Defaults to None.

  • stop (str, List[str], optional) – A string or a list of strings that stop the generation when they are generated. The returned output will not contain the stop strings unless include_stop_str_in_output is True. Defaults to None.

  • stop_token_ids (List[int], optional) – A list of token ids that stop the generation when they are generated. Defaults to None.

  • include_stop_str_in_output (bool) – Whether to include the stop strings in output text. Defaults to False.

  • embedding_bias (torch.Tensor, optional) – The embedding bias tensor. Expected type is kFP32 and shape is [vocab_size]. Defaults to None.

  • external_draft_tokens_config (ExternalDraftTokensConfig, optional) – The speculative decoding configuration. Defaults to None.

  • logits_post_processor_name (str, optional) – The logits postprocessor name. Must correspond to one of the logits postprocessor names provided to the ExecutorConfig. Defaults to None.

  • n (int) – Number of sequences to generate. Defaults to 1.

  • best_of (int, optional) – Number of sequences to consider for best output. Defaults to None.

  • use_beam_search (bool) – Whether to use beam search. Defaults to False.

  • beam_width (int) – The beam width. Setting it to 1 disables beam search. This parameter will be deprecated from the LLM API in a future release; please use n/best_of/use_beam_search instead. Defaults to 1.

  • num_return_sequences (int, optional) – The number of sequences to return. If set to None, it defaults to the value of beam_width. This parameter will be deprecated from the LLM API in a future release; please use n/best_of/use_beam_search instead. Defaults to None.

  • top_k (int) – Controls the number of logits to sample from. Default is 0 (all logits).

  • top_p (float) – Controls the top-P probability to sample from. Default is 0.0.

  • top_p_min (float) – Controls decay in the top-P algorithm; top_p_min is the lower bound. Default is 1e-6.

  • top_p_reset_ids (int) – Controls decay in the top-P algorithm; indicates where to reset the decay. Default is 1.

  • top_p_decay (float) – Controls decay in the top-P algorithm; the decay value. Default is 1.0.

  • seed (int) – Controls the random seed used by the random number generator in sampling.

  • random_seed (int) – Controls the random seed used by the random number generator in sampling. This argument is being deprecated; please use seed instead.

  • temperature (float) – Controls the modulation of logits when sampling new tokens. It can have values > 0.0. Default is 1.0.

  • min_tokens (int) – Lower bound on the number of tokens to generate. Values < 1 have no effect. Default is 1.

  • min_length (int) – Lower bound on the number of tokens to generate. Values < 1 have no effect. Default is 1. This argument is being deprecated; please use min_tokens instead.

  • beam_search_diversity_rate (float) – Controls the diversity in beam search.

  • repetition_penalty (float) – Used to penalize tokens based on how often they appear in the sequence. It can have any value > 0.0. Values < 1.0 encourage repetition, values > 1.0 discourage it. Default is 1.0.

  • presence_penalty (float) – Used to penalize tokens already present in the sequence (irrespective of the number of appearances). It can have any value. Values < 0.0 encourage repetition, values > 0.0 discourage it. Default is 0.0.

  • frequency_penalty (float) – Used to penalize tokens already present in the sequence (dependent on the number of appearances). It can have any value. Values < 0.0 encourage repetition, values > 0.0 discourage it. Default is 0.0.

  • length_penalty (float) – Controls how to penalize longer sequences in beam search. Default is 0.0.

  • early_stopping (int) – Controls whether the generation process finishes once beam_width sentences are generated (each ending with the end token).

  • no_repeat_ngram_size (int) – Controls the size of n-grams that are not allowed to repeat in the output. Default is 1 << 30.

  • return_log_probs (bool) – Controls if Result should contain log probabilities. Default is False.

  • return_context_logits (bool) – Controls if Result should contain the context logits. Default is False.

  • return_generation_logits (bool) – Controls if Result should contain the generation logits. Default is False.

  • exclude_input_from_output (bool) – Controls whether the output tokens in Result exclude the input tokens. Default is True.

  • return_encoder_output (bool) – Controls if Result should contain encoder output hidden states (for encoder-only and encoder-decoder models). Default is False.

  • ignore_eos (bool) – Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. Defaults to False.

  • detokenize (bool) – Whether to detokenize the output. Defaults to True.

  • add_special_tokens (bool) – Whether to add special tokens to the prompt. Defaults to True.

  • truncate_prompt_tokens (int, optional) – If set to an integer k, will use only the last k tokens from the prompt (i.e., left truncation). Defaults to None.

  • skip_special_tokens (bool) – Whether to skip special tokens in the output. Defaults to True.

  • spaces_between_special_tokens (bool) – Whether to add spaces between special tokens in the output. Defaults to True.
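
A configuration sketch showing a few common fields, assuming llm is an already constructed LLM instance; the values are illustrative only:

    from tensorrt_llm.llmapi import SamplingParams

    params = SamplingParams(
        max_tokens=64,        # stop after 64 generated tokens
        temperature=0.7,      # soften the logits before sampling
        top_k=50,             # sample from the 50 most likely tokens
        top_p=0.9,            # nucleus sampling threshold
        stop=["\n\n"],        # stop when a blank line is generated
        seed=1234,            # make sampling reproducible
    )
    outputs = llm.generate("Write a haiku about GPUs.", sampling_params=params)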

__init__(*, end_id: int | None = None, pad_id: int | None = None, max_tokens: int = 32, max_new_tokens: int | None = None, bad: List[str] | str | None = None, bad_token_ids: List[int] | None = None, stop: List[str] | str | None = None, stop_token_ids: List[int] | None = None, include_stop_str_in_output: bool = False, embedding_bias: Tensor | None = None, external_draft_tokens_config: ExternalDraftTokensConfig | None = None, logits_post_processor_name: str | None = None, n: int = 1, best_of: int | None = None, use_beam_search: bool = False, beam_width: int = 1, num_return_sequences: int | None = None, top_k: int | None = None, top_p: float | None = None, top_p_min: float | None = None, top_p_reset_ids: int | None = None, top_p_decay: float | None = None, seed: int | None = None, random_seed: int | None = None, temperature: float | None = None, min_tokens: int | None = None, min_length: int | None = None, beam_search_diversity_rate: float | None = None, repetition_penalty: float | None = None, presence_penalty: float | None = None, frequency_penalty: float | None = None, length_penalty: float | None = None, early_stopping: int | None = None, no_repeat_ngram_size: int | None = None, return_log_probs: bool = False, return_context_logits: bool = False, return_generation_logits: bool = False, exclude_input_from_output: bool = True, return_encoder_output: bool = False, ignore_eos: bool = False, detokenize: bool = True, add_special_tokens: bool = True, truncate_prompt_tokens: int | None = None, skip_special_tokens: bool = True, spaces_between_special_tokens: bool = True) None
add_special_tokens: bool
bad: List[str] | str | None
bad_token_ids: List[int] | None
beam_search_diversity_rate: float | None
beam_width: int
best_of: int | None
detokenize: bool
early_stopping: int | None
embedding_bias: Tensor | None
end_id: int | None
exclude_input_from_output: bool
external_draft_tokens_config: ExternalDraftTokensConfig | None
frequency_penalty: float | None
property greedy_decoding: bool
ignore_eos: bool
include_stop_str_in_output: bool
length_penalty: float | None
logits_post_processor_name: str | None
max_new_tokens: int | None
max_tokens: int
min_length: int | None
min_tokens: int | None
n: int
no_repeat_ngram_size: int | None
num_return_sequences: int | None
pad_id: int | None
presence_penalty: float | None
random_seed: int | None
repetition_penalty: float | None
return_context_logits: bool
return_encoder_output: bool
return_generation_logits: bool
return_log_probs: bool
seed: int | None
setup(tokenizer, add_special_tokens: bool = False) SamplingParams[source]
skip_special_tokens: bool
spaces_between_special_tokens: bool
stop: List[str] | str | None
stop_token_ids: List[int] | None
temperature: float | None
top_k: int | None
top_p: float | None
top_p_decay: float | None
top_p_min: float | None
top_p_reset_ids: int | None
truncate_prompt_tokens: int | None
class tensorrt_llm.llmapi.KvCacheConfig

Bases: pybind11_object

__init__(self: tensorrt_llm.bindings.executor.KvCacheConfig, enable_block_reuse: bool = False, max_tokens: int | None = None, max_attention_window: list[int] | None = None, sink_token_length: int | None = None, free_gpu_memory_fraction: float | None = None, host_cache_size: int | None = None, onboard_blocks: bool = True, cross_kv_cache_fraction: float | None = None, secondary_offload_min_priority: int | None = None, event_buffer_max_size: int = 0, *, runtime_defaults: tensorrt_llm.bindings.executor.RuntimeDefaults | None = None) None
property cross_kv_cache_fraction
property enable_block_reuse
property event_buffer_max_size
fill_empty_fields_from_runtime_defaults(self: tensorrt_llm.bindings.executor.KvCacheConfig, arg0: tensorrt_llm.bindings.executor.RuntimeDefaults) None
property free_gpu_memory_fraction
property host_cache_size
property max_attention_window
property max_tokens
property onboard_blocks
property secondary_offload_min_priority
property sink_token_length
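
A sketch of bounding KV-cache memory and enabling block reuse, then passing the config to the LLM constructor (values are illustrative):

    from tensorrt_llm.llmapi import LLM, KvCacheConfig

    kv_cache_config = KvCacheConfig(
        enable_block_reuse=True,       # reuse cached blocks across requests
        free_gpu_memory_fraction=0.8,  # cap the KV cache at 80% of free GPU memory
    )
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
              kv_cache_config=kv_cache_config)
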
class tensorrt_llm.llmapi.SchedulerConfig

Bases: pybind11_object

__init__(self: tensorrt_llm.bindings.executor.SchedulerConfig, capacity_scheduler_policy: tensorrt_llm.bindings.executor.CapacitySchedulerPolicy = CapacitySchedulerPolicy.GUARANTEED_NO_EVICT, context_chunking_policy: tensorrt_llm.bindings.executor.ContextChunkingPolicy | None = None, dynamic_batch_config: tensorrt_llm.bindings.executor.DynamicBatchConfig | None = None) None
property capacity_scheduler_policy
property context_chunking_policy
property dynamic_batch_config
class tensorrt_llm.llmapi.CapacitySchedulerPolicy

Bases: pybind11_object

Members:

MAX_UTILIZATION

GUARANTEED_NO_EVICT

STATIC_BATCH

GUARANTEED_NO_EVICT = <CapacitySchedulerPolicy.GUARANTEED_NO_EVICT: 1>
MAX_UTILIZATION = <CapacitySchedulerPolicy.MAX_UTILIZATION: 0>
STATIC_BATCH = <CapacitySchedulerPolicy.STATIC_BATCH: 2>
__init__(self: tensorrt_llm.bindings.executor.CapacitySchedulerPolicy, value: int) None
property name
property value
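
A sketch combining the two classes above; MAX_UTILIZATION packs requests more aggressively at the risk of evictions, while the default GUARANTEED_NO_EVICT is more conservative:

    from tensorrt_llm.llmapi import (LLM, CapacitySchedulerPolicy,
                                     SchedulerConfig)

    scheduler_config = SchedulerConfig(
        capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION)
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
              scheduler_config=scheduler_config)
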
class tensorrt_llm.llmapi.BuildConfig(max_input_len: int = 1024, max_seq_len: int = None, opt_batch_size: int = 8, max_batch_size: int = 2048, max_beam_width: int = 1, max_num_tokens: int = 8192, opt_num_tokens: Optional[int] = None, max_prompt_embedding_table_size: int = 0, kv_cache_type: tensorrt_llm.bindings.KVCacheType = None, gather_context_logits: int = False, gather_generation_logits: int = False, strongly_typed: bool = True, force_num_profiles: Optional[int] = None, profiling_verbosity: str = 'layer_names_only', enable_debug_output: bool = False, max_draft_len: int = 0, speculative_decoding_mode: tensorrt_llm.models.modeling_utils.SpeculativeDecodingMode = <SpeculativeDecodingMode.NONE: 1>, use_refit: bool = False, input_timing_cache: str = None, output_timing_cache: str = 'model.cache', lora_config: tensorrt_llm.lora_manager.LoraConfig = <factory>, auto_parallel_config: tensorrt_llm.auto_parallel.config.AutoParallelConfig = <factory>, weight_sparsity: bool = False, weight_streaming: bool = False, plugin_config: tensorrt_llm.plugin.plugin.PluginConfig = <factory>, use_strip_plan: bool = False, max_encoder_input_len: int = 1024, use_fused_mlp: bool = True, dry_run: bool = False, visualize_network: bool = False, monitor_memory: bool = False, use_mrope: bool = False)[source]

Bases: object

__init__(max_input_len: int = 1024, max_seq_len: int = None, opt_batch_size: int = 8, max_batch_size: int = 2048, max_beam_width: int = 1, max_num_tokens: int = 8192, opt_num_tokens: int | None = None, max_prompt_embedding_table_size: int = 0, kv_cache_type: ~tensorrt_llm.bindings.KVCacheType = None, gather_context_logits: int = False, gather_generation_logits: int = False, strongly_typed: bool = True, force_num_profiles: int | None = None, profiling_verbosity: str = 'layer_names_only', enable_debug_output: bool = False, max_draft_len: int = 0, speculative_decoding_mode: ~tensorrt_llm.models.modeling_utils.SpeculativeDecodingMode = SpeculativeDecodingMode.NONE, use_refit: bool = False, input_timing_cache: str = None, output_timing_cache: str = 'model.cache', lora_config: ~tensorrt_llm.lora_manager.LoraConfig = <factory>, auto_parallel_config: ~tensorrt_llm.auto_parallel.config.AutoParallelConfig = <factory>, weight_sparsity: bool = False, weight_streaming: bool = False, plugin_config: ~tensorrt_llm.plugin.plugin.PluginConfig = <factory>, use_strip_plan: bool = False, max_encoder_input_len: int = 1024, use_fused_mlp: bool = True, dry_run: bool = False, visualize_network: bool = False, monitor_memory: bool = False, use_mrope: bool = False) None
auto_parallel_config: AutoParallelConfig
dry_run: bool = False
enable_debug_output: bool = False
force_num_profiles: int | None = None
classmethod from_dict(config, plugin_config=None)[source]
classmethod from_json_file(config_file, plugin_config=None)[source]
gather_context_logits: int = False
gather_generation_logits: int = False
input_timing_cache: str = None
kv_cache_type: KVCacheType = None
lora_config: LoraConfig
max_batch_size: int = 2048
max_beam_width: int = 1
max_draft_len: int = 0
max_encoder_input_len: int = 1024
max_input_len: int = 1024
max_num_tokens: int = 8192
max_prompt_embedding_table_size: int = 0
max_seq_len: int = None
monitor_memory: bool = False
opt_batch_size: int = 8
opt_num_tokens: int | None = None
output_timing_cache: str = 'model.cache'
plugin_config: PluginConfig
profiling_verbosity: str = 'layer_names_only'
speculative_decoding_mode: SpeculativeDecodingMode = 1
strongly_typed: bool = True
to_dict()[source]
update(**kwargs)[source]
update_from_dict(config: dict)[source]
update_kv_cache_type(model_architecture: str)[source]
use_fused_mlp: bool = True
use_mrope: bool = False
use_refit: bool = False
use_strip_plan: bool = False
visualize_network: bool = False
weight_sparsity: bool = False
weight_streaming: bool = False
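
A sketch of overriding a few build-time limits before engine construction (values are illustrative):

    from tensorrt_llm.llmapi import LLM, BuildConfig

    build_config = BuildConfig(
        max_input_len=2048,   # longest prompt the engine will accept
        max_seq_len=4096,     # prompt plus generated tokens
        max_batch_size=64,    # largest batch size baked into the engine
    )
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
              build_config=build_config)
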
class tensorrt_llm.llmapi.QuantConfig(quant_algo: QuantAlgo | None = None, kv_cache_quant_algo: QuantAlgo | None = None, group_size: int | None = 128, smoothquant_val: float = 0.5, clamp_val: List[float] | None = None, use_meta_recipe: bool = False, has_zero_point: bool | None = False, pre_quant_scale: bool | None = False, exclude_modules: List[str] | None = None)[source]

Bases: object

Serializable quantization configuration class; part of the PretrainedConfig.

__init__(quant_algo: QuantAlgo | None = None, kv_cache_quant_algo: QuantAlgo | None = None, group_size: int | None = 128, smoothquant_val: float = 0.5, clamp_val: List[float] | None = None, use_meta_recipe: bool = False, has_zero_point: bool | None = False, pre_quant_scale: bool | None = False, exclude_modules: List[str] | None = None) None
clamp_val: List[float] | None = None
exclude_modules: List[str] | None = None
classmethod from_dict(config: dict)[source]
get_modelopt_kv_cache_dtype()[source]
get_modelopt_qformat()[source]
get_quant_cfg(module_name=None)[source]
group_size: int | None = 128
has_zero_point: bool | None = False
kv_cache_quant_algo: QuantAlgo | None = None
property layer_quant_mode: QuantMode
pre_quant_scale: bool | None = False
quant_algo: QuantAlgo | None = None
property quant_mode: QuantModeWrapper
property requires_calibration
property requires_modelopt_quantization
smoothquant_val: float = 0.5
to_dict()[source]
use_meta_recipe: bool = False
property use_plugin_sq
class tensorrt_llm.llmapi.QuantAlgo(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: StrEnum

FP8 = 'FP8'
FP8_PER_CHANNEL_PER_TOKEN = 'FP8_PER_CHANNEL_PER_TOKEN'
INT8 = 'INT8'
MIXED_PRECISION = 'MIXED_PRECISION'
NO_QUANT = 'NO_QUANT'
W4A16 = 'W4A16'
W4A16_AWQ = 'W4A16_AWQ'
W4A16_GPTQ = 'W4A16_GPTQ'
W4A8_AWQ = 'W4A8_AWQ'
W4A8_QSERVE_PER_CHANNEL = 'W4A8_QSERVE_PER_CHANNEL'
W4A8_QSERVE_PER_GROUP = 'W4A8_QSERVE_PER_GROUP'
W8A16 = 'W8A16'
W8A16_GPTQ = 'W8A16_GPTQ'
W8A8_SQ_PER_CHANNEL = 'W8A8_SQ_PER_CHANNEL'
W8A8_SQ_PER_CHANNEL_PER_TENSOR_PLUGIN = 'W8A8_SQ_PER_CHANNEL_PER_TENSOR_PLUGIN'
W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN = 'W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN'
W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN = 'W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN'
W8A8_SQ_PER_TENSOR_PLUGIN = 'W8A8_SQ_PER_TENSOR_PLUGIN'
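
A sketch of requesting FP8 quantization for both the weights and the KV cache; whether a given algorithm needs a calibration dataset can be checked via requires_calibration:

    from tensorrt_llm.llmapi import LLM, QuantAlgo, QuantConfig

    quant_config = QuantConfig(
        quant_algo=QuantAlgo.FP8,
        kv_cache_quant_algo=QuantAlgo.FP8,
    )
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
              quant_config=quant_config)
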
class tensorrt_llm.llmapi.CalibConfig(device: Literal['cuda', 'cpu'] = 'cuda', calib_dataset: str = 'cnn_dailymail', calib_batches: int = 512, calib_batch_size: int = 1, calib_max_seq_length: int = 512, random_seed: int = 1234, tokenizer_max_seq_length: int = 2048)[source]

Bases: object

Calibration configuration.

Parameters:
  • device (Literal['cuda', 'cpu'], default='cuda') – The device to run calibration.

  • calib_dataset (str, default='cnn_dailymail') – The name or local path of calibration dataset.

  • calib_batches (int, default=512) – The number of batches that the calibration runs.

  • calib_batch_size (int, default=1) – The batch size that the calibration runs.

  • calib_max_seq_length (int, default=512) – The maximum sequence length that the calibration runs.

  • random_seed (int, default=1234) – The random seed used for calibration.

  • tokenizer_max_seq_length (int, default=2048) – The maximum sequence length used to initialize the tokenizer for calibration.

__init__(device: Literal['cuda', 'cpu'] = 'cuda', calib_dataset: str = 'cnn_dailymail', calib_batches: int = 512, calib_batch_size: int = 1, calib_max_seq_length: int = 512, random_seed: int = 1234, tokenizer_max_seq_length: int = 2048) None
calib_batch_size: int
calib_batches: int
calib_dataset: str
calib_max_seq_length: int
device: Literal['cuda', 'cpu']
classmethod from_dict(config: dict)[source]
random_seed: int
to_dict()[source]
tokenizer_max_seq_length: int
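
A sketch pairing a calibration config with a quantization algorithm that requires calibration (dataset and sizes are illustrative):

    from tensorrt_llm.llmapi import LLM, CalibConfig, QuantAlgo, QuantConfig

    calib_config = CalibConfig(calib_batches=256, calib_max_seq_length=1024)
    quant_config = QuantConfig(quant_algo=QuantAlgo.W8A8_SQ_PER_CHANNEL)
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
              quant_config=quant_config,
              calib_config=calib_config)
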
class tensorrt_llm.llmapi.BuildCacheConfig(cache_root: Path | None = None, max_records: int = 10, max_cache_storage_gb: float = 256)[source]

Bases: object

Configuration for the build cache.

cache_root

The root directory for the build cache.

Type:

str

max_records

The maximum number of records to store in the cache.

Type:

int

max_cache_storage_gb

The maximum amount of storage (in GB) to use for the cache.

Type:

float

Note

The build cache assumes that the model weights do not change during execution. If the weights change, remove the caches manually.

__init__(cache_root: Path | None = None, max_records: int = 10, max_cache_storage_gb: float = 256)[source]
property cache_root: Path
property max_cache_storage_gb: float
property max_records: int
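
A sketch of enabling the build cache through the LLM constructor's enable_build_cache argument (the cache location is illustrative):

    from pathlib import Path

    from tensorrt_llm.llmapi import LLM, BuildCacheConfig

    cache_config = BuildCacheConfig(cache_root=Path("/tmp/trtllm_build_cache"),
                                    max_records=5)
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
              enable_build_cache=cache_config)
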
class tensorrt_llm.llmapi.RequestError[source]

Bases: RuntimeError

The error raised when a request fails.
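
A sketch of handling a failed request, assuming llm is an already constructed LLM instance:

    from tensorrt_llm.llmapi import RequestError

    try:
        outputs = llm.generate("Some prompt")
    except RequestError as err:
        # Raised when the underlying executor rejects or fails the request.
        print(f"Generation failed: {err}")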

class tensorrt_llm.llmapi.NoStatsAvailable[source]

Bases: Exception