API Reference
- class tensorrt_llm.llmapi.LLM(model: str, tokenizer: str | Path | PreTrainedTokenizerBase | TokenizerBase | None = None, tokenizer_mode: Literal['auto', 'slow'] = 'auto', skip_tokenizer_init: bool = False, trust_remote_code: bool = False, tensor_parallel_size: int = 1, dtype: str = 'auto', revision: str | None = None, tokenizer_revision: str | None = None, speculative_model: str | None = None, **kwargs: Any)[source]
Bases:
object
The LLM class is the main class for running an LLM.
- Parameters:
model (str or Path) – The model name or a local model directory. If the value could be either a model name or a local model directory, the local model directory takes priority.
tokenizer (str, Path, TokenizerBase, PreTrainedTokenizerBase, optional) – The name or path of a Hugging Face Transformers tokenizer, or the loaded tokenizer. Defaults to None.
tokenizer_mode (Literal['auto', 'slow']) – The tokenizer mode. ‘auto’ will use the fast tokenizer if available, and ‘slow’ will always use the slow tokenizer. The fast tokenizer is based on Hugging Face’s Rust library tokenizers, which achieves a significant speed-up compared to its slow counterpart. Defaults to ‘auto’.
skip_tokenizer_init (bool) – If True, skip initialization of the tokenizer and detokenizer. LLM.generate and LLM.generate_async will then accept only prompt token ids as input. Defaults to False.
trust_remote_code (bool) – Whether to trust remote code when downloading model and tokenizer from Hugging Face. Defaults to False.
tensor_parallel_size (int) – The number of processes for tensor parallelism. Defaults to 1.
dtype (str) – The data type for the model weights and activations. Can be “float16”, “bfloat16”, “float32”, or “auto”. If “auto”, the data type will be automatically inferred from the source model. If the source data type is “float32”, it will be converted to “float16”. Defaults to “auto”.
revision (str, optional) – The revision of the model to use. Defaults to None.
tokenizer_revision (str, optional) – The revision of the tokenizer to use. Defaults to None.
pipeline_parallel_size (int) – The pipeline parallel size. Defaults to 1.
context_parallel_size (int) – The context parallel size. Defaults to 1.
load_format (Literal['auto', 'dummy']) – The format of the model weights to load. * ‘auto’ will try to load the weights from the provided checkpoint. * ‘dummy’ will initialize the weights with random values, which is mainly for profiling. Defaults to ‘auto’.
enable_tqdm (bool) – Whether to display a progress bar during model building. Defaults to False.
enable_lora (bool) – Enable LoRA adapters. Defaults to False.
max_lora_rank (int, optional) – Maximum LoRA rank. If specified, it overrides build_config.lora_config.max_lora_rank. Defaults to None.
max_loras (int) – Maximum number of LoRA adapters to be stored in GPU memory. Defaults to 4.
max_cpu_loras (int) – Maximum number of LoRA adapters to be stored in CPU memory. Defaults to 4.
enable_prompt_adapter (bool) – Enable prompt adapters. Defaults to False.
max_prompt_adapter_token (int) – Maximum number of prompt adapter tokens. Defaults to 0.
quant_config (QuantConfig, optional) – The quantization configuration for the model. Defaults to None.
calib_config (CalibConfig, optional) – The calibration configuration for the model. Defaults to None.
build_config (BuildConfig, optional) – The build configuration for the model. Defaults to None.
kv_cache_config (KvCacheConfig, optional) – The key-value cache configuration for the model. Defaults to None.
enable_chunked_prefill (bool) – Whether to enable chunked prefill. Defaults to False.
decoding_config (DecodingConfig, optional) – The decoding configuration for the model. Defaults to None.
guided_decoding_backend (str, optional) – The guided decoding backend, currently supports ‘xgrammar’. Defaults to None.
logits_post_processor_map (Dict[str, Callable], optional) – A map of logit post-processing functions. Defaults to None.
iter_stats_max_iterations (int, optional) – The maximum number of iterations for iteration statistics. Defaults to None.
request_stats_max_iterations (int, optional) – The maximum number of iterations for request statistics. Defaults to None.
workspace (str, optional) – The directory to store intermediate files. Defaults to None.
embedding_parallel_mode (str) – The parallel mode for embeddings. Defaults to ‘SHARDING_ALONG_VOCAB’.
auto_parallel (bool) – Enable auto parallel mode. Defaults to False.
auto_parallel_world_size (int) – The MPI world size for auto parallel. Defaults to 1.
moe_tensor_parallel_size (int, optional) – The tensor parallel size for the expert weights of MoE models.
moe_expert_parallel_size (int, optional) – The expert parallel size for the expert weights of MoE models.
fast_build (bool) – Enable features for faster engine building. This may cause some performance degradation and is currently incompatible with int8/int4 quantization. Defaults to False.
enable_build_cache (bool, BuildCacheConfig, optional) – Whether to enable build caching for the model. Defaults to None.
peft_cache_config (PeftCacheConfig, optional) – The PEFT cache configuration for the model. Defaults to None.
scheduler_config (SchedulerConfig, optional) – The scheduler configuration for the model. Defaults to None.
speculative_config (LookaheadDecodingConfig or other speculative configurations, optional) – The speculative decoding configuration. Defaults to None.
batching_type (BatchingType, optional) – The batching type for the model. Defaults to None.
normalize_log_probs (bool) – Whether to normalize log probabilities for the model. Defaults to False.
max_batch_size (int, optional) – The maximum batch size for runtime. Defaults to None.
max_num_tokens (int, optional) – The maximum number of tokens for runtime. Defaults to None.
extended_runtime_perf_knob_config (ExtendedRuntimePerfKnobConfig, optional) – The extended runtime performance knob configuration for the model. Defaults to None.
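The `dtype` rule described in the parameter list can be sketched in plain Python. This only illustrates the documented behavior and is not the library's code; `resolve_dtype` and `source_dtype` are hypothetical names introduced for the example.

```python
ALLOWED_DTYPES = {"float16", "bfloat16", "float32", "auto"}


def resolve_dtype(requested: str, source_dtype: str) -> str:
    """Return the dtype a model would run in under the documented 'auto' rule.

    `source_dtype` stands for the data type of the source model
    (hypothetical input; how the library obtains it is not covered here).
    """
    if requested not in ALLOWED_DTYPES:
        raise ValueError(f"unsupported dtype: {requested!r}")
    if requested != "auto":
        return requested
    # Documented rule: a float32 source is converted to float16 under 'auto'.
    return "float16" if source_dtype == "float32" else source_dtype


print(resolve_dtype("auto", "float32"))      # float16
print(resolve_dtype("auto", "bfloat16"))     # bfloat16
print(resolve_dtype("float32", "bfloat16"))  # float32
```

An explicit `dtype` always wins in this sketch; only "auto" consults the source data type.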
- __init__(model: str, tokenizer: str | Path | PreTrainedTokenizerBase | TokenizerBase | None = None, tokenizer_mode: Literal['auto', 'slow'] = 'auto', skip_tokenizer_init: bool = False, trust_remote_code: bool = False, tensor_parallel_size: int = 1, dtype: str = 'auto', revision: str | None = None, tokenizer_revision: str | None = None, speculative_model: str | None = None, **kwargs: Any)[source]
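A minimal construction sketch, assuming TensorRT-LLM is installed on the machine that eventually calls `create_llm`. The helper names `make_llm_kwargs` and `create_llm`, the placeholder path `path/to/model`, and the `tensor_parallel_size >= 1` check are all assumptions of this example, not part of the API.

```python
from typing import Any, Dict


def make_llm_kwargs(model: str, tensor_parallel_size: int = 1,
                    dtype: str = "auto", **extra: Any) -> Dict[str, Any]:
    """Collect constructor arguments; keys match the documented parameter names."""
    if tensor_parallel_size < 1:
        # Sanity check added by this sketch, not documented library behavior.
        raise ValueError("tensor_parallel_size must be >= 1")
    kwargs: Dict[str, Any] = {
        "model": model,
        "tensor_parallel_size": tensor_parallel_size,
        "dtype": dtype,
    }
    kwargs.update(extra)  # e.g. trust_remote_code=True, enable_lora=True
    return kwargs


def create_llm(**kwargs: Any):
    """Build the LLM. Needs TensorRT-LLM and a supported GPU at call time."""
    from tensorrt_llm.llmapi import LLM  # imported lazily on purpose
    return LLM(**kwargs)


# "path/to/model" is a placeholder, not a real checkpoint.
kwargs = make_llm_kwargs("path/to/model", tensor_parallel_size=2,
                         trust_remote_code=False)
print(kwargs)
# llm = create_llm(**kwargs)  # run only where TensorRT-LLM is available
```

Keeping argument collection separate from construction lets the configuration be inspected or logged on machines without a GPU.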
- generate(inputs: str | List[int] | Sequence[str | List[int]], sampling_params: SamplingParams | List[SamplingParams] | None = None, use_tqdm: bool = True, lora_request: LoRARequest | Sequence[LoRARequest] | None = None, prompt_adapter_request: PromptAdapterRequest | Sequence[PromptAdapterRequest] | None = None) RequestOutput | List[RequestOutput] [source]
Generate output for the given prompts in synchronous mode. Synchronous generation accepts either a single prompt or a batch of prompts.
- Parameters:
inputs (PromptInputs or Sequence[PromptInputs]) – The prompt text or token ids. It can be a single prompt or a batch of prompts.
sampling_params (SamplingParams, List[SamplingParams], optional) – The sampling parameters for the generation. A default instance is used if not provided. Defaults to None.
use_tqdm (bool) – Whether to use tqdm to display the progress bar. Defaults to True.
lora_request (LoRARequest, Sequence[LoRARequest], optional) – LoRA request to use for generation, if any. Defaults to None.
prompt_adapter_request (PromptAdapterRequest, Sequence[PromptAdapterRequest], optional) – Prompt Adapter request to use for generation, if any. Defaults to None.
- Returns:
The output data of the completion request to the LLM.
- Return type:
Union[RequestOutput, List[RequestOutput]]
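Because `generate` accepts a string, a list of token ids, or a sequence of either, callers often need to tell a single prompt from a batch. The sketch below derives that distinction from the type signature alone. `is_single_prompt` and `run_generate` are hypothetical helpers, and the assumption that a single prompt yields a single `RequestOutput` while a batch yields a list is inferred from the documented return type.

```python
from typing import Any, List, Sequence, Union

Prompt = Union[str, List[int]]


def is_single_prompt(inputs: Union[Prompt, Sequence[Prompt]]) -> bool:
    """True for one text prompt or one list of token ids, False for a batch."""
    if isinstance(inputs, str):
        return True
    if len(inputs) == 0:
        raise ValueError("inputs must not be empty")
    # A flat list of ints is a single tokenized prompt.
    return all(isinstance(item, int) for item in inputs)


def run_generate(llm: Any, inputs: Union[Prompt, Sequence[Prompt]]) -> list:
    """Call llm.generate() and always hand back a list of results."""
    outputs = llm.generate(inputs)
    return [outputs] if is_single_prompt(inputs) else list(outputs)


print(is_single_prompt("Hello"))            # True
print(is_single_prompt([101, 2023, 102]))   # True
print(is_single_prompt(["Hello", "Hi"]))    # False
```

Normalizing to a list keeps downstream code free of `isinstance` checks on the result.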
- generate_async(inputs: str | List[int], sampling_params: SamplingParams | None = None, lora_request: LoRARequest | None = None, prompt_adapter_request: PromptAdapterRequest | None = None, streaming: bool = False) RequestOutput [source]
Generate output for the given prompt in asynchronous mode. Asynchronous generation accepts a single prompt only.
- Parameters:
inputs (PromptInputs) – The prompt text or token ids. It must be a single prompt.
sampling_params (SamplingParams, optional) – The sampling parameters for the generation. A default instance is used if not provided. Defaults to None.
lora_request (LoRARequest, optional) – LoRA request to use for generation, if any. Defaults to None.
prompt_adapter_request (PromptAdapterRequest, optional) – Prompt Adapter request to use for generation, if any. Defaults to None.
streaming (bool) – Whether to use the streaming mode for the generation. Defaults to False.
- Returns:
The output data of the completion request to the LLM.
- Return type:
RequestOutput
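A sketch of consuming streamed results. It assumes that the object returned by `generate_async(..., streaming=True)` can be iterated with `async for`, which this reference does not state explicitly. `collect_stream` is a hypothetical helper, and `fake_stream` is a stand-in so the example runs without a model.

```python
import asyncio
from typing import Any, AsyncIterable, List


async def collect_stream(stream: AsyncIterable[Any]) -> List[Any]:
    """Gather every partial result from a streaming generation."""
    chunks = []
    async for partial in stream:
        chunks.append(partial)
    return chunks


async def fake_stream():
    """Stand-in for llm.generate_async(prompt, streaming=True)."""
    for text in ("Hel", "Hello", "Hello!"):
        await asyncio.sleep(0)  # yield control to the event loop between chunks
        yield text


chunks = asyncio.run(collect_stream(fake_stream()))
print(chunks[-1])  # Hello!
```

With a real model, the stand-in would be replaced by the call to `generate_async`; the consuming code stays the same under the stated assumption.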
- save(engine_dir: str)[source]
Save the built engine to the given path.
- Parameters:
engine_dir (str) – The path to save the engine.
- Returns:
None
- property tokenizer: TokenizerBase | None
- property workspace: Path
- class tensorrt_llm.llmapi.RequestOutput(generation_result: GenerationResult, prompt: str | None = None, tokenizer: TokenizerBase | None = None)[source]
Bases:
GenerationResult
The output data of a completion request to the LLM.
- Parameters:
request_id (int) – The unique ID of the request.
prompt (str, optional) – The prompt string of the request.
prompt_token_ids (List[int]) – The token ids of the prompt.
outputs (List[CompletionOutput]) – The output sequences of the request.
context_logits (torch.Tensor, optional) – The logits on the prompt token ids.
finished (bool) – Whether the whole request is finished.
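The documented fields are enough to build a small summary of a request. The sketch below touches only the attribute names listed above; `summarize_request` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in rather than a real `RequestOutput`.

```python
from types import SimpleNamespace
from typing import Any, Dict


def summarize_request(output: Any) -> Dict[str, Any]:
    """Summarize a RequestOutput-like object using only documented fields."""
    return {
        "request_id": output.request_id,
        "prompt_tokens": len(output.prompt_token_ids),
        "num_sequences": len(output.outputs),
        "finished": output.finished,
    }


# Stand-in with the documented attribute names (not a real RequestOutput).
fake = SimpleNamespace(request_id=7, prompt="Hi", prompt_token_ids=[1, 2, 3],
                       outputs=["seq-a", "seq-b"], context_logits=None,
                       finished=True)
print(summarize_request(fake))
```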
- class tensorrt_llm.llmapi.GuidedDecodingParams(*, json: str | BaseModel | dict | None = None, regex: str | None = None, grammar: str | None = None, json_object: bool = False)[source]
Bases:
object
Guided decoding parameters for text generation. Only one of the fields can be in effect at a time.
- Parameters:
json (str, BaseModel, dict, optional) – The generated text conforms to JSON format with additional user-specified restrictions, namely a schema. Defaults to None.
regex (str, optional) – The generated text conforms to the user-specified regular expression. Defaults to None.
grammar (str, optional) – The generated text conforms to the user-specified extended Backus-Naur form (EBNF) grammar. Defaults to None.
json_object (bool) – If True, the generated text conforms to JSON format. Defaults to False.
- __init__(*, json: str | BaseModel | dict | None = None, regex: str | None = None, grammar: str | None = None, json_object: bool = False) None
- grammar: str | None
- json: str | BaseModel | dict | None
- json_object: bool
- property num_guides
- regex: str | None
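The "only one field at a time" constraint can be illustrated with a stand-in dataclass. `GuideSpec` is hypothetical and merely mirrors the four documented fields; counting truthy fields is one plausible reading of `num_guides`, not a statement about the library's implementation.

```python
from dataclasses import dataclass, fields
from typing import Optional, Union


@dataclass
class GuideSpec:
    """Hypothetical stand-in mirroring the four documented fields."""
    json: Optional[Union[str, dict]] = None
    regex: Optional[str] = None
    grammar: Optional[str] = None
    json_object: bool = False

    @property
    def num_guides(self) -> int:
        # Count the fields that carry a truthy value.
        return sum(bool(getattr(self, f.name)) for f in fields(self))

    def validate(self) -> None:
        """Reject specs that set more than one guide."""
        if self.num_guides > 1:
            raise ValueError(
                "set at most one of json, regex, grammar, json_object")


spec = GuideSpec(regex=r"\d{4}-\d{2}-\d{2}")
spec.validate()
print(spec.num_guides)  # 1
```

With the real class, the equivalent object would be passed through the `guided_decoding` field of `SamplingParams`.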
- class tensorrt_llm.llmapi.SamplingParams(*, end_id: int | None = None, pad_id: int | None = None, max_tokens: int = 32, max_new_tokens: int | None = None, bad: List[str] | str | None = None, bad_token_ids: List[int] | None = None, stop: List[str] | str | None = None, stop_token_ids: List[int] | None = None, include_stop_str_in_output: bool = False, embedding_bias: Tensor | None = None, external_draft_tokens_config: ExternalDraftTokensConfig | None = None, logits_post_processor_name: str | None = None, n: int = 1, best_of: int | None = None, use_beam_search: bool = False, beam_width: int = 1, num_return_sequences: int | None = None, top_k: int | None = None, top_p: float | None = None, top_p_min: float | None = None, top_p_reset_ids: int | None = None, top_p_decay: float | None = None, seed: int | None = None, random_seed: int | None = None, temperature: float | None = None, min_tokens: int | None = None, min_length: int | None = None, beam_search_diversity_rate: float | None = None, repetition_penalty: float | None = None, presence_penalty: float | None = None, frequency_penalty: float | None = None, length_penalty: float | None = None, early_stopping: int | None = None, no_repeat_ngram_size: int | None = None, return_log_probs: bool = False, return_context_logits: bool = False, return_generation_logits: bool = False, exclude_input_from_output: bool = True, return_encoder_output: bool = False, return_perf_metrics: bool = False, lookahead_config: LookaheadDecodingConfig | None = None, guided_decoding: GuidedDecodingParams | None = None, ignore_eos: bool = False, detokenize: bool = True, add_special_tokens: bool = True, truncate_prompt_tokens: int | None = None, skip_special_tokens: bool = True, spaces_between_special_tokens: bool = True)[source]
Bases:
object
Sampling parameters for text generation.
- Parameters:
end_id (int, optional) – The end token id. Defaults to None.
pad_id (int, optional) – The pad token id. Defaults to None.
max_tokens (int) – The maximum number of tokens to generate. Defaults to 32.
max_new_tokens (int, optional) – The maximum number of tokens to generate. This argument is being deprecated; please use max_tokens instead. Defaults to None.
bad (str, List[str], optional) – A string or a list of strings that redirect the generation when they are generated, so that the bad strings are excluded from the returned output. Defaults to None.
bad_token_ids (List[int], optional) – A list of token ids that redirect the generation when they are generated, so that the bad ids are excluded from the returned output. Defaults to None.
stop (str, List[str], optional) – A string or a list of strings that stop the generation when they are generated. The returned output will not contain the stop strings unless include_stop_str_in_output is True. Defaults to None.
stop_token_ids (List[int], optional) – A list of token ids that stop the generation when they are generated. Defaults to None.
include_stop_str_in_output (bool) – Whether to include the stop strings in output text. Defaults to False.
embedding_bias (torch.Tensor, optional) – The embedding bias tensor. Expected type is kFP32 and shape is [vocab_size]. Defaults to None.
external_draft_tokens_config (ExternalDraftTokensConfig, optional) – The speculative decoding configuration. Defaults to None.
logits_post_processor_name (str, optional) – The logits postprocessor name. Must correspond to one of the logits postprocessor names provided to the ExecutorConfig. Defaults to None.
n (int) – Number of sequences to generate. Defaults to 1.
best_of (int, optional) – Number of sequences to consider for best output. Defaults to None.
use_beam_search (bool) – Whether to use beam search. Defaults to False.
beam_width (int) – The beam width. Setting 1 disables beam search. This parameter will be deprecated from the LLM API in a future release. Please use n/best_of/use_beam_search instead. Defaults to 1.
num_return_sequences (int, optional) – The number of sequences to return. If set to None, it defaults to the value of beam_width. This parameter will be deprecated from the LLM API in a future release. Please use n/best_of/use_beam_search instead. Defaults to None.
top_k (int) – Controls the number of logits to sample from. Default is 0 (all logits).
top_p (float) – Controls the top-P probability to sample from. Default is 0.0.
top_p_min (float) – Controls decay in the top-P algorithm. top_p_min is the lower bound. Default is 1e-6.
top_p_reset_ids (int) – Controls decay in the top-P algorithm. Indicates where to reset the decay. Default is 1.
top_p_decay (float) – Controls decay in the top-P algorithm. The decay value. Default is 1.0.
seed (int) – Controls the random seed used by the random number generator in sampling.
random_seed (int) – Controls the random seed used by the random number generator in sampling. This argument is being deprecated; please use seed instead.
temperature (float) – Controls the modulation of logits when sampling new tokens. It can have values > 0.0. Default is 1.0.
min_tokens (int) – Lower bound on the number of tokens to generate. Values < 1 have no effect. Default is 1.
min_length (int) – Lower bound on the number of tokens to generate. Values < 1 have no effect. Default is 1. This argument is being deprecated; please use min_tokens instead.
beam_search_diversity_rate (float) – Controls the diversity in beam search.
repetition_penalty (float) – Used to penalize tokens based on how often they appear in the sequence. It can have any value > 0.0. Values < 1.0 encourage repetition, values > 1.0 discourage it. Default is 1.0.
presence_penalty (float) – Used to penalize tokens already present in the sequence (irrespective of the number of appearances). It can have any value. Values < 0.0 encourage repetition, values > 0.0 discourage it. Default is 0.0.
frequency_penalty (float) – Used to penalize tokens already present in the sequence (dependent on the number of appearances). It can have any value. Values < 0.0 encourage repetition, values > 0.0 discourage it. Default is 0.0.
length_penalty (float) – Controls how to penalize longer sequences in beam search. Default is 0.0.
early_stopping (int) – Controls whether the generation process finishes once beam_width sentences are generated (each ending with end_token).
no_repeat_ngram_size (int) – Controls how many repeated n-gram sizes are acceptable. Default is 1 << 30.
return_log_probs (bool) – Controls if Result should contain log probabilities. Default is False.
return_context_logits (bool) – Controls if Result should contain the context logits. Default is False.
return_generation_logits (bool) – Controls if Result should contain the generation logits. Default is False.
exclude_input_from_output (bool) – Controls if output tokens in Result should include the input tokens. Default is True.
return_encoder_output (bool) – Controls if Result should contain encoder output hidden states (for encoder-only and encoder-decoder models). Default is False.
return_perf_metrics (bool) – Controls if Result should contain the performance metrics for this request. Default is False.
lookahead_config (LookaheadDecodingConfig, optional) – Lookahead decoding config. Defaults to None.
guided_decoding (GuidedDecodingParams, optional) – Guided decoding params. Defaults to None.
ignore_eos (bool) – Whether to ignore the EOS token and continue generating tokens after the EOS token is generated. Defaults to False.
detokenize (bool) – Whether to detokenize the output. Defaults to True.
add_special_tokens (bool) – Whether to add special tokens to the prompt. Defaults to True.
truncate_prompt_tokens (int, optional) – If set to an integer k, will use only the last k tokens from the prompt (i.e., left truncation). Defaults to None.
skip_special_tokens (bool) – Whether to skip special tokens in the output. Defaults to True.
spaces_between_special_tokens (bool) – Whether to add spaces between special tokens in the output. Defaults to True.
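The three penalty parameters described above can be sketched on a plain list of logits. This uses the commonly cited formulations (multiplicative repetition penalty, additive presence and frequency penalties); the library's kernels may differ in detail, and `apply_penalties` is a hypothetical helper.

```python
from collections import Counter
from typing import List, Sequence


def apply_penalties(logits: Sequence[float], generated: Sequence[int],
                    repetition_penalty: float = 1.0,
                    presence_penalty: float = 0.0,
                    frequency_penalty: float = 0.0) -> List[float]:
    """Penalize logits of tokens that already appear in `generated`."""
    if repetition_penalty <= 0:
        raise ValueError("repetition_penalty must be > 0")
    counts = Counter(generated)
    out = list(logits)
    for token, count in counts.items():
        # Multiplicative penalty: shrink positive logits, push negative ones lower.
        if out[token] > 0:
            out[token] /= repetition_penalty
        else:
            out[token] *= repetition_penalty
        # Additive penalties: flat for presence, scaled by count for frequency.
        out[token] -= presence_penalty
        out[token] -= frequency_penalty * count
    return out


print(apply_penalties([2.0, -1.0, 0.5], generated=[0, 0, 1],
                      repetition_penalty=2.0, presence_penalty=0.5,
                      frequency_penalty=0.25))
# [0.0, -2.75, 0.5]
```

Token 2 never appears in `generated`, so its logit is untouched; with the documented defaults (1.0, 0.0, 0.0) the function returns the logits unchanged.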
- __init__(*, end_id: int | None = None, pad_id: int | None = None, max_tokens: int = 32, max_new_tokens: int | None = None, bad: List[str] | str | None = None, bad_token_ids: List[int] | None = None, stop: List[str] | str | None = None, stop_token_ids: List[int] | None = None, include_stop_str_in_output: bool = False, embedding_bias: Tensor | None = None, external_draft_tokens_config: ExternalDraftTokensConfig | None = None, logits_post_processor_name: str | None = None, n: int = 1, best_of: int | None = None, use_beam_search: bool = False, beam_width: int = 1, num_return_sequences: int | None = None, top_k: int | None = None, top_p: float | None = None, top_p_min: float | None = None, top_p_reset_ids: int | None = None, top_p_decay: float | None = None, seed: int | None = None, random_seed: int | None = None, temperature: float | None = None, min_tokens: int | None = None, min_length: int | None = None, beam_search_diversity_rate: float | None = None, repetition_penalty: float | None = None, presence_penalty: float | None = None, frequency_penalty: float | None = None, length_penalty: float | None = None, early_stopping: int | None = None, no_repeat_ngram_size: int | None = None, return_log_probs: bool = False, return_context_logits: bool = False, return_generation_logits: bool = False, exclude_input_from_output: bool = True, return_encoder_output: bool = False, return_perf_metrics: bool = False, lookahead_config: LookaheadDecodingConfig | None = None, guided_decoding: GuidedDecodingParams | None = None, ignore_eos: bool = False, detokenize: bool = True, add_special_tokens: bool = True, truncate_prompt_tokens: int | None = None, skip_special_tokens: bool = True, spaces_between_special_tokens: bool = True) None
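The effect of `truncate_prompt_tokens`, `stop`, and `include_stop_str_in_output` can be illustrated on plain lists and strings. This is a text-level sketch of the documented outcome only; the engine decides when to stop during decoding, and `truncate_prompt` and `cut_at_stop` are hypothetical helpers.

```python
from typing import List, Optional, Sequence


def truncate_prompt(token_ids: Sequence[int], k: Optional[int]) -> List[int]:
    """Keep only the last k prompt tokens (left truncation)."""
    if k is None:
        return list(token_ids)
    if k < 1:
        # Restriction chosen by this sketch; also avoids the [-0:] slicing pitfall.
        raise ValueError("k must be >= 1")
    return list(token_ids[-k:])


def cut_at_stop(text: str, stop: Sequence[str],
                include_stop: bool = False) -> str:
    """Cut text at the earliest stop string, optionally keeping that string."""
    hits = [(text.find(s), s) for s in stop if s and s in text]
    if not hits:
        return text
    index, matched = min(hits)  # earliest occurrence wins
    return text[: index + len(matched)] if include_stop else text[:index]


print(truncate_prompt([1, 2, 3, 4, 5], 2))                     # [4, 5]
print(repr(cut_at_stop("Hello world. END more", ["END"])))     # 'Hello world. '
print(repr(cut_at_stop("Hello world. END more", ["END"], True)))
# 'Hello world. END'
```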
- add_special_tokens: bool
- bad: List[str] | str | None
- bad_token_ids: List[int] | None
- beam_search_diversity_rate: float | None
- beam_width: int
- best_of: int | None
- detokenize: bool
- early_stopping: int | None
- embedding_bias: Tensor | None
- end_id: int | None
- exclude_input_from_output: bool
- external_draft_tokens_config: ExternalDraftTokensConfig | None
- frequency_penalty: float | None
- property greedy_decoding: bool
- guided_decoding: GuidedDecodingParams | None
- ignore_eos: bool
- include_stop_str_in_output: bool
- length_penalty: float | None
- logits_post_processor_name: str | None
- lookahead_config: LookaheadDecodingConfig | None
- max_new_tokens: int | None
- max_tokens: int
- min_length: int | None
- min_tokens: int | None
- n: int
- no_repeat_ngram_size: int | None
- num_return_sequences: int | None
- pad_id: int | None
- presence_penalty: float | None
- random_seed: int | None
- repetition_penalty: float | None
- return_context_logits: bool
- return_encoder_output: bool
- return_generation_logits: bool
- return_log_probs: bool
- return_perf_metrics: bool
- seed: int | None
- setup(tokenizer, add_special_tokens: bool = False) SamplingParams [source]
- skip_special_tokens: bool
- spaces_between_special_tokens: bool
- stop: List[str] | str | None
- stop_token_ids: List[int] | None
- temperature: float | None
- top_k: int | None
- top_p: float | None
- top_p_decay: float | None
- top_p_min: float | None
- top_p_reset_ids: int | None
- truncate_prompt_tokens: int | None
- use_beam_search: bool
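For `top_k` and `top_p`, a standard top-k plus nucleus filter is sketched below. It follows the documented convention that `top_k=0` means all logits are considered. Treating `top_p` as a cumulative-probability threshold in (0, 1] is a choice of this sketch, and `nucleus_candidates` is a hypothetical helper that returns candidate indices instead of sampling.

```python
from typing import List, Sequence


def nucleus_candidates(probs: Sequence[float], top_k: int = 0,
                       top_p: float = 1.0) -> List[int]:
    """Return token indices that survive top-k, then top-p filtering."""
    if top_k < 0:
        raise ValueError("top_k must be >= 0")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("this sketch expects 0 < top_p <= 1")
    # Most probable tokens first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept: List[int] = []
    cumulative = 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break  # smallest prefix whose mass reaches top_p
    return kept


probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus_candidates(probs, top_k=3, top_p=0.7))  # [0, 1]
print(nucleus_candidates(probs, top_k=1))             # [0]
```

A sampler would then renormalize the probabilities of the kept indices and draw from them, using `seed` for reproducibility.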
- class tensorrt_llm.llmapi.KvCacheConfig
Bases:
pybind11_object
- __init__(self: tensorrt_llm.bindings.executor.KvCacheConfig, enable_block_reuse: bool = False, max_tokens: int | None = None, max_attention_window: list[int] | None = None, sink_token_length: int | None = None, free_gpu_memory_fraction: float | None = None, host_cache_size: int | None = None, onboard_blocks: bool = True, cross_kv_cache_fraction: float | None = None, secondary_offload_min_priority: int | None = None, event_buffer_max_size: int = 0, *, runtime_defaults: tensorrt_llm.bindings.executor.RuntimeDefaults | None = None) None
- property cross_kv_cache_fraction
- property enable_block_reuse
- property event_buffer_max_size
- fill_empty_fields_from_runtime_defaults(self: tensorrt_llm.bindings.executor.KvCacheConfig, arg0: tensorrt_llm.bindings.executor.RuntimeDefaults) None
- property free_gpu_memory_fraction
- property host_cache_size
- property max_attention_window
- property max_tokens
- property onboard_blocks
- property secondary_offload_min_priority
- property sink_token_length
- class tensorrt_llm.llmapi.LookaheadDecodingConfig
Bases:
pybind11_object
- __init__(self: tensorrt_llm.bindings.executor.LookaheadDecodingConfig, max_window_size: int, max_ngram_size: int, max_verification_set_size: int) None
- calculate_speculative_resource(self: tensorrt_llm.bindings.executor.LookaheadDecodingConfig) tuple[int, int, int, int]
- property max_ngram_size
- property max_verification_set_size
- property max_window_size
- class tensorrt_llm.llmapi.MedusaDecodingConfig(medusa_choices: List[List[int]] | None = None, num_medusa_heads: int | None = None)[source]
Bases:
object
- __init__(medusa_choices: List[List[int]] | None = None, num_medusa_heads: int | None = None) None
- medusa_choices: List[List[int]] | None = None
- num_medusa_heads: int | None = None
- class tensorrt_llm.llmapi.SchedulerConfig
Bases:
pybind11_object
- __init__(self: tensorrt_llm.bindings.executor.SchedulerConfig, capacity_scheduler_policy: tensorrt_llm.bindings.executor.CapacitySchedulerPolicy = CapacitySchedulerPolicy.GUARANTEED_NO_EVICT, context_chunking_policy: tensorrt_llm.bindings.executor.ContextChunkingPolicy | None = None, dynamic_batch_config: tensorrt_llm.bindings.executor.DynamicBatchConfig | None = None) None
- property capacity_scheduler_policy
- property context_chunking_policy
- property dynamic_batch_config
- class tensorrt_llm.llmapi.CapacitySchedulerPolicy
Bases:
pybind11_object
Members:
MAX_UTILIZATION
GUARANTEED_NO_EVICT
STATIC_BATCH
- GUARANTEED_NO_EVICT = <CapacitySchedulerPolicy.GUARANTEED_NO_EVICT: 1>
- MAX_UTILIZATION = <CapacitySchedulerPolicy.MAX_UTILIZATION: 0>
- STATIC_BATCH = <CapacitySchedulerPolicy.STATIC_BATCH: 2>
- __init__(self: tensorrt_llm.bindings.executor.CapacitySchedulerPolicy, value: int) None
- property name
- property value
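When a policy arrives as a string from a configuration file, it can be mapped to a member by name. The sketch uses a stand-in `IntEnum` with the documented member values so that it runs without the compiled bindings; `CapacityPolicySketch` and `parse_policy` are hypothetical names.

```python
from enum import IntEnum


class CapacityPolicySketch(IntEnum):
    """Stand-in with the documented member values (not the pybind11 class)."""
    MAX_UTILIZATION = 0
    GUARANTEED_NO_EVICT = 1
    STATIC_BATCH = 2


def parse_policy(name: str) -> CapacityPolicySketch:
    """Map a config string such as 'max_utilization' to a member."""
    try:
        return CapacityPolicySketch[name.strip().upper()]
    except KeyError:
        raise ValueError(
            f"unknown capacity scheduler policy: {name!r}") from None


policy = parse_policy("max_utilization")
print(policy.name, policy.value)  # MAX_UTILIZATION 0
```

With the real bindings, the selected member would be passed as `capacity_scheduler_policy` to `SchedulerConfig`, as shown in its signature.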
- class tensorrt_llm.llmapi.BuildConfig(max_input_len: int = 1024, max_seq_len: int = None, opt_batch_size: int = 8, max_batch_size: int = 2048, max_beam_width: int = 1, max_num_tokens: int = 8192, opt_num_tokens: Optional[int] = None, max_prompt_embedding_table_size: int = 0, kv_cache_type: tensorrt_llm.bindings.KVCacheType = None, gather_context_logits: int = False, gather_generation_logits: int = False, strongly_typed: bool = True, force_num_profiles: Optional[int] = None, profiling_verbosity: str = 'layer_names_only', enable_debug_output: bool = False, max_draft_len: int = 0, speculative_decoding_mode: tensorrt_llm.models.modeling_utils.SpeculativeDecodingMode = <SpeculativeDecodingMode.NONE: 1>, use_refit: bool = False, input_timing_cache: str = None, output_timing_cache: str = 'model.cache', lora_config: tensorrt_llm.lora_manager.LoraConfig = <factory>, auto_parallel_config: tensorrt_llm.auto_parallel.config.AutoParallelConfig = <factory>, weight_sparsity: bool = False, weight_streaming: bool = False, plugin_config: tensorrt_llm.plugin.plugin.PluginConfig = <factory>, use_strip_plan: bool = False, max_encoder_input_len: int = 1024, use_fused_mlp: bool = True, dry_run: bool = False, visualize_network: bool = False, monitor_memory: bool = False, use_mrope: bool = False)[source]
Bases:
object
- __init__(max_input_len: int = 1024, max_seq_len: int | None = None, opt_batch_size: int = 8, max_batch_size: int = 2048, max_beam_width: int = 1, max_num_tokens: int = 8192, opt_num_tokens: int | None = None, max_prompt_embedding_table_size: int = 0, kv_cache_type: ~tensorrt_llm.bindings.KVCacheType | None = None, gather_context_logits: int = False, gather_generation_logits: int = False, strongly_typed: bool = True, force_num_profiles: int | None = None, profiling_verbosity: str = 'layer_names_only', enable_debug_output: bool = False, max_draft_len: int = 0, speculative_decoding_mode: ~tensorrt_llm.models.modeling_utils.SpeculativeDecodingMode = <SpeculativeDecodingMode.NONE: 1>, use_refit: bool = False, input_timing_cache: str | None = None, output_timing_cache: str = 'model.cache', lora_config: ~tensorrt_llm.lora_manager.LoraConfig = <factory>, auto_parallel_config: ~tensorrt_llm.auto_parallel.config.AutoParallelConfig = <factory>, weight_sparsity: bool = False, weight_streaming: bool = False, plugin_config: ~tensorrt_llm.plugin.plugin.PluginConfig = <factory>, use_strip_plan: bool = False, max_encoder_input_len: int = 1024, use_fused_mlp: bool = True, dry_run: bool = False, visualize_network: bool = False, monitor_memory: bool = False, use_mrope: bool = False) None
- auto_parallel_config: AutoParallelConfig
- dry_run: bool = False
- enable_debug_output: bool = False
- force_num_profiles: int | None = None
- gather_context_logits: int = False
- gather_generation_logits: int = False
- input_timing_cache: str = None
- kv_cache_type: KVCacheType = None
- lora_config: LoraConfig
- max_batch_size: int = 2048
- max_beam_width: int = 1
- max_draft_len: int = 0
- max_encoder_input_len: int = 1024
- max_input_len: int = 1024
- max_num_tokens: int = 8192
- max_prompt_embedding_table_size: int = 0
- max_seq_len: int = None
- monitor_memory: bool = False
- opt_batch_size: int = 8
- opt_num_tokens: int | None = None
- output_timing_cache: str = 'model.cache'
- plugin_config: PluginConfig
- profiling_verbosity: str = 'layer_names_only'
- speculative_decoding_mode: SpeculativeDecodingMode = 1
- strongly_typed: bool = True
- use_fused_mlp: bool = True
- use_mrope: bool = False
- use_refit: bool = False
- use_strip_plan: bool = False
- visualize_network: bool = False
- weight_sparsity: bool = False
- weight_streaming: bool = False
- class tensorrt_llm.llmapi.QuantConfig(quant_algo: QuantAlgo | None = None, kv_cache_quant_algo: QuantAlgo | None = None, group_size: int | None = 128, smoothquant_val: float = 0.5, clamp_val: List[float] | None = None, use_meta_recipe: bool = False, has_zero_point: bool | None = False, pre_quant_scale: bool | None = False, exclude_modules: List[str] | None = None)[source]
Bases:
object
Serializable quantization configuration class, part of the PretrainedConfig
- __init__(quant_algo: QuantAlgo | None = None, kv_cache_quant_algo: QuantAlgo | None = None, group_size: int | None = 128, smoothquant_val: float = 0.5, clamp_val: List[float] | None = None, use_meta_recipe: bool = False, has_zero_point: bool | None = False, pre_quant_scale: bool | None = False, exclude_modules: List[str] | None = None) None
- clamp_val: List[float] | None = None
- exclude_modules: List[str] | None = None
- group_size: int | None = 128
- has_zero_point: bool | None = False
- pre_quant_scale: bool | None = False
- property quant_mode: QuantModeWrapper
- property requires_calibration
- property requires_modelopt_quantization
- smoothquant_val: float = 0.5
- use_meta_recipe: bool = False
- property use_plugin_sq
- class tensorrt_llm.llmapi.QuantAlgo(value)[source]
Bases:
StrEnum
An enumeration.
- FP8 = 'FP8'
- FP8_PER_CHANNEL_PER_TOKEN = 'FP8_PER_CHANNEL_PER_TOKEN'
- INT8 = 'INT8'
- MIXED_PRECISION = 'MIXED_PRECISION'
- NO_QUANT = 'NO_QUANT'
- W4A16 = 'W4A16'
- W4A16_AWQ = 'W4A16_AWQ'
- W4A16_GPTQ = 'W4A16_GPTQ'
- W4A8_AWQ = 'W4A8_AWQ'
- W4A8_QSERVE_PER_CHANNEL = 'W4A8_QSERVE_PER_CHANNEL'
- W4A8_QSERVE_PER_GROUP = 'W4A8_QSERVE_PER_GROUP'
- W8A16 = 'W8A16'
- W8A16_GPTQ = 'W8A16_GPTQ'
- W8A8_SQ_PER_CHANNEL = 'W8A8_SQ_PER_CHANNEL'
- W8A8_SQ_PER_CHANNEL_PER_TENSOR_PLUGIN = 'W8A8_SQ_PER_CHANNEL_PER_TENSOR_PLUGIN'
- W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN = 'W8A8_SQ_PER_CHANNEL_PER_TOKEN_PLUGIN'
- W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN = 'W8A8_SQ_PER_TENSOR_PER_TOKEN_PLUGIN'
- W8A8_SQ_PER_TENSOR_PLUGIN = 'W8A8_SQ_PER_TENSOR_PLUGIN'
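A sketch of preparing quantization settings from plain strings. `quant_kwargs` and `make_quant_config` are hypothetical helpers, `KNOWN_ALGOS` is only a subset of the members listed above, and whether a given weight and KV-cache combination is supported on a particular GPU is not covered here.

```python
from typing import Any, Dict, Optional

# A subset of the member names listed above.
KNOWN_ALGOS = {"FP8", "INT8", "W4A16", "W4A16_AWQ", "W8A16", "NO_QUANT"}


def quant_kwargs(quant_algo: str,
                 kv_cache_quant_algo: Optional[str] = None
                 ) -> Dict[str, Optional[str]]:
    """Validate algorithm names and return QuantConfig keyword arguments."""
    for name in (quant_algo, kv_cache_quant_algo):
        if name is not None and name not in KNOWN_ALGOS:
            raise ValueError(f"unknown quantization algorithm: {name!r}")
    return {"quant_algo": quant_algo,
            "kv_cache_quant_algo": kv_cache_quant_algo}


def make_quant_config(**kwargs: Optional[str]) -> Any:
    """Build a QuantConfig. Needs TensorRT-LLM at call time."""
    from tensorrt_llm.llmapi import QuantAlgo, QuantConfig  # lazy import
    # QuantAlgo is a string enum, so members can be looked up by value.
    algos = {key: (QuantAlgo(value) if value is not None else None)
             for key, value in kwargs.items()}
    return QuantConfig(**algos)


settings = quant_kwargs("FP8", kv_cache_quant_algo="FP8")
print(settings)
# quant_config = make_quant_config(**settings)  # where TensorRT-LLM is available
```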
- class tensorrt_llm.llmapi.CalibConfig(device: Literal['cuda', 'cpu'] = 'cuda', calib_dataset: str = 'cnn_dailymail', calib_batches: int = 512, calib_batch_size: int = 1, calib_max_seq_length: int = 512, random_seed: int = 1234, tokenizer_max_seq_length: int = 2048)[source]
Bases:
object
Calibration configuration.
- Parameters:
device (Literal['cuda', 'cpu'], default='cuda') – The device to run calibration.
calib_dataset (str, default='cnn_dailymail') – The name or local path of calibration dataset.
calib_batches (int, default=512) – The number of batches that the calibration runs.
calib_batch_size (int, default=1) – The batch size that the calibration runs.
calib_max_seq_length (int, default=512) – The maximum sequence length that the calibration runs.
random_seed (int, default=1234) – The random seed used for calibration.
tokenizer_max_seq_length (int, default=2048) – The maximum sequence length to initialize tokenizer for calibration.
- __init__(device: Literal['cuda', 'cpu'] = 'cuda', calib_dataset: str = 'cnn_dailymail', calib_batches: int = 512, calib_batch_size: int = 1, calib_max_seq_length: int = 512, random_seed: int = 1234, tokenizer_max_seq_length: int = 2048) None
- calib_batch_size: int
- calib_batches: int
- calib_dataset: str
- calib_max_seq_length: int
- device: Literal['cuda', 'cpu']
- random_seed: int
- tokenizer_max_seq_length: int
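The calibration settings multiply out to a sample count and an upper bound on tokens processed. The arithmetic below uses the documented defaults; `calibration_budget` is a hypothetical helper, and the token figure is an upper bound because sequences may be shorter than `calib_max_seq_length`.

```python
from typing import Dict


def calibration_budget(calib_batches: int = 512, calib_batch_size: int = 1,
                       calib_max_seq_length: int = 512) -> Dict[str, int]:
    """Return the number of samples and the maximum tokens seen in calibration."""
    samples = calib_batches * calib_batch_size
    return {"samples": samples,
            "max_tokens": samples * calib_max_seq_length}


print(calibration_budget())
# {'samples': 512, 'max_tokens': 262144}
```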
- class tensorrt_llm.llmapi.BuildCacheConfig(cache_root: Path | None = None, max_records: int = 10, max_cache_storage_gb: float = 256)[source]
Bases:
object
Configuration for the build cache.
- cache_root
The root directory for the build cache.
- Type:
Path
- max_records
The maximum number of records to store in the cache.
- Type:
int
- max_cache_storage_gb
The maximum amount of storage (in GB) to use for the cache.
- Type:
float
Note
The build cache assumes that the model weights do not change during execution. If the weights change, remove the cached entries manually.
- __init__(cache_root: Path | None = None, max_records: int = 10, max_cache_storage_gb: float = 256)[source]
- property cache_root: Path
- property max_cache_storage_gb: float
- property max_records: int
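To make the two limits concrete, here is one plausible pruning policy: keep the most recently used records that fit within both `max_records` and `max_cache_storage_gb`. The eviction order is an assumption of this sketch, not documented behavior, and `Record` and `prune_cache` are hypothetical names.

```python
from typing import List, Tuple

Record = Tuple[str, float, float]  # (name, size_gb, last_used_timestamp)


def prune_cache(records: List[Record], max_records: int = 10,
                max_cache_storage_gb: float = 256) -> List[Record]:
    """Keep the most recently used records that fit both limits."""
    kept: List[Record] = []
    used_gb = 0.0
    for record in sorted(records, key=lambda r: r[2], reverse=True):
        if len(kept) >= max_records:
            break
        if used_gb + record[1] > max_cache_storage_gb:
            continue  # too large for the remaining budget; try older, smaller ones
        kept.append(record)
        used_gb += record[1]
    return kept


records = [("a", 100.0, 1.0), ("b", 100.0, 2.0), ("c", 100.0, 3.0)]
print([name for name, _, _ in prune_cache(records)])                 # ['c', 'b']
print([name for name, _, _ in prune_cache(records, max_records=1)])  # ['c']
```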