Useful Runtime Options
This part summarizes the runtime configuration knobs that can be tweaked to enhance the performance of already built engines. Unlike the previous examples, where the LLM-API was used to build and save an engine without processing any requests, runtime knobs are specified when you use the LLM-API to actually run inference, as in the LLM-API end-to-end example.
Capacity Scheduler Policy
TensorRT-LLM currently supports three batch scheduler policies: `GUARANTEED_NO_EVICT` (default), `MAX_UTILIZATION`, and `STATIC_BATCH`.
The scheduling policy can be set to `MAX_UTILIZATION` to pack as many requests as possible at each iteration of the forward loop when in-flight sequence batching is enabled. It maximizes the utilization of the GPUs by aggressively scheduling requests, at the risk of having to pause requests if the KV cache size limit is reached.
For a more conservative approach with respect to the KV cache limitations in terms of memory allocation, `CapacitySchedulerPolicy` should be set to `GUARANTEED_NO_EVICT` to guarantee that a started request is never paused.
If the goal is to maximize throughput, users should try `MAX_UTILIZATION`. However, they need to keep in mind that it may have a negative impact on latency if requests have to be paused.
`STATIC_BATCH` is a legacy mode and is not recommended for production usage.
To switch the capacity scheduler policy from the default of `GUARANTEED_NO_EVICT` to `MAX_UTILIZATION`, you would modify the LLM-API end-to-end example to be:
```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.bindings.executor import SchedulerConfig, CapacitySchedulerPolicy


def main():
    prompts = [
        "Hello, I am",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    scheduler_config = SchedulerConfig(
        capacity_scheduler_policy=CapacitySchedulerPolicy.MAX_UTILIZATION
    )

    llm = LLM(
        model="meta-llama/Llama-3.3-70B-Instruct",
        tensor_parallel_size=4,
        scheduler_config=scheduler_config,
    )

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == '__main__':
    main()
```
Context Chunking Policy
As discussed previously, context chunking increases the chance of batching context-phase and generation-phase work together, thereby balancing the amount of computation in each iteration and typically increasing throughput.
TensorRT-LLM currently supports two context chunking policies: `FIRST_COME_FIRST_SERVED` (default), which prioritizes scheduling all the context chunks of the request that arrived first, and `EQUAL_PROGRESS`, which schedules a context chunk from every request before scheduling the next chunk of any request.
`FIRST_COME_FIRST_SERVED` should achieve better overall performance, while `EQUAL_PROGRESS` can be helpful, in theory, to make sure that the time to first token (TTFT) of most requests is relatively similar.
To switch the context chunking policy from the default of `FIRST_COME_FIRST_SERVED` to `EQUAL_PROGRESS`, you would modify the LLM-API end-to-end example to be:
```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.bindings.executor import SchedulerConfig, ContextChunkingPolicy


def main():
    prompts = [
        "Hello, I am",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    scheduler_config = SchedulerConfig(
        context_chunking_policy=ContextChunkingPolicy.EQUAL_PROGRESS
    )

    llm = LLM(
        model="meta-llama/Llama-3.3-70B-Instruct",
        tensor_parallel_size=4,
        scheduler_config=scheduler_config,
    )

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == '__main__':
    main()
```
Max Tokens in Paged KV Cache and KV Cache Free GPU Memory Fraction
The `max_tokens_in_paged_kv_cache` and `kv_cache_free_gpu_mem_fraction` parameters can be used to control the maximum number of tokens handled by the KV cache manager. Setting them properly helps better control the amount of memory available to the KV cache manager during inference. Keep in mind that increasing the amount of memory available to the KV cache manager tends to translate into a higher achievable throughput.
The `max_tokens_in_paged_kv_cache` flag directly sets the maximum number of tokens in the KV cache manager. When left unset, that value is computed based on the `kv_cache_free_gpu_mem_fraction` setting.
The `kv_cache_free_gpu_mem_fraction` is a floating-point number between `0.0` and `1.0` that indicates the maximum fraction of GPU memory (after loading the model) that will be used for the KV cache. The default value is `0.90`, which means that 90% of the free GPU memory will be used to store tokens in the KV cache. Based on that value, TensorRT-LLM can determine the maximum number of tokens in the KV cache manager.
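As a rough illustration of how that derivation works, the sketch below estimates a token budget from the free-memory fraction. The exact accounting inside TensorRT-LLM differs (paging granularity, per-GPU sharding under tensor parallelism, and other overheads), and the model-shape numbers are assumptions for a 70B-class model, not values queried from the engine.

```python
# Back-of-the-envelope estimate of the KV cache token budget implied by
# kv_cache_free_gpu_mem_fraction. Illustrative only: the real computation in
# TensorRT-LLM accounts for paging granularity, tensor parallelism, and other
# overheads. The model-shape values below are assumptions.
num_layers = 80        # transformer layers (assumed for a 70B-class model)
num_kv_heads = 8       # grouped-query attention KV heads (assumed)
head_dim = 128         # per-head dimension (assumed)
dtype_bytes = 2        # FP16/BF16 KV cache entries

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

free_gpu_mem_bytes = 40 * 1024**3   # free memory after loading the model (assumed)
kv_cache_free_gpu_mem_fraction = 0.90

kv_cache_bytes = free_gpu_mem_bytes * kv_cache_free_gpu_mem_fraction
max_tokens = int(kv_cache_bytes // bytes_per_token)
print(f"Approximate KV cache capacity: {max_tokens} tokens")
```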
When both parameters are set, the maximum number of tokens in the KV cache manager will be set to the smaller of `max_tokens_in_paged_kv_cache` and the value computed from the amount of memory available for the KV cache. Unless users clearly know the maximum number of tokens in the KV cache needed by the model, it is recommended to leave `max_tokens_in_paged_kv_cache` unset.
For `kv_cache_free_gpu_mem_fraction`, if no other programs are executed on the same GPU, it is recommended to test with a value as high as `0.95` to target a high throughput. Note that the `kv_cache_free_gpu_mem_fraction` parameter cannot be set to `1.0` because some amount of memory has to be reserved for inputs and outputs.
To set `kv_cache_free_gpu_mem_fraction`, you would modify the LLM-API end-to-end example to be:
```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.bindings.executor import KvCacheConfig


def main():
    prompts = [
        "Hello, I am",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.95)

    llm = LLM(
        model="meta-llama/Llama-3.3-70B-Instruct",
        tensor_parallel_size=8,
        kv_cache_config=kv_cache_config,
    )

    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == '__main__':
    main()
```
If you wanted to set `max_tokens_in_paged_kv_cache` instead, you would replace `free_gpu_memory_fraction` with `max_tokens` and specify the number of tokens.
```python
kv_cache_config = KvCacheConfig(max_tokens=<number of tokens>)
```
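If both knobs are provided, the effective capacity is the smaller of the two values, as described above. A minimal sketch (the numbers below are arbitrary placeholders, not tuned recommendations):

```python
from tensorrt_llm.bindings.executor import KvCacheConfig

# When both are set, the KV cache manager uses the smaller of max_tokens and
# the token count derived from free_gpu_memory_fraction.
# The values below are arbitrary placeholders.
kv_cache_config = KvCacheConfig(
    max_tokens=65536,
    free_gpu_memory_fraction=0.95,
)
```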
Maximum Attention Window Size
The `max_attention_window_size` flag sets the maximum number of tokens that are attended to in order to generate one token when techniques like sliding window attention are used. See this document for more details. It defaults to the maximum sequence length (`max_seq_len` when building the engine), which means that the feature is disabled by default.
When set to a smaller value than `max_seq_len` (set during engine build), only the KV cache of the last `max_attention_window_size` tokens will be stored. If the input sequence length at runtime exceeds the `max_attention_window_size` value, the accuracy may start dropping, but the runtime performance will be better (due to the reduction in computation and GPU memory allocation). Users can modify that value to increase runtime performance at the expense of reduced accuracy.
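As a toy illustration of what the window limit means (not TensorRT-LLM code): when generating the token at position `t`, only the most recent `max_attention_window_size` positions are attended to, so only their KV entries need to be kept.

```python
def attended_positions(t: int, max_attention_window_size: int) -> range:
    """Toy illustration (not TensorRT-LLM code): positions whose KV entries are
    needed to generate the token at position t under a sliding attention window."""
    start = max(0, t - max_attention_window_size)
    return range(start, t)


# With a window of 4, generating the token at position 10 only attends to
# positions 6..9, so the KV entries for positions 0..5 can be discarded.
print(list(attended_positions(10, 4)))  # [6, 7, 8, 9]
```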
Just like `kv_cache_free_gpu_mem_fraction`, `max_attention_window_size` can be specified in the LLM-API via `KvCacheConfig`. To specify `max_attention_window_size`, you would instantiate `KvCacheConfig` like so:
```python
kv_cache_config = KvCacheConfig(max_attention_window=<number of tokens>)
```
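Tying this together with the earlier examples, the resulting `KvCacheConfig` is passed to the LLM constructor in the same way as before. A sketch, where the window size of 4096 is an arbitrary example value (and, depending on the TensorRT-LLM version, this field may expect a per-layer list of window sizes rather than a single integer):

```python
from tensorrt_llm import LLM
from tensorrt_llm.bindings.executor import KvCacheConfig

# Sketch: reuse the end-to-end example, but cap the attention window.
# 4096 is an arbitrary example value, not a tuned recommendation.
kv_cache_config = KvCacheConfig(max_attention_window=4096)

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=8,
    kv_cache_config=kv_cache_config,
)
```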