Scheduler

The TensorRT-LLM PyTorch backend employs in-flight batching, a mechanism in which batching and scheduling happen dynamically at every LLM step. At each step, the scheduler is invoked to determine which requests are scheduled for execution.

Scheduler Introduction

There are two kinds of schedulers:

  • CapacityScheduler: This scheduler decides whether resources can be allocated for each active request, taking into account KV cache capacity and other resources where applicable. Its input is the list of all active requests that need processing. The primary output is fitting_requests, the requests for which resources are reserved at the current step; a secondary output, paused_requests, supports request pausing in the C++ runtime.

  • MicroBatchScheduler: This scheduler selects, from the fitting_requests chosen by CapacityScheduler, the requests to run at the current step. Its other input, inflight_request_ids, supports pipeline parallelism and overlapped execution in the C++ runtime; since the PyTorch flow does not support pipeline parallelism, inflight_request_ids is an empty set. The outputs are context_requests and generation_requests, the scheduled context-phase and generation-phase requests. Requests absent from both lists are not selected for the model forward pass. (A sketch of both interfaces follows this list.)
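
The division of labor can be summarized with the following sketch. It is illustrative only: the class and method names mirror the descriptions above, but the real definitions live in scheduler.py and their exact signatures may differ.

    # Illustrative sketch of the two scheduler roles; see scheduler.py
    # for the real definitions.
    from abc import ABC, abstractmethod

    class CapacityScheduler(ABC):

        @abstractmethod
        def schedule_request(self, active_requests):
            # Returns (fitting_requests, paused_requests).
            ...

    class MicroBatchScheduler(ABC):

        @abstractmethod
        def schedule(self, fitting_requests, inflight_request_ids):
            # Returns (context_requests, generation_requests).
            ...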

SimpleScheduler combines these two schedulers, applying CapacityScheduler first and MicroBatchScheduler second, to produce the final schedule. Its inputs are active_requests and inflight_request_ids; its outputs are context_requests, generation_requests, and paused_requests.
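
Conceptually, the combination looks like the following sketch (not the verbatim implementation; names follow the descriptions above):

    # Sketch of how SimpleScheduler chains the two scheduling stages.
    class SimpleScheduler(RequestScheduler):

        def __init__(self, capacity_scheduler, micro_batch_scheduler):
            super().__init__()
            self.capacity_scheduler = capacity_scheduler
            self.micro_batch_scheduler = micro_batch_scheduler

        def schedule_request(self, active_requests, inflight_request_ids):
            # Stage 1: reserve resources (e.g. KV cache blocks).
            fitting_requests, paused_requests = \
                self.capacity_scheduler.schedule_request(active_requests)
            # Stage 2: pick which fitting requests run at this step.
            context_requests, generation_requests = \
                self.micro_batch_scheduler.schedule(fitting_requests,
                                                    inflight_request_ids)
            return context_requests, generation_requests, paused_requests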

Customize Your Own Scheduler

To customize the scheduling or batching behavior, implement your own CapacityScheduler and MicroBatchScheduler by inheriting from those classes. If two-step scheduling is unnecessary, inherit from RequestScheduler instead and implement schedule_request directly.
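
For example, a minimal single-step scheduler that batches requests in plain FIFO order could look like the sketch below. Everything except RequestScheduler, schedule_request, and LlmRequestState is hypothetical, and a real scheduler also needs resource accounting like the example later in this section.

    # Hypothetical FIFO scheduler, for illustration only: no KV cache
    # accounting and no request pausing.
    class FifoScheduler(RequestScheduler):

        def __init__(self, max_batch_size: int):
            super().__init__()
            self.max_batch_size = max_batch_size

        def schedule_request(self, active_requests, inflight_request_ids):
            # inflight_request_ids is unused here; it is an empty set in
            # the PyTorch flow.
            context_requests = []
            generation_requests = []
            for request in active_requests:
                if len(context_requests) + len(generation_requests) \
                        >= self.max_batch_size:
                    break
                if request.state == LlmRequestState.CONTEXT_INIT:
                    context_requests.append(request)
                else:
                    generation_requests.append(request)
            # This toy scheduler never pauses requests.
            return context_requests, generation_requests, []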

An example CapacityScheduler implementation is the GuaranteedNoEvictScheduler class in scheduler.py. It dates from before the C++ binding of CapacityScheduler was available, when the PyTorch flow used a pure-Python scheduler. It inherits from CapacityScheduler and implements its own schedule_request method, which walks all active_requests and tries to schedule as many additional requests as can fit in the KV cache. Note that this resource estimation must stay consistent with the resource allocation and deallocation performed by kv_cache_manager.

Here is the code snippet:

class GuaranteedNoEvictScheduler(CapacityScheduler):
    # Only schedule requests whose state satisfies
    # no_schedule_until_state <= state < no_schedule_after_state.
    no_schedule_until_state = LlmRequestState.CONTEXT_INIT
    no_schedule_after_state = LlmRequestState.GENERATION_COMPLETE

    def __init__(self, max_num_requests: int, kv_cache_manager):
        super().__init__()
        self.max_num_requests = max_num_requests
        self.kv_cache_manager = kv_cache_manager

    def schedule_request(
        self, active_requests: RequestList
    ) -> tuple[list[LlmRequest], list[LlmRequest]]:
        scheduled_requests = []
        pending_requests = []
        reserved_blocks = 0
        max_blocks = self.kv_cache_manager.get_max_resource_count()
        for request in active_requests:
            req_state = request.state
            # Skip requests that cannot be scheduled yet or should no
            # longer be scheduled.
            if req_state.value < self.no_schedule_until_state.value \
                    or req_state.value >= self.no_schedule_after_state.value:
                continue

            if len(scheduled_requests) >= self.max_num_requests \
                    or reserved_blocks >= max_blocks:
                break
            elif req_state in (LlmRequestState.GENERATION_IN_PROGRESS,
                               LlmRequestState.GENERATION_TO_COMPLETE):
                # Generation requests are scheduled first; reserve enough
                # blocks for each to run to completion so it is never evicted.
                scheduled_requests.append(request)
                reserved_blocks += \
                    self.kv_cache_manager.get_needed_resource_to_completion(
                        request)
            else:
                pending_requests.append(request)

        available_blocks = max_blocks - reserved_blocks
        for request in pending_requests:
            req_state = request.state
            if len(scheduled_requests) >= self.max_num_requests:
                break
            elif req_state == LlmRequestState.CONTEXT_INIT:
                needed_blocks = \
                    self.kv_cache_manager.get_needed_resource_to_completion(
                        request)
                if needed_blocks <= available_blocks:
                    scheduled_requests.append(request)
                    available_blocks -= needed_blocks
                else:
                    # If one request fails to be scheduled, stop scheduling
                    # further context requests to preserve arrival order.
                    break

        assert len(scheduled_requests) > 0, (
            "no pending request can get enough resources to complete, "
            "please increase the KV cache pool size.")
        return scheduled_requests, []
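
Used in isolation, the scheduler can be exercised as in this hypothetical snippet, where max_num_requests, kv_cache_manager, and active_requests stand in for objects created during executor setup:

    # Hypothetical usage; the surrounding objects come from executor setup.
    scheduler = GuaranteedNoEvictScheduler(max_num_requests=32,
                                           kv_cache_manager=kv_cache_manager)
    fitting_requests, paused_requests = scheduler.schedule_request(
        active_requests)
    # paused_requests is always empty for this scheduler.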

After implementing your own scheduler, integrate it into the PyExecutor. For the PyTorch backend, the relevant code is in pytorch_model_registry.py: inside the create_pytorch_model_based_executor function, two lines create the CapacityScheduler:

    capacitor_scheduler = BindCapacityScheduler(max_num_requests,
                                                kv_cache_manager.impl)
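
To swap in the Python example from above, replace the binding with your own class, for instance (a sketch, mirroring the constructor shown earlier):

    # Sketch: substitute the custom Python scheduler for the C++ binding.
    capacitor_scheduler = GuaranteedNoEvictScheduler(max_num_requests,
                                                     kv_cache_manager)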

A similar adjustment can be made for MicroBatchScheduler. With these changes, the PyExecutor executes with your customized scheduling logic.