# Scheduler
The TensorRT-LLM PyTorch backend employs in-flight batching, a mechanism in which batching and scheduling happen dynamically at every LLM step. At each step, the scheduler is invoked to determine which active requests are executed.
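At each step, the executor asks the scheduler for a batch, runs one forward pass over it, and updates request state before the next step. The function below is a simplified, hypothetical sketch of that loop; the `scheduler`, `model_engine`, and `update_requests` names are illustrative placeholders rather than the actual `PyExecutor` API.

```python
def run_steps(scheduler, model_engine, update_requests, active_requests):
    """Hypothetical per-step loop illustrating in-flight batching (not the real PyExecutor)."""
    while active_requests:
        # Ask the scheduler which active requests should run at this step.
        context_requests, generation_requests, paused_requests = scheduler.schedule_request(
            active_requests, inflight_request_ids=set())

        # One forward pass over the scheduled context and generation requests.
        outputs = model_engine.forward(context_requests + generation_requests)

        # Append new tokens, retire finished requests, and free their KV cache resources.
        active_requests = update_requests(active_requests, outputs)
```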
## Scheduler Introduction
There are two kinds of schedulers:

`CapacityScheduler`
: This scheduler decides whether resources should be allocated to each active request, considering the KV cache capacity and other resources where applicable. The input to `CapacityScheduler` includes all active requests that need processing. The primary output is `fitting_requests`, the requests for which resources are reserved at the current step. Another output is `paused_requests`, which supports request pausing in the C++ runtime.

`MicroBatchScheduler`
: This scheduler selects some requests from the `fitting_requests` chosen by `CapacityScheduler`. Another input is `inflight_request_ids`, which supports pipeline parallelism or overlapped execution in the C++ runtime. Since the PyTorch flow does not support pipeline parallelism, `inflight_request_ids` is an empty set. The outputs are `context_requests` and `generation_requests`, the scheduled context and generation requests. Requests not in these lists are not included in the model forward pass.
`SimpleScheduler` combines these two schedulers, first running `CapacityScheduler` and then `MicroBatchScheduler`, to produce the final schedule. The inputs to `SimpleScheduler` are `active_requests` and `inflight_request_ids`, and the outputs are `context_requests`, `generation_requests`, and `paused_requests`.
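To make the two-stage flow concrete, here is a minimal sketch of how a `SimpleScheduler`-style wrapper might chain the two schedulers. It is an illustration only: the constructor arguments and the `schedule_request`/`schedule` method names on the wrapped schedulers are assumptions, not the exact TensorRT-LLM API.

```python
class TwoStageScheduler:
    """Illustrative two-stage scheduler: capacity scheduling, then micro-batch scheduling."""

    def __init__(self, capacity_scheduler, micro_batch_scheduler):
        self.capacity_scheduler = capacity_scheduler
        self.micro_batch_scheduler = micro_batch_scheduler

    def schedule_request(self, active_requests, inflight_request_ids):
        # Stage 1: reserve resources (e.g., KV cache blocks) for the requests that fit.
        fitting_requests, paused_requests = self.capacity_scheduler.schedule_request(
            active_requests)
        # Stage 2: split the fitting requests into context and generation batches.
        context_requests, generation_requests = self.micro_batch_scheduler.schedule(
            fitting_requests, inflight_request_ids)
        return context_requests, generation_requests, paused_requests
```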
## Customize Your Own Scheduler
To customize the scheduler or batching mechanism, implement your own `CapacityScheduler` and `MicroBatchScheduler` by inheriting from the respective classes. If two-step scheduling is unnecessary, inherit from `RequestScheduler` and implement `schedule_request` directly, as in the sketch below.
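The class below is a minimal, hypothetical sketch of that single-stage option: it schedules up to a fixed number of requests per step with no KV cache capacity check, purely to illustrate the interface. It assumes `RequestScheduler` and `LlmRequestState` are importable from the scheduler module, and that `schedule_request` receives the active requests plus the in-flight request IDs and returns context, generation, and paused request lists; treat these signatures as assumptions rather than the exact API.

```python
class NaiveScheduler(RequestScheduler):
    """Hypothetical single-stage scheduler with no capacity/micro-batch split.

    It performs no resource estimation, so it illustrates the interface only
    and is not a production scheduling policy.
    """

    def __init__(self, max_num_requests: int):
        super().__init__()
        self.max_num_requests = max_num_requests

    def schedule_request(self, active_requests, inflight_request_ids):
        # inflight_request_ids is an empty set in the PyTorch flow, so it is ignored here.
        context_requests, generation_requests = [], []
        for request in active_requests:
            if len(context_requests) + len(generation_requests) >= self.max_num_requests:
                break
            if request.state == LlmRequestState.CONTEXT_INIT:
                context_requests.append(request)     # prompt not yet processed
            else:
                generation_requests.append(request)  # already decoding tokens
        # No capacity scheduling is performed, so no request is ever paused.
        return context_requests, generation_requests, []
```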
An example of a `CapacityScheduler` implementation is the `GuaranteedNoEvictScheduler` class found in scheduler.py. This class was used before the C++ `CapacityScheduler` binding was adopted, when the PyTorch flow initially employed a Python-based scheduler. It inherits from `CapacityScheduler` and implements its own `schedule_request` method, which processes all `active_requests` and tries to schedule as many additional requests as can fit in the KV cache. The resource estimation must be consistent with how resources are allocated and freed in `kv_cache_manager`. Here is the code snippet:
```python
class GuaranteedNoEvictScheduler(CapacityScheduler):
    # Only schedule requests whose state satisfies
    # no_schedule_until_state <= state < no_schedule_after_state.
    no_schedule_until_state = LlmRequestState.CONTEXT_INIT
    no_schedule_after_state = LlmRequestState.GENERATION_COMPLETE

    def __init__(self, max_num_requests: int, kv_cache_manager):
        super(GuaranteedNoEvictScheduler, self).__init__()
        self.max_num_requests = max_num_requests
        self.kv_cache_manager = kv_cache_manager

    def schedule_request(
        self, active_requests: RequestList
    ) -> tuple[list[LlmRequest], list[LlmRequest]]:
        scheduled_requests = []
        pending_requests = []
        reserved_blocks = 0
        max_blocks = self.kv_cache_manager.get_max_resource_count()
        for request in active_requests:
            req_state = request.state
            # Skip requests that cannot be scheduled yet or should no longer be scheduled.
            if (req_state.value < self.no_schedule_until_state.value
                    or req_state.value >= self.no_schedule_after_state.value):
                continue
            if (len(scheduled_requests) >= self.max_num_requests
                    or reserved_blocks >= max_blocks):
                break
            elif (req_state == LlmRequestState.GENERATION_IN_PROGRESS
                  or req_state == LlmRequestState.GENERATION_TO_COMPLETE):
                # Generation requests are scheduled first and reserve enough blocks
                # to run to completion, so they are never evicted.
                scheduled_requests.append(request)
                reserved_blocks += self.kv_cache_manager.get_needed_resource_to_completion(
                    request)
            else:
                pending_requests.append(request)

        available_blocks = max_blocks - reserved_blocks
        for request in pending_requests:
            req_state = request.state
            if len(scheduled_requests) >= self.max_num_requests:
                break
            elif req_state == LlmRequestState.CONTEXT_INIT:
                # A new context request is scheduled only if the blocks it needs to
                # reach completion fit in the remaining KV cache capacity.
                needed_blocks = self.kv_cache_manager.get_needed_resource_to_completion(
                    request)
                if needed_blocks <= available_blocks:
                    scheduled_requests.append(request)
                    available_blocks -= needed_blocks
                elif needed_blocks > available_blocks:
                    # If one request fails to be scheduled, stop scheduling further requests.
                    break

        assert len(scheduled_requests) > 0, (
            "no pending request can get enough resource to complete, "
            "please increase KV cache pool size.")
        return scheduled_requests, []
```
After implementing your own scheduler, integrate it into the `PyExecutor`. For the PyTorch backend, the relevant code is in pytorch_model_registry.py. In the `create_pytorch_model_based_executor` function, the following two lines create the `CapacityScheduler`:
```python
capacitor_scheduler = BindCapacityScheduler(max_num_requests,
                                            kv_cache_manager.impl)
```
Similar adjustments can be made for `MicroBatchScheduler`. This allows the `PyExecutor` to execute with your customized scheduling logic.
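To have the `PyExecutor` use the `GuaranteedNoEvictScheduler` example from above instead of the C++ binding, the scheduler can be swapped at that point. The line below is a sketch of that substitution; it reuses the variable names from the snippet above and assumes the Python `kv_cache_manager` wrapper (rather than `kv_cache_manager.impl`) is what the custom scheduler's resource-estimation calls expect.

```python
# Sketch: replace the binding-based capacity scheduler with the custom Python one.
# The Python kv_cache_manager is passed (not kv_cache_manager.impl) because
# GuaranteedNoEvictScheduler calls its get_max_resource_count() and
# get_needed_resource_to_completion() methods directly.
capacitor_scheduler = GuaranteedNoEvictScheduler(max_num_requests, kv_cache_manager)
```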