# Disaggregated Inference Benchmark Scripts
This directory contains scripts to run disaggregated inference benchmarks using TensorRT-LLM and SLURM.
## Overview
The benchmarking process is orchestrated through a set of shell scripts and a Python script that work together:
- `submit.sh`: The main entry point for submitting benchmark jobs to SLURM. It runs a parameter sweep by calling `sbatch` with different configurations.
- `disaggr_torch.slurm`: The SLURM script that sets up and runs a single benchmark experiment. It launches a container, generates a configuration file, starts the server and workers, and runs the benchmark client.
- `gen_yaml.py`: A Python script that generates the `config.yaml` file needed by `trtllm-serve`. It determines the server and worker configuration based on SLURM environment variables and script arguments.
- `start_worker.sh`: A shell script responsible for starting a `trtllm-serve disaggregated_mpi_worker` on each allocated machine.
- `run_benchmark.sh`: A shell script that waits for the server to become healthy and then runs the actual benchmark client (`run_benchmark.py`, not included in this directory).
## File Descriptions
### submit.sh
This script submits multiple SLURM jobs for running benchmarks with different parameters. It iterates through various configurations and uses `sbatch` to submit `disaggr_torch.slurm` for each one.
Usage:

```shell
./submit.sh
```
You can modify the loops in this script to change the parameter space for the benchmark sweep.
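The sweep is essentially a set of nested loops around `sbatch`. A minimal sketch of what such a loop might look like, where the swept parameters, their values, and the fixed arguments are all illustrative rather than copied from the real script (the commands are collected and printed as a dry run instead of submitted):

```shell
#!/bin/bash
# Dry-run sketch of a submit.sh-style sweep. All parameter values are
# examples only; swap the final printf for real `sbatch` calls to submit.
concurrency_list="1 2 4 8"
cmds=()
for ctx_tp_size in 2 4; do
  for gen_tp_size in 4 8; do
    sub_file="ctx_tp${ctx_tp_size}_gen_tp${gen_tp_size}"
    cmds+=("sbatch disaggr_torch.slurm 1 ${ctx_tp_size} 64 4096 true 1 ${gen_tp_size} 1024 1024 true 0.8 '${concurrency_list}' ${sub_file}")
  done
done
# Print the commands that would be submitted (one job per combination).
printf '%s\n' "${cmds[@]}"
```

Each loop nesting level multiplies the number of submitted jobs, so a sweep over two parameters with two values each produces four jobs.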
### disaggr_torch.slurm
This is the core SLURM script for a single benchmark run. It is not meant to be run directly, but rather submitted via `sbatch` (e.g., by `submit.sh`).
It takes the following arguments in order:
1. `num_ctx_servers`: Number of context servers.
2. `ctx_tp_size`: Tensor parallel size for context servers.
3. `ctx_batch_size`: Max batch size for context servers.
4. `ctx_max_num_tokens`: Max number of tokens for context servers.
5. `ctx_enable_attention_dp`: `true` or `false` to enable attention DP for context servers.
6. `num_gen_servers`: Number of generation servers.
7. `gen_tp_size`: Tensor parallel size for generation servers.
8. `gen_batch_size`: Max batch size for generation servers.
9. `gen_max_num_tokens`: Max number of tokens for generation servers.
10. `gen_enable_attention_dp`: `true` or `false` to enable attention DP for generation servers.
11. `gen_gpu_memory_fraction`: GPU memory fraction for generation servers.
12. `concurrency_list`: A space-separated list of concurrencies to test (e.g., "1 2 4 8").
13. `sub_file`: A subdirectory name for logs.
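Putting the argument list together, a hand-written submission might look like the following. The values are illustrative, not tuned recommendations; the command is built in an array and echoed as a dry run, so run `"${cmd[@]}"` without the `echo` to actually submit:

```shell
# Example submission with all 13 positional arguments in documented order.
# Every value below is illustrative only.
cmd=(sbatch disaggr_torch.slurm
  1          # num_ctx_servers
  2          # ctx_tp_size
  64         # ctx_batch_size
  4096       # ctx_max_num_tokens
  true       # ctx_enable_attention_dp
  1          # num_gen_servers
  8          # gen_tp_size
  1024       # gen_batch_size
  1024       # gen_max_num_tokens
  true       # gen_enable_attention_dp
  0.8        # gen_gpu_memory_fraction
  "1 2 4 8"  # concurrency_list (quoted: it is a single argument)
  logs_run1  # sub_file
)
echo "${cmd[@]}"  # dry run; replace with "${cmd[@]}" to submit
```

Note that `concurrency_list` must be quoted so the space-separated values arrive as one positional argument.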
### gen_yaml.py
This Python script generates the `config.yaml` file that configures the `trtllm-serve` application. It reads SLURM environment variables (`SLURM_JOB_NODELIST`, `SLURM_TASKS_PER_NODE`) to distribute workers across nodes.
Usage:

The script is called from within `disaggr_torch.slurm`. It takes numerous arguments to define the model, parallelism, and server configurations.
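For reference, the two SLURM variables the script reads use SLURM's compressed formats: a bracketed hostname range for the node list, and a task count with an optional `(xN)` repeat factor for tasks per node. The example values below are hypothetical:

```shell
# Example shapes of the SLURM environment variables gen_yaml.py reads.
# Values are hypothetical; SLURM sets these automatically inside a job.
export SLURM_JOB_NODELIST="node[01-04]"   # compressed hostname range
export SLURM_TASKS_PER_NODE="8(x4)"       # 8 tasks on each of 4 nodes
echo "${SLURM_JOB_NODELIST} ${SLURM_TASKS_PER_NODE}"
```

Outside a SLURM job these variables are unset, which is why the script is only invoked from within `disaggr_torch.slurm`.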
### start_worker.sh
This script starts a `trtllm-serve disaggregated_mpi_worker`. It is launched by `srun` from the `disaggr_torch.slurm` script on all allocated nodes.
Arguments:

- `config_file`: Path to the `config.yaml` file.
- `enable_pdl`: `true` or `false`.
- `ctx_gpus`: Number of GPUs used for the context phase.
- `work_dir`: (Optional) Directory to store nsys profiling output.
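As an illustration, the launch from `disaggr_torch.slurm` might resemble the command below. The `srun` flag and all paths are assumptions for the sketch, not the script's actual values, and the command is echoed as a dry run:

```shell
# Hypothetical srun launch of start_worker.sh on the allocated nodes.
# The srun flag and paths are illustrative; echoed here as a dry run.
cmd=(srun --ntasks-per-node=8
  ./start_worker.sh
  config.yaml   # config_file
  true          # enable_pdl
  8             # ctx_gpus
  ./logs        # work_dir (optional; enables nsys profiling output)
)
echo "${cmd[@]}"
```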
### run_benchmark.sh
This script orchestrates the execution of the benchmark client. It waits for the `config.yaml` to be created and for the server's `/health` endpoint to respond, then it runs the benchmark.
Arguments:

- `isl`: Input sequence length.
- `osl`: Output sequence length.
- `multi_round`: Number of rounds for the benchmark.
- `model_name`: Name of the model being benchmarked.
- `concurrency_list`: Space-separated list of concurrencies.
- `streaming`: `true` or `false`.
- `log_path`: Path to the log directory.
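The readiness wait can be sketched as a polling loop against the `/health` endpoint. This is a minimal version of the idea; the URL, poll interval, and timeout are assumptions, not the actual values used by `run_benchmark.sh`:

```shell
# Minimal sketch of waiting for the server's /health endpoint to respond.
# URL, poll interval, and timeout are assumptions for illustration.
wait_for_health() {
  local url=$1 timeout=${2:-1800} waited=0
  until curl -sf "$url" > /dev/null 2>&1; do
    sleep 5
    waited=$((waited + 5))
    if [ "$waited" -ge "$timeout" ]; then
      return 1  # gave up: server never became healthy
    fi
  done
  return 0
}
```

A caller would then gate the client on it, e.g. `wait_for_health "http://${host}:${port}/health" && python run_benchmark.py ...` (hostname and port here are placeholders).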
## Workflow
1. The user runs `./submit.sh`.
2. `submit.sh` submits one or more jobs to SLURM by calling `sbatch disaggr_torch.slurm` with different parameters.
3. For each job, SLURM allocates resources and runs `disaggr_torch.slurm`.
4. `disaggr_torch.slurm` runs `gen_yaml.py` to create a `config.yaml`.
5. `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes, starting the MPI workers.
6. `disaggr_torch.slurm` starts the main `trtllm-serve` process.
7. `disaggr_torch.slurm` runs `run_benchmark.sh`, which waits for the server to be ready.
8. `run_benchmark.sh` executes the benchmark for each concurrency level specified.
9. After the benchmark, `run_benchmark.sh` and `disaggr_torch.slurm` attempt to kill the server and worker processes.
10. Logs for each run are stored in a subdirectory specified by the `sub_file` parameter.