# Disaggregated Inference Benchmark Scripts
This directory contains scripts to run disaggregated inference benchmarks using TensorRT-LLM and SLURM.
## Overview
The benchmarking process is orchestrated through a set of shell scripts and a Python script that work together:
- `submit.sh`: The main entry point for submitting benchmark jobs to SLURM. It runs a parameter sweep by calling `sbatch` with different configurations.
- `disaggr_torch.slurm`: The SLURM script that sets up and runs a single benchmark experiment. It launches a container, generates a configuration file, starts the server and workers, and runs the benchmark client.
- `gen_yaml.py`: A Python script that generates the `config.yaml` file needed by `trtllm-serve`. It determines the server and worker configuration based on SLURM environment variables and script arguments.
- `start_worker.sh`: A shell script responsible for starting a `trtllm-serve disaggregated_mpi_worker` on each allocated machine.
- `run_benchmark.sh`: A shell script that waits for the server to be healthy and then runs the actual benchmark client (`run_benchmark.py`, not included in this directory).
## File Descriptions
### `submit.sh`
This script is used to submit multiple SLURM jobs for running benchmarks with different parameters. It iterates through various configurations and uses `sbatch` to submit `disaggr_torch.slurm` for each one.
Usage:
```bash
./submit.sh
```
You can modify the loops in this script to change the parameter space for the benchmark sweep.
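As a sketch of what such a sweep looks like, the loop below varies only the generation batch size and holds every other argument to `disaggr_torch.slurm` fixed; the values and the `sbatch` options are illustrative, not the ones shipped in `submit.sh`:

```bash
#!/bin/bash
# Hypothetical sweep: vary gen_batch_size, keep everything else fixed.
# Argument order follows the disaggr_torch.slurm list documented below.
ctx_args="1 4 256 8192 true"                        # context-server settings
for gen_batch_size in 64 128 256; do
    gen_args="1 8 ${gen_batch_size} 4096 true 0.9"  # generation-server settings
    sbatch disaggr_torch.slurm \
        ${ctx_args} ${gen_args} "1 2 4 8" "bs${gen_batch_size}"
done
```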
### `disaggr_torch.slurm`
This is the core SLURM script for a single benchmark run. It is not meant to be run directly, but rather submitted via `sbatch` (e.g., by `submit.sh`).
It takes the following arguments in order (an example invocation is shown after the list):
- `num_ctx_servers`: Number of context servers.
- `ctx_tp_size`: Tensor parallel size for context servers.
- `ctx_batch_size`: Max batch size for context servers.
- `ctx_max_num_tokens`: Max number of tokens for context servers.
- `ctx_enable_attention_dp`: `true` or `false` to enable attention DP for context servers.
- `num_gen_servers`: Number of generation servers.
- `gen_tp_size`: Tensor parallel size for generation servers.
- `gen_batch_size`: Max batch size for generation servers.
- `gen_max_num_tokens`: Max number of tokens for generation servers.
- `gen_enable_attention_dp`: `true` or `false` to enable attention DP for generation servers.
- `gen_gpu_memory_fraction`: GPU memory fraction for generation servers.
- `concurrency_list`: A space-separated list of concurrencies to test (e.g., "1 2 4 8").
- `sub_file`: A subdirectory name for logs.
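Putting the argument order together, a hypothetical submission (all values are illustrative) could look like:

```bash
# Positional arguments, in the order listed above (illustrative values):
#   num_ctx_servers ctx_tp_size ctx_batch_size ctx_max_num_tokens ctx_enable_attention_dp
#   num_gen_servers gen_tp_size gen_batch_size gen_max_num_tokens gen_enable_attention_dp
#   gen_gpu_memory_fraction concurrency_list sub_file
sbatch disaggr_torch.slurm \
    2 4 256 8192 true \
    1 8 128 4096 true \
    0.9 "1 2 4 8 16" "ctx2_gen1_run"
```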
### `gen_yaml.py`
This Python script generates the `config.yaml` file that configures the `trtllm-serve` application. It reads SLURM environment variables (`SLURM_JOB_NODELIST`, `SLURM_TASKS_PER_NODE`) to distribute workers across nodes.
Usage:
The script is called from within `disaggr_torch.slurm`. It takes numerous arguments to define the model, parallelism, and server configurations.
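The flags themselves are defined in `gen_yaml.py`, so they are not repeated here; what the script relies on from the SLURM environment looks roughly like this (values are examples only):

```bash
# Set automatically inside a SLURM allocation; shown only to illustrate the
# inputs gen_yaml.py uses when distributing workers across nodes.
echo "$SLURM_JOB_NODELIST"    # e.g. node[001-004]  (compact node list)
echo "$SLURM_TASKS_PER_NODE"  # e.g. 8(x4)          (8 tasks on each of 4 nodes)

# Standard SLURM way to expand the compact node list, one hostname per line:
scontrol show hostnames "$SLURM_JOB_NODELIST"
```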
### `start_worker.sh`
This script starts a `trtllm-serve disaggregated_mpi_worker`. It is launched by `srun` from the `disaggr_torch.slurm` script on all allocated nodes.
Arguments:
- `config_file`: Path to the `config.yaml` file.
- `enable_pdl`: `true` or `false`.
- `ctx_gpus`: Number of GPUs used for the context phase.
- `work_dir`: (Optional) Directory to store nsys profiling output.
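A rough sketch of how `disaggr_torch.slurm` might launch the workers with these arguments; the `srun` options and paths are placeholders that depend on the cluster and job configuration:

```bash
# Launch one worker task per allocated slot; options and paths are illustrative.
srun --ntasks="${SLURM_NTASKS}" --mpi=pmix \
    ./start_worker.sh /path/to/config.yaml true 16 /path/to/logs
```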
### `run_benchmark.sh`
This script orchestrates the execution of the benchmark client. It waits for the `config.yaml` to be created and for the server's `/health` endpoint to respond, then it runs the benchmark.
Arguments:
- `isl`: Input sequence length.
- `osl`: Output sequence length.
- `multi_round`: Number of rounds for the benchmark.
- `model_name`: Name of the model being benchmarked.
- `concurrency_list`: Space-separated list of concurrencies.
- `streaming`: `true` or `false`.
- `log_path`: Path to the log directory.
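The readiness wait described above boils down to polling the `/health` endpoint; a minimal sketch of that pattern (hostname and port are placeholders, the real values come from `config.yaml`):

```bash
# Block until trtllm-serve answers on /health; host and port are placeholders.
until curl -sf "http://localhost:8000/health" > /dev/null; do
    echo "Waiting for the server to become healthy..."
    sleep 10
done
```

An invocation matching the argument order above might then look like `./run_benchmark.sh 1024 1024 1 <model_name> "1 2 4 8" true ./logs`, with the model name and log path supplied by `disaggr_torch.slurm`.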
## Workflow
1. The user runs `./submit.sh`.
2. `submit.sh` submits one or more jobs to SLURM by calling `sbatch disaggr_torch.slurm` with different parameters.
3. For each job, SLURM allocates resources and runs `disaggr_torch.slurm`.
4. `disaggr_torch.slurm` runs `gen_yaml.py` to create a `config.yaml`.
5. `disaggr_torch.slurm` uses `srun` to launch `start_worker.sh` on all nodes, starting the MPI workers.
6. `disaggr_torch.slurm` starts the main `trtllm-serve` process.
7. `disaggr_torch.slurm` runs `run_benchmark.sh`, which waits for the server to be ready.
8. `run_benchmark.sh` executes the benchmark for each concurrency level specified.
9. After the benchmark, `run_benchmark.sh` and `disaggr_torch.slurm` attempt to kill the server and worker processes.
10. Logs for each run are stored in a subdirectory specified by the `sub_file` parameter.
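Standard SLURM tooling is enough to keep an eye on a sweep once it is submitted; the log filename below is a placeholder, since the actual output locations are set by `disaggr_torch.slurm` and the `sub_file` argument:

```bash
# List your queued and running benchmark jobs.
squeue -u "$USER"

# Follow a job's output (placeholder path; actual log locations are
# determined by disaggr_torch.slurm and the sub_file argument).
tail -f slurm-<job_id>.out
```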