Deployment Guide for Qwen3.5 on TensorRT LLM - Blackwell & Hopper Hardware#
Introduction#
This deployment guide provides step-by-step instructions for running the Qwen3.5-397B-A17B model using TensorRT LLM. It covers model access, environment setup, server configuration, and inference validation.
Prerequisites#
GPU: NVIDIA Blackwell or Hopper Architecture
OS: Linux
Drivers: CUDA Driver 575 or Later
Docker with NVIDIA Container Toolkit installed
Python3 and python3-pip (Optional, for accuracy evaluation only)
Models#
Qwen/Qwen3.5-397B-A17B (base, BF16)
GPU Requirements#
The NVFP4 checkpoint is the recommended (and minimum-footprint) deployment precision for Qwen3.5. It quantizes the linear layers in the MoE blocks to NVFP4 and uses an FP8 KV cache.
| Platform | Minimum GPUs |
|---|---|
| B200 | 4x B200 |
| B300 | 4x B300 |
| GB200 | 4x GB200 |
| GB300 | 4x GB300 |
The NVFP4 checkpoint has been validated on B200 with tensor_parallel_size = 4. A single node of 4 Blackwell GPUs fits the NVFP4 weights plus the KV cache with headroom.
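As a rough cross-check of the footprint claim, the arithmetic below estimates the weight size only. It is a sketch under loud assumptions: every one of the 397B parameters is treated as NVFP4 (4-bit values plus one 8-bit scale per 16 values, about 4.5 bits per weight). Because the checkpoint quantizes only the MoE linear layers, the real figure is somewhat higher, and the KV cache comes on top of it.

```python
def weight_size_gb(num_params: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight size in GB (10^9 bytes) at the given bit width.

    The 4.5-bit default is an assumption: 4-bit values plus one 8-bit
    scale shared by each block of 16 values.
    """
    return num_params * bits_per_weight / 8 / 1e9


total_gb = weight_size_gb(397e9)   # assumes all 397B parameters are NVFP4
per_gpu_gb = total_gb / 4          # tensor_parallel_size = 4
print(f"about {total_gb:.0f} GB of weights, about {per_gpu_gb:.0f} GB per GPU")
```

Compare the per-GPU figure with the memory reported by nvidia-smi on your platform to see how much room remains for the KV cache.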
Deployment Steps#
Run Docker Container#
Run the docker container using the TensorRT LLM NVIDIA NGC image.
docker run --rm -it \
--ipc=host \
--gpus all \
-p 8000:8000 \
-v ~/.cache:/root/.cache:rw \
--name tensorrt_llm \
nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc19 \
/bin/bash
Note:
The command mounts your user .cache directory to save the downloaded model checkpoints, which are saved to ~/.cache/huggingface/hub/ by default. This prevents having to redownload the weights each time you rerun the container. If the ~/.cache directory doesn’t exist, please create it using mkdir ~/.cache.
You can mount additional directories and paths using the -v <host_path>:<container_path> flag if needed, such as mounting the downloaded weight paths.
The command also maps port 8000 from the container to your host so you can access the LLM API endpoint from your host.
See https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags for all the available containers. The containers published weekly from the main branch have an rcN suffix, while the monthly release with QA tests has no rcN suffix. Use the rc release to get the latest model and feature support.
If you want to use the latest main branch, you can build TensorRT LLM from source; for the steps, refer to https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source.html
Recommended Performance Settings#
We maintain YAML configuration files with recommended performance settings in the examples/configs directory. These config files are present in the TensorRT LLM container at the path /app/tensorrt_llm/examples/configs. You can use these out-of-the-box, or adjust them to your specific use case.
TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
EXTRA_LLM_API_FILE=${TRTLLM_DIR}/examples/configs/curated/qwen3.5.yaml
Note: if you don’t have access to the source code locally, you can manually create the YAML config file using the commands below.
EXTRA_LLM_API_FILE=/tmp/config.yml
cat << EOF > ${EXTRA_LLM_API_FILE}
max_batch_size: 512
max_num_tokens: 2048
tensor_parallel_size: 4
moe_expert_parallel_size: 4
trust_remote_code: true
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 256
moe_config:
  backend: CUTEDSL
kv_cache_config:
  free_gpu_memory_fraction: 0.8
  enable_block_reuse: false
num_postprocess_workers: 4
EOF
The config is a starting point tuned for max throughput on 4x B200; adjust the parallelism, batch sizes, and KV cache fraction to match your hardware and traffic pattern.
Launch the TensorRT LLM Server#
Below is an example command to launch the TensorRT LLM server with the Qwen3.5 NVFP4 model from within the container.
trtllm-serve nvidia/Qwen3.5-397B-A17B-NVFP4 --host 0.0.0.0 --port 8000 --reasoning_parser qwen3_5 --tool_parser qwen3 --config ${EXTRA_LLM_API_FILE}
Qwen3.5 uses the qwen3_5 reasoning parser (its chat template pre-injects a <think> block, so reasoning starts at the beginning of the response). The qwen3 tool parser handles the Qwen3 function-call format.
After the server is set up, the client can now send prompt requests to the server and receive results.
LLM API Options (YAML Configuration)#
These options provide control over TensorRT LLM’s behavior and are set within the YAML file passed to the trtllm-serve command via the --config argument.
tensor_parallel_size#
Description: Sets the tensor-parallel size. This should typically match the number of GPUs you intend to use for a single model instance.
moe_expert_parallel_size#
Description: Sets the expert-parallel size for Mixture-of-Experts (MoE) models. Like tensor_parallel_size, this should generally match the number of GPUs you’re using. This setting has no effect on non-MoE models.
enable_attention_dp#
Description: Enables attention data parallelism for the attention/linear-attention layers while keeping the MoE expert-parallel. This generally improves throughput at high concurrency and long context.
kv_cache_config.free_gpu_memory_fraction#
Description: A value between 0.0 and 1.0 that specifies the fraction of free GPU memory to reserve for the KV cache after the model is loaded. Since memory usage can fluctuate, this buffer helps prevent out-of-memory (OOM) errors.
Recommendation: If you experience OOM errors, try reducing this value to 0.7 or lower.
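The effect of this fraction can be sketched with simple arithmetic. The GPU capacity and post-load usage below are hypothetical placeholders, not measured values; substitute the numbers nvidia-smi reports once the model has loaded.

```python
def kv_cache_budget_gb(gpu_total_gb: float, used_after_load_gb: float,
                       fraction: float) -> float:
    """Memory handed to the KV cache: a fraction of what is free after load."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    free_gb = max(gpu_total_gb - used_after_load_gb, 0.0)
    return free_gb * fraction


# Hypothetical example: a 180 GB GPU with 60 GB in use after model load.
for fraction in (0.8, 0.7):
    budget = kv_cache_budget_gb(180.0, 60.0, fraction)
    print(f"fraction={fraction}: about {budget:.0f} GB for the KV cache")
```

Lowering the fraction shrinks the KV cache and leaves a larger buffer for activations and other transient allocations, which is why it is the first knob to try after an OOM error.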
max_batch_size#
Description: The maximum number of user requests that can be grouped into a single batch for processing. The actual max batch size that can be achieved depends on total sequence length (input + output).
max_num_tokens#
Description: The maximum total number of tokens (across all requests) allowed inside a single scheduled batch.
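To get a feel for how max_num_tokens interacts with batching, the sketch below uses a deliberately simplified model. It assumes each request in the generation phase consumes one token per step and each new request consumes its whole prompt in one step (no chunked prefill, no speculative decoding). The real scheduler may behave differently.

```python
def context_requests_per_step(max_num_tokens: int, decode_requests: int,
                              prompt_len: int) -> int:
    """New prompts that fit in one scheduled batch under the simplified model."""
    if prompt_len <= 0:
        raise ValueError("prompt_len must be positive")
    remaining = max(max_num_tokens - decode_requests, 0)
    return remaining // prompt_len


# With max_num_tokens: 2048 from the sample config and 1024-token prompts:
print(context_requests_per_step(2048, decode_requests=0, prompt_len=1024))
print(context_requests_per_step(2048, decode_requests=512, prompt_len=1024))
```

Under these assumptions, a larger max_num_tokens admits more prompts per step at the cost of more activation memory, so it trades time-to-first-token against memory headroom.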
max_seq_len#
Description: The maximum possible sequence length for a single request, including both input and generated output tokens. This guide does not set it explicitly; it is inferred from the model config.
trust_remote_code#
Description: Allows custom model and tokenizer code shipped in the Hugging Face model repository to be executed when the model is loaded. This flag is passed directly to the Hugging Face API.
cuda_graph_config#
Description: A section for configuring CUDA graphs to optimize performance.
Options:
enable_padding: If true, input batches are padded to the nearest cuda_graph_batch_size. This can significantly improve performance.
Default: false
max_batch_size: Sets the maximum batch size for which a CUDA graph will be created.
Default: 0
Recommendation: Set this to the same value as the --max_batch_size command-line option.
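The padding behaviour can be illustrated with a small lookup. The list of captured batch sizes here is hypothetical (the sizes TensorRT LLM actually captures are not listed in this guide), and returning None for batches beyond the largest graph is an assumption made for the sketch.

```python
import bisect
from typing import Optional, Sequence


def padded_batch_size(batch_size: int,
                      graph_batch_sizes: Sequence[int]) -> Optional[int]:
    """Smallest captured batch size that can hold batch_size, or None."""
    sizes = sorted(graph_batch_sizes)
    index = bisect.bisect_left(sizes, batch_size)
    return sizes[index] if index < len(sizes) else None


# Hypothetical captured sizes, capped at cuda_graph_config.max_batch_size = 256.
captured = [1, 2, 4, 8, 16, 32, 64, 128, 256]
for batch in (5, 256, 300):
    print(batch, "->", padded_batch_size(batch, captured))
```

A batch of 5 would run with the graph captured for 8, while a batch larger than the biggest captured size has no graph to reuse in this model.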
moe_config#
Description: Configuration for Mixture-of-Experts (MoE) models.
Options:
backend: The backend to use for MoE operations. Default: CUTLASS
See the TorchLlmArgs class for the full list of options which can be used in the YAML configuration file.
Testing API Endpoint#
Basic Test#
Start a new terminal on the host to test the TensorRT LLM server you just launched.
You can query the health/readiness of the server using:
curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"
When the Status: 200 code is returned, the server is ready for queries. Note that the very first query may take longer due to initialization and compilation.
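Because loading a model of this size can take a while, it may be convenient to poll the health endpoint instead of retrying by hand. The helper below is a minimal sketch using only the Python standard library; the function names, retry count, and interval are illustrative choices, not part of TensorRT LLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def health_status(url: str, timeout_s: float = 5.0) -> int:
    """Return the HTTP status of a GET on url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:   # server answered with an error code
        return err.code
    except (urllib.error.URLError, OSError):  # refused, reset, or timed out
        return 0


def wait_until_ready(probe: Callable[[], int], attempts: int = 60,
                     interval_s: float = 10.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() until it returns 200 or the attempts run out."""
    for _ in range(attempts):
        if probe() == 200:
            return True
        sleep(interval_s)
    return False
```

Calling wait_until_ready(lambda: health_status("http://localhost:8000/health")) blocks until the server reports ready or roughly ten minutes have passed with the defaults shown.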
After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "nvidia/Qwen3.5-397B-A17B-NVFP4",
"messages": [
{
"role": "user",
"content": "Where is New York?"
}
],
"max_tokens": 1024,
"top_p": 1.0
}' -w "\n"
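The same request can be sent from Python with the standard library. This is a sketch, not an official client: the helper names are illustrative, and the response parsing assumes the usual OpenAI-style chat completion layout (choices[0].message.content), which you should confirm against an actual response.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 1024,
                       top_p: float = 1.0) -> urllib.request.Request:
    """Build a POST request mirroring the curl example."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout_s: float = 300.0) -> str:
    """Send the request and return the assistant text (assumed response layout)."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

For example, send_chat(build_chat_request("http://localhost:8000", "nvidia/Qwen3.5-397B-A17B-NVFP4", "Where is New York?")) returns the model's answer once the server is up.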
Troubleshooting Tips#
If you encounter CUDA out-of-memory errors, try reducing max_batch_size, max_num_tokens, or kv_cache_config.free_gpu_memory_fraction.
Ensure your model checkpoints are compatible with the expected format.
For performance issues, check GPU utilization with nvidia-smi while the server is running.
If the container fails to start, verify that the NVIDIA Container Toolkit is properly installed.
For connection issues, make sure the server port (8000 in this guide) is not being used by another application.
Reasoning is controlled with --reasoning_parser qwen3_5. To toggle thinking per request, pass enable_thinking through chat_template_kwargs in the request body, for example {"chat_template_kwargs": {"enable_thinking": true}} (set it to false to disable reasoning).
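The per-request thinking toggle amounts to one extra key in the request body. The helper below is an illustrative sketch (with_thinking is not a TensorRT LLM function) that adds the chat_template_kwargs entry to an existing payload without modifying the original.

```python
import copy
import json


def with_thinking(payload: dict, enabled: bool) -> dict:
    """Return a copy of payload with enable_thinking set in chat_template_kwargs."""
    updated = copy.deepcopy(payload)
    updated.setdefault("chat_template_kwargs", {})["enable_thinking"] = enabled
    return updated


base = {
    "model": "nvidia/Qwen3.5-397B-A17B-NVFP4",
    "messages": [{"role": "user", "content": "Where is New York?"}],
    "max_tokens": 1024,
}
# Body to send when reasoning should be turned off for this request.
print(json.dumps(with_thinking(base, False), indent=2))
```

Any other keys already present in chat_template_kwargs are kept, so the helper can be layered on top of existing request templates.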
Benchmarking Performance#
To benchmark the performance of your TensorRT LLM server you can leverage the built-in benchmark_serving.py script. To do this, first create a wrapper bench.sh script.
cat <<'EOF' > bench.sh
#!/usr/bin/env bash
set -euo pipefail
MODEL_NAME="nvidia/Qwen3.5-397B-A17B-NVFP4"
concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/qwen3_5_output
for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${MODEL_NAME} \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --random-ids \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --tokenize-on-client \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
EOF
chmod +x bench.sh
To reach maximum throughput with attention DP enabled, sweep the concurrency up to max_batch_size * num_gpus.
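The upper bound of the sweep follows directly from that product. As a small sketch, the helper below builds a powers-of-two list that ends at the bound, using max_batch_size: 512 from the sample config and 4 GPUs; the doubling schedule itself is just one reasonable choice.

```python
from typing import List


def concurrency_sweep(max_batch_size: int, num_gpus: int) -> List[int]:
    """Powers of two from 1 up to max_batch_size * num_gpus, ending at the bound."""
    target = max_batch_size * num_gpus
    levels = []
    level = 1
    while level < target:
        levels.append(level)
        level *= 2
    levels.append(target)  # always include the upper bound itself
    return levels


# Space-separated string that can be pasted into concurrency_list in bench.sh.
print(" ".join(str(level) for level in concurrency_sweep(512, 4)))
```

With these values the sweep tops out at 2048, well beyond the 256 used in the sample concurrency_list.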
If you want to save the results to a file add the following options.
--save-result \
--result-dir "${result_dir}" \
--result-filename "concurrency_${concurrency}.json"
For more benchmarking options see benchmark_serving.py
Run bench.sh to begin a serving benchmark. This will take a long time if you run all the concurrencies mentioned in the above bench.sh script.
./bench.sh
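If the runs were saved with --save-result, the per-concurrency files can be gathered for a quick side-by-side look. The sketch below makes no assumption about metric names: it only assumes each file is a JSON object and keeps whatever top-level numeric fields it finds. The directory and filename pattern match the options shown earlier.

```python
import json
from pathlib import Path
from typing import Dict


def load_numeric_metrics(path: Path) -> Dict[str, float]:
    """Top-level numeric fields of one result file (booleans excluded)."""
    data = json.loads(Path(path).read_text())
    if not isinstance(data, dict):
        return {}
    return {
        key: value for key, value in data.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def summarize(result_dir: str) -> Dict[str, Dict[str, float]]:
    """Map each concurrency_*.json file name (without suffix) to its metrics."""
    return {
        path.stem: load_numeric_metrics(path)
        for path in sorted(Path(result_dir).glob("concurrency_*.json"))
    }


for run, metrics in summarize("/tmp/qwen3_5_output").items():
    print(run, metrics)
```

If the directory does not exist yet, the loop simply prints nothing.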