Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell#

Introduction#

This is a quickstart guide for running the Kimi K2 Thinking model on TensorRT LLM. It focuses on a working setup with recommended defaults.

Prerequisites#

  • GPU: NVIDIA Blackwell Architecture

  • OS: Linux

  • Drivers: CUDA Driver 575 or later

  • Docker with NVIDIA Container Toolkit installed

  • Python3 and python3-pip (Optional, for accuracy evaluation only)

Deploy Kimi K2 Thinking on DGX B200 through Docker#

Prepare Docker image#

Build and run the Docker container. See the Docker guide for details.

cd TensorRT-LLM

make -C docker release_build IMAGE_TAG=kimi-k2-thinking-local

make -C docker release_run IMAGE_NAME=tensorrt_llm IMAGE_TAG=kimi-k2-thinking-local LOCAL_USER=1

Launch the TensorRT LLM Server#

This YAML config deploys the model with 8-way expert parallelism for the MoE layers and 8-way attention data parallelism. It also enables trust_remote_code so that the custom Kimi K2 Thinking tokenizer can be loaded.
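A minimal sketch of what such a config could contain is shown below. The field names are assumptions based on the TensorRT LLM LLM API and should be verified against the shipped example config:

```yaml
# Illustrative sketch of an EXTRA_LLM_API_FILE -- verify field names
# against the LLM API reference; this is not the shipped config.
moe_expert_parallel_size: 8   # 8-way expert parallelism for the MoE layers
enable_attention_dp: true     # 8-way attention data parallelism
trust_remote_code: true       # needed for the custom Kimi K2 Thinking tokenizer
```

Point EXTRA_LLM_API_FILE at the path of this YAML file before launching the server.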

With the EXTRA_LLM_API_FILE prepared, use the following example command to launch the TensorRT LLM server with the Kimi-K2-Thinking-NVFP4 model from within the container.

trtllm-serve nvidia/Kimi-K2-Thinking-NVFP4 \
    --host 0.0.0.0 --port 8000 \
    --config ${EXTRA_LLM_API_FILE}

TensorRT LLM will load weights and select the best kernels during startup. The server is successfully launched when the following log is shown:

INFO:     Started server process [xxxxx]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)

You can query the health/readiness of the server using:

curl -s -o /dev/null -w "Status: %{http_code}\n" "http://localhost:8000/health"

When Status: 200 is returned, the server is ready for queries.
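The same readiness check can be scripted, for example in Python with only the standard library (a sketch; the endpoint path is taken from the curl command above):

```python
import urllib.request
import urllib.error

def server_ready(url: str = "http://localhost:8000/health") -> bool:
    """Return True once the /health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Server not up yet (connection refused) or still starting.
        return False
```

Call this in a loop with a sleep between attempts to block until the server finishes loading weights.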

Deploy Kimi K2 Thinking on GB200 NVL72 through SLURM with wide EP and disaggregated serving#

TensorRT LLM provides a set of SLURM scripts that can be easily configured through YAML files and automatically launch SLURM jobs on GB200 NVL72 clusters for deployment, benchmarking, and accuracy testing purposes. The scripts are located at examples/disaggregated/slurm/benchmark. Refer to this page for more details and example wide EP config files.

For Kimi K2 Thinking, an example configuration for SLURM arguments and the scripts is provided at examples/wide_ep/slurm_scripts/kimi-k2-thinking.yaml.

TRTLLM_DIR=/app/tensorrt_llm # change as needed to match your environment
SLURM_CONFIG_FILE=${TRTLLM_DIR}/examples/wide_ep/slurm_scripts/kimi-k2-thinking.yaml

Note: if you don't have a local checkout of the source code, you can create the YAML config file manually.

The config file includes SLURM-specific configurations, benchmark and hardware details, and environment settings. Its worker_config field holds detailed settings for the context and generation servers of the disaggregated deployment, each specified as a list of LLM API arguments.
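The overall shape described above can be sketched as follows. This skeleton is purely illustrative (field names and nesting are assumptions); refer to examples/wide_ep/slurm_scripts/kimi-k2-thinking.yaml for the real file:

```yaml
# Hypothetical skeleton -- consult the shipped example config for real keys.
slurm:                  # SLURM-specific settings (partition, account, ...)
  ...
benchmark:              # benchmark and hardware details
  ...
worker_config:
  context_servers:      # LLM API arguments for the context servers
    - ...
  generation_servers:   # LLM API arguments for the generation servers
    - ...
```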

To launch SLURM jobs with the YAML config file, execute the following command:

cd <TensorRT LLM root>/examples/disaggregated/slurm/benchmark
python3 submit.py -c ${SLURM_CONFIG_FILE}

Query the OpenAI-compatible API Endpoint#

After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json"  -d '{
    "model": "nvidia/Kimi-K2-Thinking-NVFP4",
    "messages": [
        {
            "role": "user",
            "content": "Where is New York?"
        }
    ],
    "max_tokens": 128,
    "top_p": 1.0
}' -w "\n"
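The same request can be issued from Python using only the standard library. This sketch mirrors the curl example above; the send step is commented out so the snippet stands alone:

```python
import json
import urllib.request

# Build the chat-completion request shown in the curl example above.
payload = {
    "model": "nvidia/Kimi-K2-Thinking-NVFP4",
    "messages": [{"role": "user", "content": "Where is New York?"}],
    "max_tokens": 128,
    "top_p": 1.0,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# To send the request (requires the server launched earlier to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, any OpenAI client library pointed at http://localhost:8000/v1 should also work.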

Example response:

{
  "id": "chatcmpl-5907ed752eb44d11a12893b19f79f8ca",
  "object": "chat.completion",
  "created": 1764866686,
  "model": "nvidia/Kimi-K2-Thinking-NVFP4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think> The user is asking a very simple question: \"Where is New York?\" This could be interpreted in a few ways:\n\n1. Where is New York State located?\n2. Where is New York City located?\n3. Where is New York located in relation to something else?\n\nGiven the ambiguity, I should provide a comprehensive answer that covers the main interpretations. I should be clear and direct.\n\nLet me structure my answer:\n- First, clarify that \"New York\" can refer to either New York State or New York City\n- For New York State: It's located in the northeastern United States, bordered by New Jersey, Pennsylvania, Connecticut",
        "reasoning_content": "",
        "reasoning": null,
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "length",
      "stop_reason": null,
      "mm_embedding_handle": null,
      "disaggregated_params": null,
      "avg_decoded_tokens_per_iter": 1.0
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 140,
    "completion_tokens": 128,
    "prompt_tokens_details": {
      "cached_tokens": 0
    }
  },
  "prompt_token_ids": null
}
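As the example response shows, the model's reasoning is emitted inline in the content field inside a <think> block (here cut off by the 128-token limit before the closing tag). A small helper to separate reasoning from the final answer, assuming the tags appear as literal text:

```python
def split_reasoning(content: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer) around <think>...</think>.

    If generation was truncated before the closing tag, everything after
    <think> is treated as reasoning and the answer is empty.
    """
    start = content.find("<think>")
    if start == -1:
        return "", content
    end = content.find("</think>", start)
    if end == -1:
        return content[start + len("<think>"):].strip(), ""
    reasoning = content[start + len("<think>"):end].strip()
    answer = content[end + len("</think>"):].strip()
    return reasoning, answer
```

For the truncated example above, this returns the whole reasoning text and an empty answer; raising max_tokens lets the model finish the <think> block and produce a final answer.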

Benchmark#

To benchmark the performance of your TensorRT LLM server, you can leverage the built-in benchmark_serving.py script. To do this, first create a wrapper bench.sh script.

cat <<'EOF' > bench.sh
#!/usr/bin/env bash
set -euo pipefail

concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/kimi_k2_thinking_output

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model nvidia/Kimi-K2-Thinking-NVFP4 \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --random-ids \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --tokenize-on-client \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
EOF
chmod +x bench.sh

If you want to save the results to a file, add the following options to the benchmark_serving command in bench.sh (they use the result_dir variable defined at the top of the script):

--save-result \
--result-dir "${result_dir}" \
--result-filename "concurrency_${concurrency}.json"

For more benchmarking options, see benchmark_serving.py.

Run bench.sh to begin a serving benchmark.

./bench.sh

Troubleshooting#

Because the Kimi K2 Thinking weights are larger than those of most models, host out-of-memory (OOM) issues can occur during weight loading, such as the following:

Loading weights: 100%|█████████████████████| 1408/1408 [03:43<00:00,  6.30it/s]
 0: [12/04/2025-18:38:28] [TRT-LLM] [RANK 0] [I] moe_load_balancer finalizing model...
 1: [nvl72136-T14:452151:0:452151] Caught signal 7 (Bus error: nonexistent physical address)
 1: ==== backtrace (tid: 452151) ====
 1:  0  /usr/local/ucx//lib/libucs.so.0(ucs_handle_error+0x2cc) [0xffff9638274c]
 1:  1  /usr/local/ucx//lib/libucs.so.0(+0x328fc) [0xffff963828fc]
 1:  2  /usr/local/ucx//lib/libucs.so.0(+0x32c78) [0xffff96382c78]

This can be addressed by increasing the shared memory available to the container, for example by mounting tmpfs:/dev/shm:size=640G when launching the Docker container (with plain docker run, passing --shm-size=640g has the same effect).