trtllm-serve

Running an OpenAI API compatible server

MODEL: model name | HF checkpoint path | TensorRT engine path

trtllm-serve [OPTIONS] MODEL
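
For example, a minimal launch against a Hugging Face checkpoint might look like the following (the model name, host, and port are illustrative):

trtllm-serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000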

Options

--tokenizer <tokenizer>

Path or name of the tokenizer. Specify this value only when using a TensorRT engine as the model.

--host <host>

Hostname of the server.

--port <port>

Port of the server.

--max_beam_width <max_beam_width>

Maximum number of beams for beam search decoding.

--max_batch_size <max_batch_size>

Maximum number of requests that the engine can schedule.

--max_num_tokens <max_num_tokens>

Maximum number of batched input tokens after padding is removed in each batch.

--max_seq_len <max_seq_len>

Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.

--tp_size <tp_size>

Tensor parallelism size.

--pp_size <pp_size>

Pipeline parallelism size.

--kv_cache_free_gpu_memory_fraction <kv_cache_free_gpu_memory_fraction>

Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers.

--trust_remote_code

Flag passed through to Hugging Face transformers to allow loading models that contain custom code (trust_remote_code).
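
As a sketch of how several of the options above combine, a multi-GPU launch could be configured as follows (the checkpoint path and values are illustrative, not tuned recommendations):

trtllm-serve /path/to/hf_checkpoint \
    --tp_size 2 --pp_size 1 \
    --max_batch_size 64 --max_num_tokens 8192 \
    --kv_cache_free_gpu_memory_fraction 0.9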

Arguments

MODEL

Required argument
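
Because the server exposes an OpenAI-compatible API, an existing OpenAI client can be pointed at it. The sketch below assumes the server is reachable at localhost:8000 with the standard /v1 route prefix; the model name, base URL, and API key are placeholders, not values defined by this command.

from openai import OpenAI

# Point the standard OpenAI Python client at the locally running trtllm-serve
# instance. base_url and api_key are assumptions for a local, unauthenticated server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; use the MODEL you served
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=32,
)
print(response.choices[0].message.content)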