trtllm-serve
Running an OpenAI API compatible server
MODEL: model name | HF checkpoint path | TensorRT engine path
trtllm-serve [OPTIONS] MODEL
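For example, a minimal launch serving a Hugging Face checkpoint with default settings (the model name below is illustrative; any HF checkpoint path or TensorRT engine path can be used as MODEL):

```bash
# Minimal sketch: serve an HF model on the default host and port.
# The model name is an example; substitute your own checkpoint or engine path.
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0
```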
Options
- --tokenizer <tokenizer>
Path or name of the tokenizer. Specify this value only when using a TensorRT engine as the model.
- --host <host>
Hostname of the server.
- --port <port>
Port of the server.
- --max_beam_width <max_beam_width>
Maximum number of beams for beam search decoding.
- --max_batch_size <max_batch_size>
Maximum number of requests that the engine can schedule.
- --max_num_tokens <max_num_tokens>
Maximum number of batched input tokens after padding is removed in each batch.
- --max_seq_len <max_seq_len>
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.
- --tp_size <tp_size>
Tensor parallelism size.
- --pp_size <pp_size>
Pipeline parallelism size.
- --kv_cache_free_gpu_memory_fraction <kv_cache_free_gpu_memory_fraction>
Fraction of free GPU memory (remaining after model weights and buffers are allocated) to reserve for the KV cache.
- --trust_remote_code
Flag passed to HF transformers; allows the model and tokenizer to execute custom code from the checkpoint repository.
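As a sketch of how these options combine, the following launches a server across two GPUs with tensor parallelism, caps scheduling limits, and reserves most of the remaining free GPU memory for the KV cache. The checkpoint path and host/port values are illustrative; all flags are those documented above.

```bash
# Example: 2-way tensor parallelism, bounded batch size and token budget,
# and 90% of remaining free GPU memory reserved for KV cache.
trtllm-serve /path/to/hf_checkpoint \
  --host 0.0.0.0 \
  --port 8000 \
  --tp_size 2 \
  --max_batch_size 64 \
  --max_num_tokens 8192 \
  --kv_cache_free_gpu_memory_fraction 0.9
```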
Arguments
- MODEL
Required argument
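Because the server exposes an OpenAI-compatible API, any OpenAI-style client can query it. A minimal sketch with curl, assuming the default host and port and the standard OpenAI chat-completions route (the model name matches the illustrative launch example above):

```bash
# Send a standard OpenAI chat-completions request to the running server.
# Host, port, endpoint path, and model name are assumptions for illustration.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```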