trtllm-serve
Running an OpenAI-compatible API server
MODEL: model name | HF checkpoint path | TensorRT engine path
trtllm-serve [OPTIONS] MODEL
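For example, a minimal launch serving a Hugging Face checkpoint might look like the following (the model name, host, and port are illustrative):
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --host 0.0.0.0 --port 8000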
Options
- --tokenizer <tokenizer>
Path or name of the tokenizer. Specify this value only when using a TensorRT engine as the model.
- --host <host>
Hostname of the server.
- --port <port>
Port of the server.
- --backend <backend>
Set to 'pytorch' to use the PyTorch backend; the default is the C++ backend (see the combined example after this list).
- Options:
pytorch
- --max_beam_width <max_beam_width>
Maximum number of beams for beam search decoding.
- --max_batch_size <max_batch_size>
Maximum number of requests that the engine can schedule.
- --max_num_tokens <max_num_tokens>
Maximum number of batched input tokens after padding is removed in each batch.
- --max_seq_len <max_seq_len>
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.
- --tp_size <tp_size>
Tensor parallelism size.
- --pp_size <pp_size>
Pipeline parallelism size.
- --kv_cache_free_gpu_memory_fraction <kv_cache_free_gpu_memory_fraction>
Fraction of free GPU memory reserved for the KV cache after model weights and buffers have been allocated.
- --trust_remote_code
Flag passed through to HF transformers; allows loading models and tokenizers that ship custom code with the checkpoint.
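As a rough sketch, several of these options can be combined in one launch; the values below are illustrative placeholders rather than tuned recommendations, and MODEL stands for a model name, HF checkpoint path, or TensorRT engine path as described above:
trtllm-serve MODEL --backend pytorch --tp_size 2 --max_batch_size 64 --max_num_tokens 8192 --max_seq_len 4096 --kv_cache_free_gpu_memory_fraction 0.9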
Arguments
- MODEL
Required argument.
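Because the server exposes an OpenAI-compatible API, it can be exercised with any OpenAI-style client. A minimal sketch with curl, assuming the standard /v1/chat/completions route and the host and port shown earlier (the model name and payload are illustrative):
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'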