### Inference Request

The main class used to describe requests to `GptManager` is `InferenceRequest`. It is structured as a map of named tensors plus a `uint64_t` `requestId`.
The mandatory input tensors required to create a valid `InferenceRequest` object are described below. Sampling config params are documented in the C++ GPT Runtime section; their descriptions are therefore omitted from the tables.

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| `request_output_len` | [1, 1] | `int32_t` | Max number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
Optional tensors that can be supplied to `InferenceRequest` are shown below. Default values are specified where applicable:

| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| `streaming` | [1] | `bool` | (Default=`false`) When `true`, stream out tokens as they are generated; when `false`, return only once the full generation has completed |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `early_stopping` | [1] | `int32_t` | Sampling Config param: `earlyStopping` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `beam_search_diversity_rate` | [1] | `float` | Sampling Config param: `beamSearchDiversityRate` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token ID. If not specified, defaults to -1 |
| `pad_id` | [1] | `int32_t` | Pad token ID |
| `embedding_bias` | [1, vocab_size] | `float` | The bias is added to the logits for each token in the vocabulary before decoding occurs. Positive values in the bias encourage the sampling of tokens, while negative values discourage it. A value of `-inf` prevents a token from being sampled at all |
| `bad_words_list` | [1, 2, num_bad_words] | `int32_t` | Bad words list. Consider an example with two bad words, where the first word contains tokens [5, 7, 3] and the second word contains tokens [9, 2]. The first row of the inner 2 x num_bad_words matrix holds the concatenated token IDs [5, 7, 3, 9, 2]; the second row holds the inclusive prefix sums of the word lengths [3, 5], padded with -1 to the same length |
| `stop_words_list` | [1, 2, num_stop_words] | `int32_t` | Stop words list. See `bad_words_list` for a description of the format |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_task_id` | [1] | `uint64_t` | Task ID for the given `lora_weights`. This ID is expected to be globally unique. To perform inference with a specific LoRA for the first time, `lora_task_id`, `lora_weights`, and `lora_config` must all be given. The LoRA is then cached, so subsequent requests for the same task need only `lora_task_id` |
| `lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | model data type | Weights for a LoRA adapter. Refer to Run gpt-2b + LoRA using GptManager / cpp runtime for more information |
| `lora_config` | [num_lora_modules_layers, 3] | `int32_t` | LoRA configuration tensor |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially generate multiple output tokens in one in-flight batching iteration |
| `draft_logits` | [num_draft_tokens, vocab_size] | `float` | Draft logits associated with `draft_input_ids` |
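The flattened bad-words layout is easy to get wrong, so here is a small sketch of an encoder for it, assuming the concatenated-tokens/inclusive-prefix-sum format described in the table. `encodeBadWords` is a hypothetical helper, not a library function:

```cpp
#include <cstdint>
#include <vector>

// Flatten a list of "bad words" (each a sequence of token IDs) into the
// 2 x num_bad_words layout described above: row 0 holds all token IDs
// concatenated, row 1 holds the inclusive prefix sums of word lengths,
// padded with -1. The leading [1, ...] batch dimension is left implicit.
std::vector<std::vector<int32_t>> encodeBadWords(
    std::vector<std::vector<int32_t>> const& words)
{
    std::vector<int32_t> tokens;
    std::vector<int32_t> offsets;
    for (auto const& word : words)
    {
        tokens.insert(tokens.end(), word.begin(), word.end());
        offsets.push_back(static_cast<int32_t>(tokens.size()));
    }
    offsets.resize(tokens.size(), -1); // pad row 1 with -1
    return {tokens, offsets};
}

// Example from the table: words {5, 7, 3} and {9, 2} encode as
// row 0 = [5, 7, 3, 9, 2] and row 1 = [3, 5, -1, -1, -1].
```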
### Responses

Responses from `GptManager` are formatted as a list of tensors. The table below shows the set of output tensors returned by `GptManager` (via the `SendResponseCallback`):
| Name | Shape | Type | Description |
| --- | --- | --- | --- |
| `output_ids` | [beam_width, num_output_tokens] | `int32_t` | Tensor of output tokens. When `streaming` is enabled, each response carries only the newly generated token(s) |
| `sequence_length` | [beam_width] | `int32_t` | Number of output tokens. When `streaming` is enabled, this reflects only the tokens in the current response |
| `output_log_probs` | [1, beam_width, num_output_tokens] | `float` | Only if `return_log_probs` is set in the request: log probabilities of the generated tokens |
| `cum_log_probs` | [1, beam_width] | `float` | Only if `return_log_probs` is set in the request: cumulative log probability of each beam |
| `context_logits` | [1, num_input_tokens, vocab_size] | `float` | Only if `return_context_logits` is set in the request: logits for the input (context) tokens |
| `generation_logits` | [1, beam_width, num_output_tokens, vocab_size] | `float` | Only if `return_generation_logits` is set in the request: logits for the generated tokens |
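To show how these responses are typically consumed, here is a sketch of a `SendResponseCallback` implementation. The callback signature shown (request ID, named tensor list, final-response flag, error message) is an assumption for this sketch, and `NamedTensor` is again a simplified stand-in; check the callback typedefs shipped with your TensorRT-LLM version for the exact types:

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <list>
#include <string>

// Hypothetical stand-in for the library's named-tensor type, which wraps
// a runtime tensor together with its name.
struct NamedTensor
{
    std::string name;
    // tensor payload omitted for brevity
};

// Assumed callback shape: request ID, response tensors, a flag marking
// the final response for this request, and an error message (empty on
// success).
using SendResponseCallback = std::function<void(
    uint64_t, std::list<NamedTensor> const&, bool, std::string const&)>;

// A sketch of a callback that logs progress. With `streaming` enabled it
// fires once per generated token; otherwise once per finished request.
SendResponseCallback makeLoggingCallback()
{
    return [](uint64_t requestId, std::list<NamedTensor> const& tensors,
              bool isFinal, std::string const& errMsg)
    {
        if (!errMsg.empty())
        {
            std::cerr << "request " << requestId << " failed: " << errMsg
                      << "\n";
            return;
        }
        for (auto const& t : tensors)
        {
            // Look up "output_ids" / "sequence_length" here and forward
            // the new tokens to the client.
            (void)t;
        }
        if (isFinal)
        {
            std::cout << "request " << requestId << " completed\n";
        }
    };
}
```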