Inference Request

The main class used to describe requests to GptManager is InferenceRequest. It is structured as a map of tensors plus a `uint64_t` requestId. The mandatory tensors required to create a valid InferenceRequest object are described below (a usage sketch follows the table). Sampling config params are documented in the C++ GPT Runtime section, so their full descriptions are omitted from the tables here.

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `request_output_len` | [1, 1] | `int32_t` | Max number of output tokens |
| `input_ids` | [1, num_input_tokens] | `int32_t` | Tensor of input tokens |
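
As a rough illustration of the map-of-tensors layout described above, the sketch below fills in the two mandatory entries for a single request. The `NamedTensor` struct and the map are hypothetical stand-ins for whatever tensor representation your integration uses, not the TensorRT-LLM API; only the tensor names, shapes, and element types are taken from the table.

```cpp
// Minimal sketch of the mandatory InferenceRequest tensors.
// NamedTensor and the map layout are illustrative stand-ins, not the real
// TensorRT-LLM types; only the names, shapes and dtypes follow the table above.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct NamedTensor {                 // hypothetical tensor holder
    std::vector<int64_t> shape;      // e.g. {1, num_input_tokens}
    std::vector<int32_t> int32Data;  // int32_t payload, enough for this sketch
};

int main() {
    const uint64_t requestId = 42;   // unique id accompanying the tensor map

    std::vector<int32_t> inputIds = {1, 4543, 9876, 23};  // tokenized prompt

    std::map<std::string, NamedTensor> tensors;
    // Mandatory: maximum number of output tokens, shape [1, 1].
    tensors["request_output_len"] = {{1, 1}, {64}};
    // Mandatory: input token ids, shape [1, num_input_tokens].
    tensors["input_ids"] = {{1, static_cast<int64_t>(inputIds.size())}, inputIds};

    // A real integration would wrap (requestId, tensors) in an InferenceRequest
    // and pass it to GptManager through its request-fetching callback.
    (void)requestId;
    return 0;
}
```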

Optional tensors that can be supplied to InferenceRequest are shown below. Default values are specified where applicable (usage sketches follow the table):

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `streaming` | [1] | `bool` | (Default=`false`) When `true`, stream out tokens as they are generated; when `false`, return only when the full generation has completed |
| `beam_width` | [1] | `int32_t` | (Default=1) Beam width for this request; set to 1 for greedy sampling |
| `temperature` | [1] | `float` | Sampling Config param: `temperature` |
| `runtime_top_k` | [1] | `int32_t` | Sampling Config param: `topK` |
| `runtime_top_p` | [1] | `float` | Sampling Config param: `topP` |
| `len_penalty` | [1] | `float` | Sampling Config param: `lengthPenalty` |
| `early_stopping` | [1] | `int` | Sampling Config param: `earlyStopping` |
| `repetition_penalty` | [1] | `float` | Sampling Config param: `repetitionPenalty` |
| `min_length` | [1] | `int32_t` | Sampling Config param: `minLength` |
| `presence_penalty` | [1] | `float` | Sampling Config param: `presencePenalty` |
| `frequency_penalty` | [1] | `float` | Sampling Config param: `frequencyPenalty` |
| `random_seed` | [1] | `uint64_t` | Sampling Config param: `randomSeed` |
| `end_id` | [1] | `int32_t` | End token ID; defaults to -1 if not specified |
| `pad_id` | [1] | `int32_t` | Pad token ID |
| `embedding_bias` | [1] | `float` | Embedding bias |
| `bad_words_list` | [2, num_bad_words] | `int32_t` | Bad words list |
| `stop_words_list` | [2, num_stop_words] | `int32_t` | Stop words list |
| `prompt_embedding_table` | [1] | `float16` | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | `int32_t` | P-tuning prompt vocab size |
| `lora_task_id` | [1] | `uint64_t` | Task ID for the given `lora_weights`. This ID is expected to be globally unique. To perform inference with a specific LoRA for the first time, `lora_task_id`, `lora_weights`, and `lora_config` must all be given. The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`. If the cache is full, the oldest LoRA is evicted to make space for new ones. An error is returned if `lora_task_id` is not cached (see the LoRA sketch after this table) |
| `lora_weights` | [num_lora_modules_layers, D x Hi + Ho x D] | `float` (model data type) | Weights for a LoRA adapter. Refer to "Run gpt-2b + LoRA using GptManager / cpp runtime" for more information |
| `lora_config` | [num_lora_modules_layers, 3] | `int32_t` | LoRA configuration tensor; each row is `[module_id, layer_idx, adapter_size (D, a.k.a. R value)]`. Refer to "Run gpt-2b + LoRA using GptManager / cpp runtime" for more information |
| `return_log_probs` | [1] | `bool` | When `true`, include log probs in the output |
| `return_context_logits` | [1] | `bool` | When `true`, include context logits in the output |
| `return_generation_logits` | [1] | `bool` | When `true`, include generation logits in the output |
| `draft_input_ids` | [num_draft_tokens] | `int32_t` | Draft tokens to be leveraged in the generation phase to potentially generate multiple output tokens in one in-flight batching iteration |
| `draft_logits` | [num_draft_tokens, vocab_size] | `float` | Draft logits associated with `draft_input_ids`, to be leveraged in the generation phase to potentially generate multiple output tokens in one in-flight batching iteration |
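
Optional tensors follow the same pattern: they are simply additional entries in the same map, most with shape [1]. Continuing the hypothetical `NamedTensor` sketch from above (extended here with float and bool payload fields), the values chosen below are illustrative only; the names, shapes, defaults, and dtypes come from the table.

```cpp
// Sketch: optional tensors are extra entries in the same hypothetical map.
// Separate payload fields per dtype are purely for illustration; only the
// tensor names, shapes, defaults and dtypes come from the table above.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct NamedTensor {                  // hypothetical tensor holder
    std::vector<int64_t> shape;
    std::vector<int32_t> int32Data;
    std::vector<float>   floatData;
    std::vector<uint8_t> boolData;    // bool payload stored as bytes here
};

int main() {
    std::map<std::string, NamedTensor> tensors;  // mandatory entries assumed present

    tensors["streaming"]        = {{1}, {}, {}, {1}};     // stream tokens as generated
    tensors["beam_width"]       = {{1}, {4}, {}, {}};     // >1 enables beam search
    tensors["temperature"]      = {{1}, {}, {0.7f}, {}};  // Sampling Config: temperature
    tensors["runtime_top_k"]    = {{1}, {40}, {}, {}};    // Sampling Config: topK
    tensors["runtime_top_p"]    = {{1}, {}, {0.9f}, {}};  // Sampling Config: topP
    tensors["end_id"]           = {{1}, {2}, {}, {}};     // end token id (illustrative value)
    tensors["return_log_probs"] = {{1}, {}, {}, {1}};     // include log probs in output
    return 0;
}
```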
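The `lora_task_id`, `lora_weights`, and `lora_config` rows describe a caching protocol: the first request for a given adapter must carry all three tensors, while later requests can carry `lora_task_id` alone for as long as the adapter remains cached. The sketch below reuses the same hypothetical `NamedTensor` holder with illustrative sizes and only shows which keys each request would include.

```cpp
// Sketch of the LoRA caching flow described in the table: the first request
// for an adapter uploads weights + config; later requests send only the id.
// NamedTensor is the same hypothetical holder as above; payloads are omitted
// because only the set of keys per request matters here.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct NamedTensor {
    std::vector<int64_t> shape;   // payload intentionally omitted in this sketch
};

int main() {
    const uint64_t loraTaskId = 7;  // globally unique adapter id
    (void)loraTaskId;

    // First request using this adapter: id, weights and config all supplied.
    std::map<std::string, NamedTensor> firstRequest;
    firstRequest["lora_task_id"] = {{1}};
    firstRequest["lora_weights"] = {{24, 1024}};  // [num_lora_modules_layers, D x Hi + Ho x D] (illustrative sizes)
    firstRequest["lora_config"]  = {{24, 3}};     // rows of [module_id, layer_idx, adapter_size]

    // Subsequent requests for the same task: the cached adapter is reused,
    // so only lora_task_id is needed. If the adapter was evicted from the
    // cache, an error is returned and weights/config must be resent.
    std::map<std::string, NamedTensor> laterRequest;
    laterRequest["lora_task_id"] = {{1}};
    return 0;
}
```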