Inference Request

The main class used to describe requests to GptManager is InferenceRequest. It is structured as a map of tensors and a uint64_t requestId. The mandatory input tensors required to create a valid InferenceRequest object are described below; a construction sketch follows the table. Sampling config parameters are documented in the C++ GPT Runtime section, so only the corresponding SamplingConfig field is listed for them in the tables that follow.

| Name | Shape | Type | Description |
|------|-------|------|-------------|
| request_output_len | [1, 1] | int32_t | Maximum number of output tokens |
| input_ids | [1, num_input_tokens] | int32_t | Tensor of input tokens |
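
As a concrete illustration, the sketch below assembles the payload for a minimal request: a uint64_t requestId plus a map from tensor names to tensors holding the two mandatory inputs. The Tensor struct here is a hypothetical stand-in for the runtime's own tensor class, not the actual GptManager API, and the token values are arbitrary.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the runtime's tensor class: a shaped
// buffer of int32_t values (sufficient for both mandatory inputs).
struct Tensor
{
    std::vector<int64_t> shape;
    std::vector<int32_t> data;
};

int main()
{
    uint64_t const requestId = 42; // unique id chosen by the caller

    std::map<std::string, Tensor> inputTensors;

    // Mandatory: the prompt tokens, shaped [1, num_input_tokens].
    std::vector<int32_t> prompt{1, 7, 128, 2048};
    inputTensors["input_ids"] =
        Tensor{{1, static_cast<int64_t>(prompt.size())}, prompt};

    // Mandatory: the maximum number of tokens to generate, shaped [1, 1].
    inputTensors["request_output_len"] = Tensor{{1, 1}, {64}};

    // requestId together with inputTensors forms a valid request payload.
    (void) requestId;
    return 0;
}
```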

Optional tensors that can be supplied to InferenceRequest are shown below. Default values, where applicable, are specified:

| Name | Shape | Type | Description |
|------|-------|------|-------------|
| streaming | [1] | bool | (Default=false) When true, stream out tokens as they are generated. When false, return only once the full generation has completed. |
| beam_width | [1] | int32_t | (Default=1) Beam width for this request; set to 1 for greedy sampling. |
| temperature | [1] | float | Sampling Config param: temperature |
| runtime_top_k | [1] | int32_t | Sampling Config param: topK |
| runtime_top_p | [1] | float | Sampling Config param: topP |
| len_penalty | [1] | float | Sampling Config param: lengthPenalty |
| early_stopping | [1] | int32_t | Sampling Config param: earlyStopping |
| repetition_penalty | [1] | float | Sampling Config param: repetitionPenalty |
| min_length | [1] | int32_t | Sampling Config param: minLength |
| presence_penalty | [1] | float | Sampling Config param: presencePenalty |
| frequency_penalty | [1] | float | Sampling Config param: frequencyPenalty |
| no_repeat_ngram_size | [1] | int32_t | Sampling Config param: noRepeatNgramSize |
| random_seed | [1] | uint64_t | Sampling Config param: randomSeed |
| end_id | [1] | int32_t | End token id. If not specified, defaults to -1. |
| pad_id | [1] | int32_t | Pad token id |
| embedding_bias | [1, vocab_size] | float | The bias is added to the logits of each token in the vocabulary before decoding occurs. Positive values encourage sampling of a token, while negative values discourage it; a value of 0.f leaves the logit unchanged. |
| bad_words_list | [1, 2, num_bad_words] | int32_t | Bad words list. Consider an example with two bad words, where the first word contains tokens [5, 7, 3] and the second contains tokens [9, 2]. There are 5 tokens in total, so the tensor shape should be [1, 2, 5]. The first row of the tensor must contain the token ids, while the second row must store the inclusive-scan offsets of the word lengths (in number of tokens), padded with -1 to the total token count. Hence, the bad_words_list tensor would look like [[[5, 7, 3, 9, 2], [3, 5, -1, -1, -1]]]. A construction sketch follows this table. |
| stop_words_list | [1, 2, num_stop_words] | int32_t | Stop words list. See bad_words_list for the description of the expected tensor shape and content. |
| prompt_embedding_table | [1] | float16 | P-tuning prompt embedding table |
| prompt_vocab_size | [1] | int32_t | P-tuning prompt vocab size |
| lora_task_id | [1] | uint64_t | Task ID for the given lora_weights. This ID is expected to be globally unique. To perform inference with a specific LoRA for the first time, lora_task_id, lora_weights, and lora_config must all be given. The LoRA will be cached, so that subsequent requests for the same task only require lora_task_id. If the cache is full, the oldest LoRA is evicted to make space for new ones. An error is returned if lora_task_id is not cached. The usage pattern is sketched after this table. |
| lora_weights | [num_lora_modules_layers, D x Hi + Ho x D] | float (model data type) | Weights for a LoRA adapter. Refer to Run gpt-2b + LoRA using GptManager / cpp runtime for more information. |
| lora_config | [num_lora_modules_layers, 3] | int32_t | LoRA configuration tensor: [module_id, layer_idx, adapter_size (D, a.k.a. the R value)]. Refer to Run gpt-2b + LoRA using GptManager / cpp runtime for more information. |
| return_log_probs | [1] | bool | When true, include log probs in the output |
| return_context_logits | [1] | bool | When true, include context logits in the output |
| return_generation_logits | [1] | bool | When true, include generation logits in the output |
| draft_input_ids | [num_draft_tokens] | int32_t | Draft tokens to be leveraged in the generation phase to potentially generate multiple output tokens in one inflight batching iteration |
| draft_logits | [num_draft_tokens, vocab_size] | float | Draft logits associated with draft_input_ids, to be leveraged in the generation phase to potentially generate multiple output tokens in one inflight batching iteration |
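
The inclusive-scan layout used by bad_words_list (and stop_words_list) can be built mechanically. The helper below is a hypothetical illustration of that layout, reproducing the [5, 7, 3] / [9, 2] example from the table; makeBadWordsList is not part of the API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: flatten a list of words into the [1, 2, num_tokens]
// layout. Row 0 holds the concatenated token ids; row 1 holds the
// inclusive-scan offsets of the word lengths, padded with -1.
std::vector<int32_t> makeBadWordsList(std::vector<std::vector<int32_t>> const& words)
{
    std::vector<int32_t> tokens;
    std::vector<int32_t> offsets;
    for (auto const& word : words)
    {
        tokens.insert(tokens.end(), word.begin(), word.end());
        offsets.push_back(static_cast<int32_t>(tokens.size())); // inclusive scan
    }
    offsets.resize(tokens.size(), -1);                           // pad with -1
    tokens.insert(tokens.end(), offsets.begin(), offsets.end()); // stack the two rows
    return tokens; // interpret as shape [1, 2, num_tokens]
}

int main()
{
    // The example from the table: words [5, 7, 3] and [9, 2] flatten to
    // [5, 7, 3, 9, 2, 3, 5, -1, -1, -1], i.e. [[[5,7,3,9,2],[3,5,-1,-1,-1]]].
    auto const flat = makeBadWordsList({{5, 7, 3}, {9, 2}});
    (void) flat;
    return 0;
}
```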

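The lora_task_id caching contract described above implies a two-step usage pattern: ship the adapter once, then reference it by id. The sketch below only illustrates that pattern; the buffer sizes and config values are illustrative assumptions, not real adapter data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

int main()
{
    uint64_t const loraTaskId = 7; // globally unique per adapter

    // Hypothetical flattened buffers; shapes follow the table above:
    //   lora_weights: [num_lora_modules_layers, D * Hi + Ho * D]
    //   lora_config:  [num_lora_modules_layers, 3]
    std::size_t const numModulesLayers = 2; // illustrative
    std::size_t const rowSize = 1024;       // illustrative D * Hi + Ho * D
    std::vector<float> loraWeights(numModulesLayers * rowSize, 0.f);
    std::vector<int32_t> loraConfig{
        /* module_id */ 0, /* layer_idx */ 0, /* adapter_size (R) */ 8,
        /* module_id */ 1, /* layer_idx */ 0, /* adapter_size (R) */ 8};

    // Request 1 (first use of this adapter): lora_task_id, lora_weights and
    // lora_config must all be supplied so the adapter can be cached.

    // Requests 2..N: while the adapter remains cached, lora_task_id alone is
    // enough. If it was evicted (the cache drops the oldest adapter when
    // full), the request fails with an error and the weights must be resent.

    (void) loraTaskId; (void) loraWeights; (void) loraConfig;
    return 0;
}
```
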
Responses

Responses from GptManager are formatted as a list of tensors. The table below shows the set of output tensors returned by GptManager (via the SendResponseCallback); a minimal handler sketch follows the table:

| Name | Shape | Type | Description |
|------|-------|------|-------------|
| output_ids | [beam_width, num_output_tokens] | int32_t | Tensor of output tokens. When streaming is enabled, this holds a single token. |
| sequence_length | [beam_width] | int32_t | Number of output tokens. When streaming is enabled, this will be 1. |
| output_log_probs | [1, beam_width, num_output_tokens] | float | Only if return_log_probs is set on input. Tensor of log probabilities of the output tokens. |
| cum_log_probs | [1, beam_width] | float | Only if return_log_probs is set on input. Cumulative log probability of the generated sequence. |
| context_logits | [1, num_input_tokens, vocab_size] | float | Only if return_context_logits is set on input. Tensor of input (context) token logits. |
| generation_logits | [1, beam_width, num_output_tokens, vocab_size] | float | Only if return_generation_logits is set on input. Tensor of output (generation) token logits. |
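
As an illustration of consuming these tensors, the sketch below decodes output_ids and sequence_length from one response. NamedTensor, the map-of-tensors shape, and onResponse are hypothetical stand-ins; the real callback delivers the runtime's own tensor type.

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a named output tensor; the real callback
// receives the runtime's own tensor type.
struct NamedTensor
{
    std::vector<int64_t> shape;
    std::vector<int32_t> data;
};

// Sketch of a response handler: extract output_ids [beam_width,
// num_output_tokens] and sequence_length [beam_width], then print each
// beam's valid tokens. With streaming, each response carries one token.
void onResponse(uint64_t requestId, std::map<std::string, NamedTensor> const& outputs)
{
    auto const& ids = outputs.at("output_ids");
    auto const& lengths = outputs.at("sequence_length");

    int64_t const beamWidth = ids.shape[0];
    int64_t const maxTokens = ids.shape[1];
    for (int64_t b = 0; b < beamWidth; ++b)
    {
        std::cout << "request " << requestId << ", beam " << b << ":";
        for (int32_t t = 0; t < lengths.data[b]; ++t)
        {
            std::cout << ' ' << ids.data[b * maxTokens + t];
        }
        std::cout << '\n';
    }
}

int main()
{
    std::map<std::string, NamedTensor> outputs;
    outputs["output_ids"] = {{2, 3}, {11, 12, 13, 21, 22, 0}}; // beam_width = 2
    outputs["sequence_length"] = {{2}, {3, 2}};                // valid lengths per beam
    onResponse(42, outputs);
    return 0;
}
```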