Decoding Inference Context#

struct BatchResult#

Batch result data for a single sequence.

Encapsulates all data needed to track a batch’s execution results, whether the sequence is still active or has already been evicted.

Public Members

std::vector<int32_t> tokenIds#

Generated token IDs.

std::vector<int32_t> rawBatchedInputIds#

Original input token IDs.

int32_t generateLength = {0}#

Number of tokens generated.

int32_t actualIterations = {0}#

Number of iterations executed.

int32_t effectivePrefillLength = {0}#

Effective prefill length after system prompt cache reuse.

FinishReason terminalReason{FinishReason::kNotFinished}#

Why this batch terminated (EOS, length, stop string, cancel, or error).

struct DecodingInferenceContext#

Per-request execution context shared by runtime and decoding strategies.

Holds request-local sequence metadata, sampling parameters, multimodal embedding references, streaming state, and batch-eviction bookkeeping.

Public Functions

void initialize(
    int32_t batchSize,
    int32_t maxGenLength,
    rt::OptionalInputTensor const &visual,
    rt::OptionalInputTensors const &deepstackFeatures,
    std::string const &loraName,
    cudaStream_t cudaStream
)#

Initialize request-local vectors and scalar fields.

Parameters:
  • batchSize – Active batch size

  • maxGenLength – Maximum generation length

  • visual – Optional visual embeddings

  • deepstackFeatures – Deepstack features for Qwen3-VL

  • loraName – LoRA weights name used by this request

  • cudaStream – CUDA stream for operations
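The following is a minimal sketch of what an `initialize`-style reset could look like, not the library's actual implementation. `ContextSketch` and `StreamHandle` are hypothetical stand-ins (the latter replaces `cudaStream_t` so the example builds without CUDA), the `visual` and `deepstackFeatures` arguments are omitted, and both the identity index mapping and the zeroed counters are assumptions about the initial state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical stand-in for cudaStream_t so the sketch builds without CUDA.
using StreamHandle = void*;

// Reduced, hypothetical mirror of DecodingInferenceContext.
struct ContextSketch {
    std::vector<std::vector<int32_t>> tokenIds;
    std::vector<int32_t> currentGenerateLengths;
    std::vector<int32_t> effectivePrefillLengths;
    std::vector<int8_t> finishedStates;
    std::vector<int32_t> batchIndexMapping;
    int32_t generationRound{};
    int32_t maxGenerateLength{};
    int32_t activeBatchSize{};
    std::string loraWeightsName{};
    StreamHandle stream{};

    // Assumed behaviour: size every per-sequence vector to batchSize and
    // start with an identity mapping from current index to original index.
    void initialize(int32_t batchSize, int32_t maxGenLength,
                    std::string const& loraName, StreamHandle cudaStream) {
        auto const n = static_cast<std::size_t>(batchSize);
        tokenIds.assign(n, std::vector<int32_t>{});
        currentGenerateLengths.assign(n, 0);
        effectivePrefillLengths.assign(n, 0);
        finishedStates.assign(n, 0);
        batchIndexMapping.resize(n);
        std::iota(batchIndexMapping.begin(), batchIndexMapping.end(), 0);
        generationRound = 0;
        maxGenerateLength = maxGenLength;
        activeBatchSize = batchSize;
        loraWeightsName = loraName;
        stream = cudaStream;
    }
};
```

Calling `initialize(3, 128, "adapter-a", nullptr)` on this sketch leaves three empty sequences, none marked finished, with `batchIndexMapping` equal to `{0, 1, 2}`.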

Public Members

std::vector<std::string> systemPrompts#

System prompts for each sequence in the batch.

std::vector<std::vector<int32_t>> rawBatchedInputIds#

Original token IDs before preprocessing.

std::vector<std::vector<int32_t>> tokenIds#

Token IDs for each sequence: [batch_size][seq_length].

std::vector<int32_t> currentGenerateLengths#

Current generation length for each sequence.

std::vector<int32_t> effectivePrefillLengths#

Prefill length after system prompt cache reuse.

std::vector<int8_t> finishedStates#

Finished state for each sequence.

std::unordered_map<int32_t, BatchResult> completedBatches#

Results of completed batches.

std::vector<int32_t> batchIndexMapping#

Maps current batch index to original index.
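To illustrate how `finishedStates`, `completedBatches`, and `batchIndexMapping` might cooperate during batch eviction, here is a hedged sketch. `BatchResultSketch`, `ContextSketch`, and `evictFinished` are hypothetical names, `kEndOfSequence` is an invented enumerator (only `kNotFinished` appears in this reference), and keying `completedBatches` by the original batch index is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Only kNotFinished is named in this reference; kEndOfSequence is hypothetical.
enum class FinishReason { kNotFinished, kEndOfSequence };

// Reduced, hypothetical mirror of BatchResult.
struct BatchResultSketch {
    std::vector<int32_t> tokenIds;
    int32_t generateLength{0};
    FinishReason terminalReason{FinishReason::kNotFinished};
};

// Reduced, hypothetical mirror of DecodingInferenceContext.
struct ContextSketch {
    std::vector<std::vector<int32_t>> tokenIds;
    std::vector<int32_t> currentGenerateLengths;
    std::vector<int8_t> finishedStates;
    std::vector<int32_t> batchIndexMapping;
    std::unordered_map<int32_t, BatchResultSketch> completedBatches;
    int32_t activeBatchSize{};
};

// Move finished sequences into completedBatches (keyed by original index,
// an assumption) and compact the survivors to the front of each vector.
inline void evictFinished(ContextSketch& ctx, FinishReason reason) {
    std::size_t write = 0;
    for (std::size_t read = 0; read < ctx.tokenIds.size(); ++read) {
        if (ctx.finishedStates[read] != 0) {
            BatchResultSketch result;
            result.tokenIds = std::move(ctx.tokenIds[read]);
            result.generateLength = ctx.currentGenerateLengths[read];
            result.terminalReason = reason;
            ctx.completedBatches[ctx.batchIndexMapping[read]] = std::move(result);
            continue;
        }
        if (write != read) {
            ctx.tokenIds[write] = std::move(ctx.tokenIds[read]);
            ctx.currentGenerateLengths[write] = ctx.currentGenerateLengths[read];
            ctx.finishedStates[write] = ctx.finishedStates[read];
            ctx.batchIndexMapping[write] = ctx.batchIndexMapping[read];
        }
        ++write;
    }
    ctx.tokenIds.resize(write);
    ctx.currentGenerateLengths.resize(write);
    ctx.finishedStates.resize(write);
    ctx.batchIndexMapping.resize(write);
    ctx.activeBatchSize = static_cast<int32_t>(write);
}
```

With three sequences where only the middle one is finished, the sketch shrinks the active batch to two, stores the evicted result under key `1`, and leaves `batchIndexMapping` as `{0, 2}` so that current slot 1 still resolves to original sequence 2.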

std::vector<SlotStreamState> slotStreams#

Per-slot streaming state.

rt::OptionalInputTensor visualEmbeddings#

Optional visual embeddings.

rt::OptionalInputTensor audioEmbeddings#

Optional audio embeddings.

rt::OptionalInputTensors deepstackFeatures#

Optional Deepstack features.

int32_t generationRound = {}#

Current generation round.

int32_t maxGenerateLength = {}#

Maximum generation length.

int32_t activeBatchSize = {}#

Current active batch size.

std::string loraWeightsName = {""}#

LoRA adapter name used by this request.

cudaStream_t stream = {}#

CUDA stream.

float temperature = {1.0f}#

Temperature for sampling.

float topP = {1.0f}#

Top-P sampling parameter.

int64_t topK = {0}#

Top-K sampling parameter.
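As a rough CPU-side illustration of how `temperature`, `topP`, and `topK` are commonly combined, the sketch below filters a logit vector into a renormalized distribution. `filteredDistribution` is a hypothetical helper, not part of this API. It assumes `temperature > 0`, treats `topK == 0` as "no top-K limit", and treats `topP == 1.0` as "no nucleus cutoff"; the actual runtime may apply these parameters differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: temperature-scaled softmax followed by top-K and
// top-P (nucleus) filtering. Tokens that are filtered out get probability 0.
inline std::vector<float> filteredDistribution(std::vector<float> const& logits,
                                               float temperature, float topP,
                                               int64_t topK) {
    std::size_t const n = logits.size();
    if (n == 0) {
        return {};
    }

    // Numerically stable softmax over logits / temperature.
    float const maxLogit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(n);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        probs[i] = std::exp((logits[i] - maxLogit) / temperature);
        sum += probs[i];
    }
    for (float& p : probs) {
        p /= sum;
    }

    // Token indices ordered by descending probability.
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&probs](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });

    // Top-K: keep at most topK candidates (0 means unlimited, by assumption).
    std::size_t keep = n;
    if (topK > 0) {
        keep = std::min(keep, static_cast<std::size_t>(topK));
    }

    // Top-P: keep the shortest prefix whose cumulative mass reaches topP.
    float cumulative = 0.0f;
    for (std::size_t i = 0; i < keep; ++i) {
        cumulative += probs[order[i]];
        if (cumulative >= topP) {
            keep = i + 1;
            break;
        }
    }

    // Renormalize the surviving candidates.
    float kept = 0.0f;
    for (std::size_t i = 0; i < keep; ++i) {
        kept += probs[order[i]];
    }
    std::vector<float> out(n, 0.0f);
    for (std::size_t i = 0; i < keep; ++i) {
        out[order[i]] = probs[order[i]] / kept;
    }
    return out;
}
```

A sampler would then draw a token index from the returned distribution. Under these assumptions, the documented defaults (`temperature = 1.0`, `topP = 1.0`, `topK = 0`) leave the plain softmax distribution unchanged.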

std::vector<std::vector<std::string>> stopStringsPerSlot#

Stop strings for each slot.

bool outputThinkerEmbeddings = {false}#

Whether to capture hidden states for the Talker pipeline.

std::optional<TokenCallback> onTokenGenerated#

Optional per-token callback invoked after each accepted token update.
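The sketch below shows one way an optional per-token callback could be invoked after a token is accepted. The signature of `TokenCallback` is not given in this reference, so `TokenCallbackSketch` (taking a slot index and a token ID) is purely an assumption, as are `ContextSketch` and `acceptToken`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Hypothetical callback shape: (slot index, accepted token ID).
using TokenCallbackSketch = std::function<void(int32_t, int32_t)>;

// Reduced, hypothetical mirror of DecodingInferenceContext.
struct ContextSketch {
    std::vector<std::vector<int32_t>> tokenIds;
    std::vector<int32_t> currentGenerateLengths;
    std::optional<TokenCallbackSketch> onTokenGenerated;
};

// Record an accepted token for one slot, then notify the callback if one
// has been installed and is non-empty.
inline void acceptToken(ContextSketch& ctx, int32_t slot, int32_t tokenId) {
    auto const s = static_cast<std::size_t>(slot);
    ctx.tokenIds[s].push_back(tokenId);
    ++ctx.currentGenerateLengths[s];
    if (ctx.onTokenGenerated && *ctx.onTokenGenerated) {
        (*ctx.onTokenGenerated)(slot, tokenId);
    }
}
```

Leaving `onTokenGenerated` empty skips the notification entirely, so callers that do not stream pay only for the `std::optional` check.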