Qwen3 Omni TTS Runtime

class Qwen3OmniTTSRuntime

Talker runtime for Qwen3-Omni RVQ code generation.

LLM-based codec encoder that generates RVQ codes from text tokens and hidden states. Manages two LLM engines (Talker + CodePredictor) and MLP projection layers.

Pipeline:

  1. MLP Projection: thinker embed (layer 0) → talker embeddings via text_projection

  2. Talker LLM: generate codec tokens autoregressively

  3. CodePredictor: generate multi-layer codebook codes (Omni: 15, TTS: 31)

  4. Return RVQ codes (vocoding done separately at example layer)

Architecture Philosophy:

  • Talker is an LLM decoder, NOT a multimodal input encoder

  • Similar to LLMInferenceSpecDecodeRuntime, manages multiple LLM engines

  • Standalone runtime, not dependent on MultimodalRunner hierarchy

  • Code2Wav vocoding is separated for better modularity

Public Types

using AudioChunkCallback = std::function<void(std::vector<std::vector<int32_t>> const &chunkRvqCodes)>

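Callback invoked with a chunk of RVQ codes when it is ready. It is part of the Thinker→Talker streaming pipeline configuration (see ThinkerTalkerStreamingConfig::onAudioChunkReady).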

Public Functions

Qwen3OmniTTSRuntime(
std::string const &talkerEngineDir,
std::string const &codePredictorEngineDir,
std::string const &tokenizerDir,
cudaStream_t stream
)

Construct and fully initialize the TTS runtime.

Parameters:
  • talkerEngineDir – Directory containing talker engine, MLP weights, embedding table, etc.

  • codePredictorEngineDir – Directory containing code_predictor engine and codec embeddings

  • tokenizerDir – Directory containing tokenizer files. If empty, defaults to talkerEngineDir/../

  • stream – CUDA stream for operations

Throws:

std::runtime_error – on any initialization failure

~Qwen3OmniTTSRuntime()

Destructor.

inline std::vector<int32_t> getThinkerHiddenLayerIndices() const

Get required hidden state layer indices from thinker.

Returns:

Vector containing 0 (layer 0, the embedding output) and 14 (accept_hidden_layer)

bool handleAudioGeneration(
std::vector<TalkerGenerationRequest> const &requests,
TalkerGenerationResponse &response,
cudaStream_t stream
)

Generate audio with RVQ codes (batched).

Implements the complete nested generation loop for a batch of requests:

  • Talker generation loop (autoregressive, batched engine execution)

  • CodePredictor generation (mNumRvqLayers per Talker step, per-batch)

  • Residual connections

  • Sampling at Runtime Layer (batched)

This is the main entry point for audio generation, analogous to LLMInferenceSpecDecodeRuntime::handleRequest() for standard LLM inference.

Note

Sampling parameters (temperature, topK, topP, repetitionPenalty) are taken from requests[0] and applied uniformly to every request in the batch. This matches the design of LLMInferenceSpecDecodeRuntime, where SamplingParams is shared across the batch.

Parameters:
  • requests – Batch of requests, each containing per-batch input data

  • response – Response containing generated RVQ codes [batchSize][frames][codes]

  • stream – CUDA stream for execution

Returns:

True if generation succeeded, false otherwise
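Because only the sampling fields of requests[0] are applied, a caller may want to confirm that a batch does not carry conflicting values before submitting it. The following is a minimal sketch of such a check; `SamplingFields` and `samplingIsUniform` are hypothetical stand-ins that mirror the documented field names and are not part of the runtime.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in holding only the sampling fields documented for
// TalkerGenerationRequest; the real request type has more members.
struct SamplingFields
{
    float talkerTemperature{0.0f};
    int32_t talkerTopK{0};
    float talkerTopP{0.0f};
    float repetitionPenalty{1.05f};
};

// Returns true when every entry carries the same sampling fields as the first
// entry, i.e. when nothing would be silently ignored by a batch-wide policy.
// An empty batch is treated as uniform.
inline bool samplingIsUniform(std::vector<SamplingFields> const& requests)
{
    for (auto const& r : requests)
    {
        auto const& first = requests.front();
        if (r.talkerTemperature != first.talkerTemperature || r.talkerTopK != first.talkerTopK
            || r.talkerTopP != first.talkerTopP || r.repetitionPenalty != first.repetitionPenalty)
        {
            return false;
        }
    }
    return true;
}
```

If the check fails, one option is to split the batch by sampling configuration and submit each group separately.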

inline bool handleAudioGeneration(
TalkerGenerationRequest const &request,
TalkerGenerationResponse &response,
cudaStream_t stream
)

Convenience wrapper for single-request audio generation.

bool handleAudioGenerationFromThinker(
std::vector<OmniGenerationRequest> const &requests,
TalkerGenerationResponse &response,
cudaStream_t stream
)

Generate audio from external Thinker hidden states (Omni inference path, batched).

Instead of tokenizing text and looking up embeddings internally (TTS path), this API accepts pre-computed Thinker layer-0 hidden states and projects them through the MLP to produce Talker input. Used when integrating with llm_inference.

Note

Sampling parameters (temperature, topK, topP, repetitionPenalty) are taken from requests[0] and applied uniformly to every request in the batch. This matches the design of LLMInferenceSpecDecodeRuntime, where SamplingParams is shared across the batch.

Parameters:
  • requests – Batch of requests, each containing per-batch thinker embeddings

  • response – Response containing generated RVQ codes [batchSize][frames][codes]

  • stream – CUDA stream for execution

Returns:

True if generation succeeded, false otherwise

inline bool handleAudioGenerationFromThinker(
OmniGenerationRequest const &request,
TalkerGenerationResponse &response,
cudaStream_t stream
)

Convenience wrapper for single-request Omni audio generation.

bool handleStreamingGeneration(
LLMInferenceSpecDecodeRuntime &thinkerRuntime,
LLMGenerationRequest &thinkerRequest,
LLMGenerationResponse &thinkerResponse,
ThinkerTalkerStreamingConfig const &streamingConfig,
OmniGenerationRequest const &omniBaseRequest,
TalkerGenerationResponse &talkerResponse,
cudaStream_t stream
)

Streaming generation: Thinker and Talker run interleaved on the same CUDA stream.

Uses LLMGenerationRequest::onTokenGenerated to receive per-token callbacks from the Thinker’s decode loop. When enough assistant tokens accumulate, Talker prefill is triggered. Subsequent Thinker tokens incrementally extend trailing_text_hidden, and Talker decode steps are interleaved.

Parameters:
  • thinkerRuntime – Thinker LLM runtime (will call handleRequest internally)

  • thinkerRequest – Thinker request (onTokenGenerated will be overwritten)

  • streamingConfig – Pipeline tuning parameters

  • talkerResponse – Output: generated RVQ codes

  • stream – CUDA stream (shared by Thinker and Talker)

Returns:

True if the full pipeline succeeded
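The threshold behaviour described above (prefill once enough assistant tokens have accumulated, later tokens extending the trailing text) can be pictured as a small piece of per-token bookkeeping. This is only an illustrative sketch: `StreamingGate` and its members are hypothetical names, and the runtime's internal state is not documented here.

```cpp
#include <cstdint>

// Hypothetical bookkeeping driven by a per-token callback; not the runtime's
// actual state machine.
struct StreamingGate
{
    int32_t talkerPrefillThreshold{4}; // mirrors the documented config default
    int32_t assistantTokens{0};        // tokens seen so far
    int32_t trailingTokens{0};         // tokens that arrived after prefill
    bool prefillDone{false};

    // Call once per Thinker token. Returns true exactly once, on the token
    // that reaches the threshold and would trigger Talker prefill.
    bool onToken()
    {
        ++assistantTokens;
        if (!prefillDone)
        {
            if (assistantTokens >= talkerPrefillThreshold)
            {
                prefillDone = true;
                return true;
            }
            return false;
        }
        // After prefill, each new token would extend the trailing text input.
        ++trailingTokens;
        return false;
    }
};
```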

inline metrics::MultimodalMetrics const &getMetrics() const

Get performance metrics for the Talker pipeline (legacy, kept for backward compatibility)

Returns:

Reference to metrics object

inline metrics::OmniTalkerMetrics const &getOmniTalkerMetrics() const

Get Omni-specific Talker metrics (frames, RVQ codes, prefill time, exit reason)

inline metrics::OmniLatencyMetrics const &getOmniLatencyMetrics() const

Get Omni audio latency metrics (TTFA, RTF, E2E)

inline metrics::OmniLatencyMetrics &getMutableOmniLatencyMetrics()

Get mutable reference to latency metrics (for E2E timing set at example layer)

inline cudaEvent_t getTtfaEndEvent() const

Get the TTFA end event (first codec token sampled) for external timing.

bool captureDecodingCUDAGraph(cudaStream_t stream)

Capture CUDA graphs for decoding steps (same pattern as LLMInferenceSpecDecodeRuntime).

Parameters:

stream – CUDA stream for capture

Returns:

True if all graphs captured successfully

int32_t getSpeakerIdByName(std::string const &speakerName) const

Get speaker ID by name.

Parameters:

speakerName – Speaker name (e.g., “f245”, “m02”)

Returns:

Speaker ID, or default speaker ID if not found
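Combined with the request fields documented below (speakerId overrides speakerName when it is >= 0, and an empty name means the default speaker), speaker selection can be sketched as follows. `SpeakerTable` and `resolveSpeaker` are hypothetical illustrations, and any numeric IDs used with them are made up.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical speaker table; the real runtime loads its own mapping.
struct SpeakerTable
{
    std::unordered_map<std::string, int32_t> idsByName;
    int32_t defaultId{0};

    // Mirrors the documented lookup: fall back to the default when not found.
    int32_t idByName(std::string const& name) const
    {
        auto const it = idsByName.find(name);
        return it != idsByName.end() ? it->second : defaultId;
    }
};

// An explicit non-negative speakerId wins; otherwise the name is looked up.
// An empty or unknown name resolves to the default speaker.
inline int32_t resolveSpeaker(SpeakerTable const& table, std::string const& speakerName, int32_t speakerId)
{
    if (speakerId >= 0)
    {
        return speakerId;
    }
    return table.idByName(speakerName);
}
```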

struct OmniGenerationRequest

Request structure for Omni inference (Thinker output as input)

Non-streaming: provide fullText (formatted prompt + generated text), which will be tokenized internally to reconstruct layer-0 embeddings via the Thinker embedding table.

Public Members

std::string fullText

Complete formatted text (if textTokenIds empty, tokenized internally)

std::vector<int32_t> textTokenIds

Full token sequence: inputTokenIds + outputIds (including EOS)

rt::Tensor const *thinkerPrefillEmbeds = {nullptr}

Non-owning pointer to this batch’s prefill layer-0 embeddings (with multimodal features). Must point to a [1, prefillLength, thinkerHiddenSize] FP16 (GPU) view for this batch. Caller slices from the full [BS, prefillLen, H] tensor. Generated token embeddings are reconstructed from the TTS embedding table internally.

rt::Tensor const *thinkerHiddenStates = {nullptr}

Non-owning pointer to this batch’s layer-14 hidden states (prefill only). Must point to a [1, prefillLength, thinkerHiddenSize] FP16 (GPU) view for this batch. Only user-segment multimodal token positions are read.

int32_t prefillLength = {0}

Number of prefill tokens (layer0/layer14 cover [0, prefillLength))

int32_t maxAudioLength = {4096}

float talkerTemperature = {0}

int32_t talkerTopK = {0}

float talkerTopP = {0}

float repetitionPenalty = {1.05f}

std::string speakerName = {""}

int32_t speakerId = {-1}

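The per-batch [1, prefillLength, thinkerHiddenSize] view is sliced by the caller from the full [BS, prefillLen, H] tensor. A minimal sketch of the offset arithmetic follows, assuming a contiguous row-major layout in which every batch entry is padded to the same prefillLen. `BatchSlice` and `sliceBatch` are hypothetical helpers; building the actual rt::Tensor view is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Element range of one batch entry inside a contiguous [BS, prefillLen, H] buffer.
struct BatchSlice
{
    std::size_t elementOffset; // index of the first element of the slice
    std::size_t elementCount;  // prefillLen * H elements
};

// Assumes row-major storage with no padding between batch entries.
inline BatchSlice sliceBatch(int32_t batchIndex, int32_t batchSize, int32_t prefillLen, int32_t hiddenSize)
{
    assert(batchIndex >= 0 && batchIndex < batchSize);
    assert(prefillLen >= 0 && hiddenSize > 0);
    std::size_t const perBatch = static_cast<std::size_t>(prefillLen) * static_cast<std::size_t>(hiddenSize);
    return BatchSlice{static_cast<std::size_t>(batchIndex) * perBatch, perBatch};
}
```

For FP16 data the byte offset is elementOffset multiplied by 2.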
struct TalkerGenerationRequest

Talker audio generation request structure.

Contains sampling parameters and input data for audio generation. Sampling parameters are provided per-request (not from config.json).

Public Members

int32_t maxAudioLength = {4096}

Maximum number of audio codec tokens to generate.

float talkerTemperature = {0}

Talker temperature (0 = default 0.9)

int32_t talkerTopK = {0}

Talker top-K (0 = default 50)

float talkerTopP = {0}

Talker top-P (0 = default 1.0)

float repetitionPenalty = {1.05f}

Repetition penalty applied to seen codec tokens (1.0 = disabled)

std::string speakerName = {""}

Speaker name (e.g., “f245”, “m02”) - empty means use default.

int32_t speakerId = {-1}

Speaker ID - if >= 0, overrides speakerName.

std::vector<Message> messages

bool applyChatTemplate = {true}

Whether to apply chat template formatting.

bool addGenerationPrompt = {true}

Whether to add generation prompt at the end.

bool enableThinking = {false}

Whether to enable thinking mode.
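The zero-valued sampling fields above act as "use the default" markers (temperature 0.9, top-K 50, top-P 1.0). One way to picture that resolution is sketched below; `ResolvedSampling` and `resolveSampling` are hypothetical names and not the runtime's own code.

```cpp
#include <cstdint>

// Effective values after the documented defaults are substituted.
struct ResolvedSampling
{
    float temperature;
    int32_t topK;
    float topP;
};

// A value of 0 selects the documented default; any other value passes through.
inline ResolvedSampling resolveSampling(float talkerTemperature, int32_t talkerTopK, float talkerTopP)
{
    return ResolvedSampling{
        talkerTemperature == 0.0f ? 0.9f : talkerTemperature,
        talkerTopK == 0 ? 50 : talkerTopK,
        talkerTopP == 0.0f ? 1.0f : talkerTopP,
    };
}
```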

struct TalkerGenerationResponse

Talker audio generation response structure.

Contains generated RVQ codes and metadata.

Public Members

std::vector<std::vector<std::vector<int32_t>>> batchRvqCodes

std::vector<int32_t> numFramesPerSample

Number of audio frames generated per batch sample.

bool success = {false}

Whether generation succeeded.
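A consumer that hands codes to a separate vocoder typically walks batchRvqCodes as [batchSize][frames][codes]. The sketch below flattens one sample frame by frame. `ResponseView` is a hypothetical stand-in mirroring the documented members, and clamping to numFramesPerSample is an assumption about how the two fields relate.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the documented response members.
struct ResponseView
{
    std::vector<std::vector<std::vector<int32_t>>> batchRvqCodes; // [batchSize][frames][codes]
    std::vector<int32_t> numFramesPerSample;
    bool success{false};
};

// Concatenates the codes of one sample in frame-major order. Returns an empty
// vector if generation failed or the sample index is out of range.
inline std::vector<int32_t> flattenSample(ResponseView const& response, std::size_t sample)
{
    std::vector<int32_t> flat;
    if (!response.success || sample >= response.batchRvqCodes.size()
        || sample >= response.numFramesPerSample.size())
    {
        return flat;
    }
    auto const& frames = response.batchRvqCodes[sample];
    // Assumption: only the first numFramesPerSample[sample] frames are valid.
    auto const reported = static_cast<std::size_t>(std::max<int32_t>(response.numFramesPerSample[sample], 0));
    std::size_t const numFrames = std::min(reported, frames.size());
    for (std::size_t f = 0; f < numFrames; ++f)
    {
        flat.insert(flat.end(), frames[f].begin(), frames[f].end());
    }
    return flat;
}
```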

struct ThinkerTalkerStreamingConfig

Configuration for the Thinker→Talker streaming pipeline.

Public Members

int32_t talkerPrefillThreshold = {4}

Start Talker prefill after this many assistant tokens.

int32_t codecChunkFrames = {0}

Vocode every N frames during flush (0 = disabled)

AudioChunkCallback onAudioChunkReady

Called with chunk RVQ codes [frames][16] when ready.
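To illustrate how codecChunkFrames and onAudioChunkReady fit together, the sketch below hands frames to the callback in groups of N and passes any remainder as a final, shorter chunk. The alias matches the documented AudioChunkCallback; `StreamingConfigView` and `emitChunks` are hypothetical, and flushing the remainder is an assumption rather than documented behaviour.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using AudioChunkCallback = std::function<void(std::vector<std::vector<int32_t>> const& chunkRvqCodes)>;

// Hypothetical mirror of the two chunking-related config members.
struct StreamingConfigView
{
    int32_t codecChunkFrames{0}; // 0 = chunking disabled
    AudioChunkCallback onAudioChunkReady;
};

// Delivers frames in chunks of codecChunkFrames and returns the number of
// callback invocations. Does nothing when chunking is disabled or no callback is set.
inline int32_t emitChunks(std::vector<std::vector<int32_t>> const& frames, StreamingConfigView const& config)
{
    if (config.codecChunkFrames <= 0 || !config.onAudioChunkReady)
    {
        return 0;
    }
    int32_t calls = 0;
    auto const step = static_cast<std::size_t>(config.codecChunkFrames);
    for (std::size_t begin = 0; begin < frames.size(); begin += step)
    {
        std::size_t const end = std::min(begin + step, frames.size());
        std::vector<std::vector<int32_t>> const chunk(
            frames.begin() + static_cast<std::ptrdiff_t>(begin), frames.begin() + static_cast<std::ptrdiff_t>(end));
        config.onAudioChunkReady(chunk);
        ++calls;
    }
    return calls;
}
```

A callback would typically forward each chunk to the vocoder so that audio playback can begin before generation finishes.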
