Qwen3 Omni TTS Runtime
-
class Qwen3OmniTTSRuntime
Talker runtime for Qwen3-Omni RVQ code generation.
LLM-based codec encoder that generates RVQ codes from text tokens and hidden states. Manages two LLM engines (Talker + CodePredictor) and MLP projection layers.
Pipeline:
1. MLP Projection: thinker embed (layer 0) → talker embeddings via text_projection
2. Talker LLM: generate codec tokens autoregressively
3. CodePredictor: generate multi-layer codebook codes (Omni: 15, TTS: 31)
4. Return RVQ codes (vocoding done separately at example layer)
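As a rough illustration of what the pipeline returns, the sketch below assembles one RVQ frame from a Talker codec token plus the CodePredictor's codebook codes. The helper name `assembleFrame` and the slot ordering (Talker token first) are assumptions made for illustration; only the layer counts (15 for Omni, 31 for TTS) come from the pipeline description.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of the runtime API): build one RVQ frame.
// Assumption: the Talker's codec token occupies slot 0, followed by the
// CodePredictor's codes for the remaining codebook layers.
std::vector<int32_t> assembleFrame(int32_t talkerCode, std::vector<int32_t> const& predictorCodes)
{
    assert(!predictorCodes.empty());
    std::vector<int32_t> frame;
    frame.reserve(predictorCodes.size() + 1);
    frame.push_back(talkerCode);
    frame.insert(frame.end(), predictorCodes.begin(), predictorCodes.end());
    return frame;
}
```

With 15 CodePredictor codes this yields 16 codes per frame, which is consistent with the [frames][16] chunk shape documented for the streaming callback.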
Architecture Philosophy:
Talker is an LLM decoder, NOT a multimodal input encoder
Similar to LLMInferenceSpecDecodeRuntime, manages multiple LLM engines
Standalone runtime, not dependent on MultimodalRunner hierarchy
Code2Wav vocoding is separated for better modularity
Public Types
-
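using AudioChunkCallback = std::function<void(std::vector<std::vector<int32_t>> const &chunkRvqCodes)>
Callback type that receives one chunk of RVQ codes, indexed [frames][codes], in the Thinker→Talker streaming pipeline (see ThinkerTalkerStreamingConfig::onAudioChunkReady).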
Public Functions
- Qwen3OmniTTSRuntime(
- std::string const &talkerEngineDir,
- std::string const &codePredictorEngineDir,
- std::string const &tokenizerDir,
- cudaStream_t stream
- )
Construct and fully initialize the TTS runtime.
- Parameters:
talkerEngineDir – Directory containing talker engine, MLP weights, embedding table, etc.
codePredictorEngineDir – Directory containing code_predictor engine and codec embeddings
tokenizerDir – Directory containing tokenizer files. If empty, defaults to talkerEngineDir/../
stream – CUDA stream for operations
- Throws:
std::runtime_error – on any initialization failure
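The tokenizerDir fallback described above can be sketched as a small path helper. `resolveTokenizerDir` is a hypothetical name, and the plain string concatenation is an assumption about how the default is formed; the documentation states only that an empty tokenizerDir defaults to talkerEngineDir/../.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the documented default: an empty tokenizerDir
// falls back to "<talkerEngineDir>/../". How the runtime actually builds the
// path is an assumption.
std::string resolveTokenizerDir(std::string const& talkerEngineDir, std::string const& tokenizerDir)
{
    assert(!talkerEngineDir.empty());
    if (!tokenizerDir.empty())
    {
        return tokenizerDir; // An explicit directory always wins.
    }
    return talkerEngineDir + "/../";
}
```

Because the constructor throws std::runtime_error on any initialization failure, callers would typically wrap construction in a try/catch block.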
-
~Qwen3OmniTTSRuntime()
Destructor.
-
inline std::vector<int32_t> getThinkerHiddenLayerIndices() const
Get required hidden state layer indices from thinker.
- Returns:
Vector of layer indices: 0 (layer 0, the embedding output) and 14 (accept_hidden_layer)
- bool handleAudioGeneration(
- std::vector<TalkerGenerationRequest> const &requests,
- TalkerGenerationResponse &response,
- cudaStream_t stream
- )
Generate audio with RVQ codes (batched).
Implements the complete nested generation loop for a batch of requests:
Talker generation loop (autoregressive, batched engine execution)
CodePredictor generation (mNumRvqLayers per Talker step, per-batch)
Residual connections
Sampling at Runtime Layer (batched)
This is the main entry point for audio generation, analogous to LLMInferenceSpecDecodeRuntime::handleRequest() for standard LLM inference.
Note
Sampling parameters (temperature, topK, topP, repetitionPenalty) are taken from requests[0] and applied uniformly to every request in the batch. This matches LLMInferenceSpecDecodeRuntime's design, where SamplingParams is shared across the batch.
- Parameters:
requests – Batch of requests, each containing per-batch input data
response – Response containing generated RVQ codes [batchSize][frames][codes]
stream – CUDA stream for execution
- Returns:
True if generation succeeded, false otherwise
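The nested loop described above can be sketched on the host with stand-in callbacks in place of the two engines. Everything here is hypothetical: `generateSketch`, the `TalkerStep` and `CodePredictorStep` callback types, and the end-of-sequence code are illustration-only assumptions, and residual connections and sampling are omitted. The sketch shows only how the [batchSize][frames][codes] output is built up.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using RvqCodes = std::vector<std::vector<std::vector<int32_t>>>; // [batch][frames][codes]

// Hypothetical stand-ins for one Talker step and one CodePredictor step.
using TalkerStep = std::function<int32_t(int batchIdx, int frameIdx)>;
using CodePredictorStep = std::function<int32_t(int batchIdx, int frameIdx, int layerIdx)>;

RvqCodes generateSketch(int batchSize, int maxFrames, int numRvqLayers, int32_t eosCode,
    TalkerStep const& talker, CodePredictorStep const& codePredictor)
{
    assert(batchSize > 0 && maxFrames > 0 && numRvqLayers > 0);
    RvqCodes out(batchSize);
    std::vector<bool> finished(batchSize, false);
    for (int frame = 0; frame < maxFrames; ++frame) // outer Talker loop
    {
        bool anyActive = false;
        for (int b = 0; b < batchSize; ++b)
        {
            if (finished[b])
            {
                continue;
            }
            int32_t const talkerCode = talker(b, frame);
            if (talkerCode == eosCode) // assumed stop condition
            {
                finished[b] = true;
                continue;
            }
            std::vector<int32_t> codes{talkerCode};
            for (int layer = 0; layer < numRvqLayers; ++layer) // inner CodePredictor loop
            {
                codes.push_back(codePredictor(b, frame, layer));
            }
            out[b].push_back(std::move(codes));
            anyActive = true;
        }
        if (!anyActive)
        {
            break;
        }
    }
    return out;
}
```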
- inline bool handleAudioGeneration(
- TalkerGenerationRequest const &request,
- TalkerGenerationResponse &response,
- cudaStream_t stream
- )
Convenience wrapper for single-request audio generation.
- bool handleAudioGenerationFromThinker(
- std::vector<OmniGenerationRequest> const &requests,
- TalkerGenerationResponse &response,
- cudaStream_t stream
- )
Generate audio from external Thinker hidden states (Omni inference path, batched).
Instead of tokenizing text and looking up embeddings internally (TTS path), this API accepts pre-computed Thinker layer-0 hidden states and projects them through the MLP to produce Talker input. Used when integrating with llm_inference.
Note
Sampling parameters (temperature, topK, topP, repetitionPenalty) are taken from requests[0] and applied uniformly to every request in the batch. This matches LLMInferenceSpecDecodeRuntime's design, where SamplingParams is shared across the batch.
- Parameters:
requests – Batch of requests, each containing per-batch thinker embeddings
response – Response containing generated RVQ codes [batchSize][frames][codes]
stream – CUDA stream for execution
- Returns:
True if generation succeeded, false otherwise
- inline bool handleAudioGenerationFromThinker(
- OmniGenerationRequest const &request,
- TalkerGenerationResponse &response,
- cudaStream_t stream
- )
Convenience wrapper for single-request Omni audio generation.
- bool handleStreamingGeneration(
- LLMInferenceSpecDecodeRuntime &thinkerRuntime,
- LLMGenerationRequest &thinkerRequest,
- LLMGenerationResponse &thinkerResponse,
- ThinkerTalkerStreamingConfig const &streamingConfig,
- OmniGenerationRequest const &omniBaseRequest,
- TalkerGenerationResponse &talkerResponse,
- cudaStream_t stream
- )
Streaming generation: Thinker and Talker run interleaved on the same CUDA stream.
Uses LLMGenerationRequest::onTokenGenerated to receive per-token callbacks from the Thinker’s decode loop. When enough assistant tokens accumulate, Talker prefill is triggered. Subsequent Thinker tokens incrementally extend trailing_text_hidden, and Talker decode steps are interleaved.
- Parameters:
thinkerRuntime – Thinker LLM runtime (will call handleRequest internally)
thinkerRequest – Thinker request (onTokenGenerated will be overwritten)
streamingConfig – Pipeline tuning parameters
talkerResponse – Output: generated RVQ codes
stream – CUDA stream (shared by Thinker and Talker)
- Returns:
True if the full pipeline succeeded
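The per-token trigger logic described above can be pictured as a small state machine. The class `PrefillTriggerSketch` and the `ThinkerTokenAction` enum are hypothetical names introduced for illustration; the real runtime's internal bookkeeping is not documented here, so this is only a sketch of the threshold behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical actions a Thinker token could cause in the streaming pipeline.
enum class ThinkerTokenAction
{
    kBuffer,             // Not enough assistant tokens yet.
    kStartTalkerPrefill, // Threshold reached: Talker prefill is triggered.
    kExtendTrailingText  // Prefill already done: token extends the trailing text.
};

class PrefillTriggerSketch
{
public:
    explicit PrefillTriggerSketch(int32_t threshold)
        : mThreshold(threshold)
    {
        assert(threshold > 0);
    }

    // Would be called once per Thinker token (e.g. from an onTokenGenerated callback).
    ThinkerTokenAction onTokenGenerated(int32_t tokenId)
    {
        mAssistantTokens.push_back(tokenId);
        if (mPrefillStarted)
        {
            return ThinkerTokenAction::kExtendTrailingText;
        }
        if (static_cast<int32_t>(mAssistantTokens.size()) >= mThreshold)
        {
            mPrefillStarted = true;
            return ThinkerTokenAction::kStartTalkerPrefill;
        }
        return ThinkerTokenAction::kBuffer;
    }

private:
    int32_t mThreshold;
    bool mPrefillStarted{false};
    std::vector<int32_t> mAssistantTokens;
};
```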
-
inline metrics::MultimodalMetrics const &getMetrics() const
Get performance metrics for the Talker pipeline (legacy, kept for backward compatibility).
- Returns:
Reference to metrics object
- inline metrics::OmniTalkerMetrics const &getOmniTalkerMetrics()
Get Omni-specific Talker metrics (frames, RVQ codes, prefill time, exit reason)
- inline metrics::OmniLatencyMetrics const &getOmniLatencyMetrics()
Get Omni audio latency metrics (TTFA, RTF, E2E)
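For readers unfamiliar with RTF, the sketch below computes a real-time factor under its common definition: generation time divided by the duration of the audio produced. The fields of OmniLatencyMetrics are not listed here, so the function takes plain values; `realTimeFactor` and the caller-supplied codec frame rate are assumptions for illustration.

```cpp
#include <cassert>

// Hypothetical helper: RTF = time spent generating / duration of audio produced.
// Values below 1.0 mean generation runs faster than real time.
// The codec frame rate must be supplied by the caller; it is not documented here.
double realTimeFactor(double generationSeconds, int numFrames, double framesPerSecond)
{
    assert(numFrames > 0 && framesPerSecond > 0.0);
    double const audioSeconds = static_cast<double>(numFrames) / framesPerSecond;
    return generationSeconds / audioSeconds;
}
```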
-
inline metrics::OmniLatencyMetrics &getMutableOmniLatencyMetrics()
Get mutable reference to latency metrics (for E2E timing set at example layer)
-
inline cudaEvent_t getTtfaEndEvent() const
Get the TTFA end event (first codec token sampled) for external timing.
-
bool captureDecodingCUDAGraph(cudaStream_t stream)
Capture CUDA graphs for decoding steps (same pattern as LLMInferenceSpecDecodeRuntime).
- Parameters:
stream – CUDA stream for capture
- Returns:
True if all graphs captured successfully
-
int32_t getSpeakerIdByName(std::string const &speakerName) const
Get speaker ID by name.
- Parameters:
speakerName – Speaker name (e.g., “f245”, “m02”)
- Returns:
Speaker ID, or default speaker ID if not found
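The speaker lookup rules documented on this page (an explicit speakerId >= 0 overrides speakerName, and an unknown or empty name falls back to the default speaker) can be combined into one resolution step. The function `resolveSpeakerId`, the lookup table, and the ID values used with it are hypothetical; the runtime's actual speaker table is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical resolution of the effective speaker for a request.
// Assumption: the speaker table is a simple name -> ID map.
int32_t resolveSpeakerId(std::unordered_map<std::string, int32_t> const& speakerTable, int32_t defaultSpeakerId,
    int32_t requestedSpeakerId, std::string const& requestedSpeakerName)
{
    assert(defaultSpeakerId >= 0);
    if (requestedSpeakerId >= 0)
    {
        return requestedSpeakerId; // Explicit ID overrides the name.
    }
    auto const it = speakerTable.find(requestedSpeakerName);
    return it != speakerTable.end() ? it->second : defaultSpeakerId;
}
```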
-
struct TalkerGenerationRequest
Talker audio generation request structure.
Contains sampling parameters and input data for audio generation. Sampling parameters are provided per-request (not from config.json).
Public Members
-
int32_t maxAudioLength = {4096}
Maximum number of audio codec tokens to generate.
-
float talkerTemperature = {0}
Talker temperature (0 = default 0.9)
-
int32_t talkerTopK = {0}
Talker top-K (0 = default 50)
-
float talkerTopP = {0}
Talker top-P (0 = default 1.0)
-
float repetitionPenalty = {1.05f}
Repetition penalty applied to seen codec tokens (1.0 = disabled)
-
std::string speakerName = {""}
Speaker name (e.g., “f245”, “m02”) - empty means use default.
-
int32_t speakerId = {-1}
Speaker ID - if >= 0, overrides speakerName.
-
std::vector<Message> messages
-
bool applyChatTemplate = {true}
Whether to apply chat template formatting.
-
bool addGenerationPrompt = {true}
Whether to add generation prompt at the end.
-
bool enableThinking = {false}
Whether to enable thinking mode.
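The zero-means-default convention for the sampling fields, together with the rule that batched generation reads sampling parameters from requests[0], can be sketched as follows. `TalkerRequestSketch`, `SamplingSketch`, and `resolveSampling` are hypothetical stand-ins that mirror only the documented field names and defaults.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the sampling fields in TalkerGenerationRequest.
struct TalkerRequestSketch
{
    float talkerTemperature{0};
    int32_t talkerTopK{0};
    float talkerTopP{0};
    float repetitionPenalty{1.05f};
};

struct SamplingSketch
{
    float temperature;
    int32_t topK;
    float topP;
    float repetitionPenalty;
};

// Resolve effective sampling parameters: a value of 0 selects the documented
// default, and only requests[0] is consulted for the whole batch.
SamplingSketch resolveSampling(std::vector<TalkerRequestSketch> const& requests)
{
    assert(!requests.empty());
    TalkerRequestSketch const& first = requests[0];
    SamplingSketch s;
    s.temperature = (first.talkerTemperature == 0.0f) ? 0.9f : first.talkerTemperature;
    s.topK = (first.talkerTopK == 0) ? 50 : first.talkerTopK;
    s.topP = (first.talkerTopP == 0.0f) ? 1.0f : first.talkerTopP;
    s.repetitionPenalty = first.repetitionPenalty; // 1.0 disables the penalty
    return s;
}
```

Under this convention, per-request sampling values on later batch entries have no effect, so callers that need different sampling settings would submit separate calls.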
-
struct TalkerGenerationResponse
Talker audio generation response structure.
Contains generated RVQ codes and metadata.
Public Members
-
std::vector<std::vector<std::vector<int32_t>>> batchRvqCodes
Generated RVQ codes, indexed [batchSize][frames][codes].
-
std::vector<int32_t> numFramesPerSample
Number of audio frames generated per batch sample.
-
bool success = {false}
Whether generation succeeded.
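A caller might sanity-check a response before handing the codes to a vocoder. The sketch below uses a hypothetical mirror struct, `TalkerResponseSketch`, and assumes that numFramesPerSample[b] equals the number of frames stored in batchRvqCodes[b]; that relationship is an assumption, not something the documentation states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the documented TalkerGenerationResponse members.
struct TalkerResponseSketch
{
    std::vector<std::vector<std::vector<int32_t>>> batchRvqCodes; // [batch][frames][codes]
    std::vector<int32_t> numFramesPerSample;
    bool success{false};
};

// Returns true when the response reports success and its frame counts agree
// with the stored codes (assumed invariant, see lead-in).
bool isConsistent(TalkerResponseSketch const& response)
{
    if (!response.success || response.batchRvqCodes.size() != response.numFramesPerSample.size())
    {
        return false;
    }
    for (std::size_t b = 0; b < response.batchRvqCodes.size(); ++b)
    {
        if (static_cast<int32_t>(response.batchRvqCodes[b].size()) != response.numFramesPerSample[b])
        {
            return false;
        }
    }
    return true;
}
```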
-
struct OmniGenerationRequest
Request structure for Omni inference (Thinker output as input)
Non-streaming: provide fullText (formatted prompt + generated text), which will be tokenized internally to reconstruct layer-0 embeddings via the Thinker embedding table.
Public Members
-
std::string fullText
Complete formatted text (if textTokenIds empty, tokenized internally)
-
std::vector<int32_t> textTokenIds
Full token sequence: inputTokenIds + outputIds (including EOS)
Non-owning pointer to this batch’s prefill layer-0 embeddings (with multimodal features). Must point to a [1, prefillLength, thinkerHiddenSize] FP16 (GPU) view for this batch. Caller slices from the full [BS, prefillLen, H] tensor. Generated token embeddings are reconstructed from the TTS embedding table internally.
-
rt::Tensor const *thinkerPrefillEmbeds = {nullptr}
Non-owning pointer to this batch’s layer-14 hidden states (prefill only). Must point to a [1, prefillLength, thinkerHiddenSize] FP16 (GPU) view for this batch. Only user-segment multimodal token positions are read.
-
int32_t prefillLength = {0}
Number of prefill tokens (layer0/layer14 cover [0, prefillLength))
-
int32_t maxAudioLength = {4096}
-
float talkerTemperature = {0}
-
int32_t talkerTopK = {0}
-
float talkerTopP = {0}
-
float repetitionPenalty = {1.05f}
-
std::string speakerName = {""}
-
int32_t speakerId = {-1}
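The per-request [1, prefillLength, thinkerHiddenSize] view that the caller slices from the full [BS, prefillLen, H] tensor amounts to a pointer offset when the tensor is contiguous and row-major. The sketch below shows that arithmetic with raw pointers; `BatchViewSketch` and `sliceBatch` are hypothetical, FP16 values are represented as opaque 16-bit words, and the rt::Tensor API itself is deliberately not used because its interface is not documented here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical non-owning view over one request's slice of FP16 data.
struct BatchViewSketch
{
    uint16_t const* data; // FP16 elements stored as raw 16-bit words
    int64_t prefillLength;
    int64_t hiddenSize;
};

// Assumption: `base` points to a contiguous, row-major [batchSize, prefillLen, hidden] buffer.
BatchViewSketch sliceBatch(uint16_t const* base, int64_t batchSize, int64_t prefillLen, int64_t hidden, int64_t batchIdx)
{
    assert(base != nullptr);
    assert(batchIdx >= 0 && batchIdx < batchSize);
    int64_t const elementsPerBatch = prefillLen * hidden;
    return BatchViewSketch{base + batchIdx * elementsPerBatch, prefillLen, hidden};
}
```

Because the request holds a non-owning pointer, the full tensor must stay alive until generation for that request has finished.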
-
struct ThinkerTalkerStreamingConfig
Configuration for Thinker→Talker streaming pipeline.
Public Members
-
int32_t talkerPrefillThreshold = {4}
Start Talker prefill after this many assistant tokens.
-
int32_t codecChunkFrames = {0}
Vocode every N frames during flush (0 = disabled)
-
AudioChunkCallback onAudioChunkReady
Called with chunk RVQ codes [frames][16] when ready.
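The interaction between codecChunkFrames and onAudioChunkReady can be sketched as a small buffering helper. `ChunkEmitterSketch` is a hypothetical class; emitting a final partial chunk on `flush()` is an assumption, since the documentation states only that vocoding happens every N frames and that 0 disables chunking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using AudioChunkCallback = std::function<void(std::vector<std::vector<int32_t>> const& chunkRvqCodes)>;

// Hypothetical helper: buffers frames and invokes the callback every N frames.
class ChunkEmitterSketch
{
public:
    ChunkEmitterSketch(int32_t codecChunkFrames, AudioChunkCallback callback)
        : mChunkFrames(codecChunkFrames)
        , mCallback(std::move(callback))
    {
        assert(codecChunkFrames >= 0);
    }

    void pushFrame(std::vector<int32_t> frame)
    {
        mPending.push_back(std::move(frame));
        if (mChunkFrames > 0 && static_cast<int32_t>(mPending.size()) >= mChunkFrames)
        {
            emit();
        }
    }

    // Assumed behavior: deliver any remaining frames as a final, shorter chunk.
    void flush()
    {
        if (mChunkFrames > 0 && !mPending.empty())
        {
            emit();
        }
    }

private:
    void emit()
    {
        if (mCallback)
        {
            mCallback(mPending);
        }
        mPending.clear();
    }

    int32_t mChunkFrames; // 0 disables chunked delivery
    AudioChunkCallback mCallback;
    std::vector<std::vector<int32_t>> mPending;
};
```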