Qwen3 Omni TTS Runtime#

class Qwen3OmniTTSRuntime#

Talker runtime for Qwen3-Omni RVQ code generation.

LLM-based codec encoder that generates RVQ codes from text tokens and hidden states. Manages two LLM engines (Talker + CodePredictor) and MLP projection layers.

Pipeline:

  1. MLP Projection: thinker embed (layer 0) → talker embeddings via text_projection

  2. Talker LLM: generate codec tokens autoregressively

  3. CodePredictor: generate 15-layer codebook codes

  4. Return RVQ codes (vocoding is performed separately at the example layer)

Architecture Philosophy:

  • Talker is an LLM decoder, NOT a multimodal input encoder

  • Similar to LLMInferenceRuntime, manages multiple LLM engines

  • Standalone runtime, not dependent on MultimodalRunner hierarchy

  • Code2Wav vocoding is separated for better modularity

Public Functions

Qwen3OmniTTSRuntime(
std::string const &talkerEngineDir,
std::string const &codePredictorEngineDir,
std::string const &tokenizerDir,
cudaStream_t stream
)#

Construct and fully initialize the TTS runtime.

Parameters:
  • talkerEngineDir – Directory containing talker engine, MLP weights, embedding table, etc.

  • codePredictorEngineDir – Directory containing code_predictor engine and codec embeddings

  • tokenizerDir – Directory containing tokenizer files. If empty, defaults to talkerEngineDir/../

  • stream – CUDA stream for operations

Throws:

std::runtime_error – on any initialization failure
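
A minimal construction sketch. The header name and engine paths below are hypothetical; only the constructor signature and the `std::runtime_error` contract come from this reference:

```cpp
#include <cuda_runtime.h>
#include <iostream>
#include <stdexcept>

#include "qwen3OmniTTSRuntime.h"  // hypothetical header name

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    try
    {
        // Empty tokenizerDir: the runtime falls back to talkerEngineDir/../
        Qwen3OmniTTSRuntime runtime(
            "/engines/talker",          // talker engine, MLP weights, embedding table
            "/engines/code_predictor",  // code_predictor engine, codec embeddings
            /*tokenizerDir=*/"",
            stream);
    }
    catch (std::runtime_error const& e)
    {
        std::cerr << "TTS runtime init failed: " << e.what() << '\n';
    }

    cudaStreamDestroy(stream);
    return 0;
}
```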

~Qwen3OmniTTSRuntime()#

Destructor.

inline std::vector<int32_t> getThinkerHiddenLayerIndices() const#

Get required hidden state layer indices from thinker.

Returns:

Vector containing {0} for layer 0 (embed)

bool handleAudioGeneration(
TalkerGenerationRequest const &request,
TalkerGenerationResponse &response,
cudaStream_t stream
)#

Generate audio with RVQ codes.

Parameters:
  • request – Request containing sampling parameters and input data

  • response – Response containing generated RVQ codes

  • stream – CUDA stream for execution

Returns:

True if generation succeeded, false otherwise
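
Given a constructed runtime and a valid CUDA stream, a generation call might look like this sketch (field values are illustrative; `runtime` and `stream` are assumed to exist already):

```cpp
// Sketch only: `runtime` is a Qwen3OmniTTSRuntime, `stream` a valid cudaStream_t.
TalkerGenerationRequest request;
request.messages = { /* chat messages for the talker prompt */ };
request.speakerName = "f245";   // resolved to an ID unless speakerId >= 0
request.maxAudioLength = 2048;  // cap on generated codec tokens

TalkerGenerationResponse response;
if (runtime.handleAudioGeneration(request, response, stream))
{
    // response.rvqCodes now holds the 15-layer codebook codes for
    // response.numFrames frames; Code2Wav vocoding happens separately.
}
else
{
    // response.success is false; generation failed.
}
```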

inline metrics::MultimodalMetrics const &getMetrics() const#

Get performance metrics for Talker pipeline.

Returns:

Reference to metrics object

bool captureDecodingCUDAGraph(cudaStream_t stream)#

Capture CUDA graphs for decoding steps (same pattern as LLMInferenceRuntime).

Parameters:

stream – CUDA stream for capture

Returns:

True if all graphs captured successfully

int32_t getSpeakerIdByName(std::string const &speakerName) const#

Get speaker ID by name.

Parameters:

speakerName – Speaker name (e.g., “f245”, “m02”)

Returns:

Speaker ID, or default speaker ID if not found

struct TalkerGenerationRequest#

Talker audio generation request structure.

Contains sampling parameters and input data for audio generation. Sampling parameters are provided per-request (not from config.json).

Public Members

int32_t maxAudioLength = {4096}#

Maximum number of audio codec tokens to generate.

float talkerTemperature = {0}#

Talker temperature (0 = default 0.9)

int32_t talkerTopK = {0}#

Talker top-K (0 = default 50)

float talkerTopP = {0}#

Talker top-P (0 = default 1.0)

float repetitionPenalty = {1.05f}#

Repetition penalty applied to seen codec tokens (1.0 = disabled)

std::string speakerName = {""}#

Speaker name (e.g., “f245”, “m02”) - empty means use default.

int32_t speakerId = {-1}#

Speaker ID - if >= 0, overrides speakerName.

std::vector<Message> messages#

Input messages used to build the talker prompt.

bool applyChatTemplate = {true}#

Whether to apply chat template formatting.

bool addGenerationPrompt = {true}#

Whether to add generation prompt at the end.

bool enableThinking = {false}#

Whether to enable thinking mode.

struct TalkerGenerationResponse#

Talker audio generation response structure.

Contains generated RVQ codes and metadata.

Public Members

std::vector<std::vector<int32_t>> rvqCodes#

Generated RVQ codes.

int32_t numFrames = {0}#

Number of audio frames generated.

bool success = {false}#

Whether generation succeeded.
