Qwen3 Omni TTS Runtime
-
class Qwen3OmniTTSRuntime
Talker runtime for Qwen3-Omni RVQ code generation.
LLM-based codec encoder that generates RVQ codes from text tokens and hidden states. Manages two LLM engines (Talker + CodePredictor) and MLP projection layers.
Pipeline:
MLP projection: map the thinker's layer-0 (embedding) hidden states to talker embeddings via text_projection
Talker LLM: generate codec tokens autoregressively
CodePredictor: generate the 15-layer codebook codes
Output: return the RVQ codes (vocoding is done separately, at the example layer)
Architecture Philosophy:
Talker is an LLM decoder, NOT a multimodal input encoder
Similar to LLMInferenceRuntime, manages multiple LLM engines
Standalone runtime, not dependent on MultimodalRunner hierarchy
Code2Wav vocoding is separated for better modularity
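The Talker and CodePredictor stages of the pipeline above can be pictured with a toy loop. This is only a sketch: `stubTalkerStep`, `stubCodePredictor`, `kStubEosToken`, and `generateFramesSketch` are hypothetical stand-ins, not part of the runtime, and the frame layout (one talker token followed by the 15 CodePredictor codes) is an assumption made for illustration.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the two LLM engines; the real runtime executes
// compiled engines instead of these toy functions.
constexpr int32_t kNumPredictedLayers = 15; // "15-layer codebook codes" from the pipeline
constexpr int32_t kStubEosToken = -1;       // made-up end-of-sequence marker

// Stub talker step: emits three codec tokens (100, 101, 102), then EOS.
int32_t stubTalkerStep(int32_t frame) {
    return frame < 3 ? 100 + frame : kStubEosToken;
}

// Stub code predictor: expands one talker token into 15 further codes.
std::vector<int32_t> stubCodePredictor(int32_t talkerToken) {
    std::vector<int32_t> codes(kNumPredictedLayers);
    for (int32_t layer = 0; layer < kNumPredictedLayers; ++layer) {
        codes[layer] = talkerToken + layer + 1;
    }
    return codes;
}

// Autoregressive outer loop: one talker token per frame, then the predictor
// fills in the remaining codes. ASSUMPTION: a frame is stored as
// [talker token, 15 predicted codes]; the documentation does not state this.
std::vector<std::vector<int32_t>> generateFramesSketch(int32_t maxAudioLength) {
    std::vector<std::vector<int32_t>> frames;
    for (int32_t frame = 0; frame < maxAudioLength; ++frame) {
        int32_t const token = stubTalkerStep(frame);
        if (token == kStubEosToken) {
            break; // stop early when the talker signals the end
        }
        std::vector<int32_t> codes{token};
        std::vector<int32_t> const predicted = stubCodePredictor(token);
        codes.insert(codes.end(), predicted.begin(), predicted.end());
        frames.push_back(std::move(codes));
    }
    return frames;
}
```

The loop is bounded by a maximum length and also stops when the stub emits its end marker, which is the usual shape of an autoregressive generation loop; the real stopping rule is not described here.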
Public Functions
-
Qwen3OmniTTSRuntime(std::string const &talkerEngineDir, std::string const &codePredictorEngineDir, std::string const &tokenizerDir, cudaStream_t stream)
Construct and fully initialize the TTS runtime.
- Parameters:
talkerEngineDir – Directory containing talker engine, MLP weights, embedding table, etc.
codePredictorEngineDir – Directory containing code_predictor engine and codec embeddings
tokenizerDir – Directory containing tokenizer files. If empty, defaults to talkerEngineDir/../
stream – CUDA stream for operations
- Throws:
std::runtime_error – on any initialization failure
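The documented tokenizerDir fallback can be sketched as a small path helper. `resolveTokenizerDirSketch` is a hypothetical name used only for illustration; how the runtime actually builds or normalizes the path is not documented, so this assumes a plain join with `..`.

```cpp
#include <filesystem>
#include <string>

// Hypothetical helper mirroring the documented rule:
// "If empty, defaults to talkerEngineDir/../".
// ASSUMPTION: the path is joined as-is, without normalization.
std::string resolveTokenizerDirSketch(std::string const& talkerEngineDir,
                                      std::string const& tokenizerDir) {
    if (!tokenizerDir.empty()) {
        return tokenizerDir; // an explicit directory always wins
    }
    return (std::filesystem::path(talkerEngineDir) / "..").generic_string();
}
```

Because the constructor throws std::runtime_error on any initialization failure, callers would typically wrap construction in a try/catch block and report `what()`.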
-
~Qwen3OmniTTSRuntime()
Destructor.
-
inline std::vector<int32_t> getThinkerHiddenLayerIndices() const
Get required hidden state layer indices from thinker.
- Returns:
Vector containing {0} for layer 0 (embed)
-
bool handleAudioGeneration(TalkerGenerationRequest const &request, TalkerGenerationResponse &response, cudaStream_t stream)
Generate the RVQ codes for the requested audio. Vocoding to a waveform is done separately.
- Parameters:
request – Request containing sampling parameters and input data
response – Response containing generated RVQ codes
stream – CUDA stream for execution
- Returns:
True if generation succeeded, false otherwise
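A caller-side pattern for handleAudioGeneration might look like the sketch below. It builds without CUDA or the real headers, so everything in it is a stand-in: the structs mirror only a subset of the documented fields, `StreamHandle` replaces cudaStream_t, and `StubTtsRuntime` with its behaviour is hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Local mirrors of the documented structs (subset of fields), for illustration only.
struct TalkerGenerationRequest {
    int32_t maxAudioLength{4096};
    std::string speakerName{""};
};

struct TalkerGenerationResponse {
    std::vector<std::vector<int32_t>> rvqCodes;
    int32_t numFrames{0};
    bool success{false};
};

using StreamHandle = void*; // stand-in for cudaStream_t so the sketch builds without CUDA

// Hypothetical stub with the documented method shape. Its behaviour is invented:
// it fails for a non-positive length and otherwise returns one dummy frame.
struct StubTtsRuntime {
    bool handleAudioGeneration(TalkerGenerationRequest const& request,
                               TalkerGenerationResponse& response,
                               StreamHandle /*stream*/) {
        if (request.maxAudioLength <= 0) {
            response.success = false;
            return false;
        }
        response.rvqCodes = {{1, 2, 3}};
        response.numFrames = 1;
        response.success = true;
        return true;
    }
};

// Defensive caller: treat either a false return value or success == false as failure.
bool generateOrReport(StubTtsRuntime& runtime,
                      TalkerGenerationRequest const& request,
                      TalkerGenerationResponse& response) {
    bool const ok = runtime.handleAudioGeneration(request, response, nullptr);
    return ok && response.success;
}
```

Checking both the return value and the `success` field is a conservative choice; the documentation does not say whether the two can ever disagree.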
-
inline metrics::MultimodalMetrics const &getMetrics() const
Get performance metrics for Talker pipeline.
- Returns:
Reference to metrics object
-
bool captureDecodingCUDAGraph(cudaStream_t stream)
Capture CUDA graphs for decoding steps (same pattern as LLMInferenceRuntime).
- Parameters:
stream – CUDA stream for capture
- Returns:
True if all graphs captured successfully
-
int32_t getSpeakerIdByName(std::string const &speakerName) const
Get speaker ID by name.
- Parameters:
speakerName – Speaker name (e.g., “f245”, “m02”)
- Returns:
Speaker ID, or default speaker ID if not found
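The lookup-with-fallback behaviour described above can be sketched with a plain map. `SpeakerTableSketch`, its fields, and any IDs stored in it are hypothetical; the real table and default ID come from the model's own configuration.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical speaker table; names and IDs are placeholders, not real model values.
struct SpeakerTableSketch {
    std::unordered_map<std::string, int32_t> idsByName;
    int32_t defaultSpeakerId{0};

    // Returns the mapped ID, or the default speaker ID when the name is unknown,
    // matching the documented return contract.
    int32_t getSpeakerIdByName(std::string const& speakerName) const {
        auto const it = idsByName.find(speakerName);
        return it != idsByName.end() ? it->second : defaultSpeakerId;
    }
};
```

Note that an unknown name is not reported as an error under this contract, so a caller that needs to detect typos in speaker names has to validate them separately.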
-
struct TalkerGenerationRequest
Talker audio generation request structure.
Contains sampling parameters and input data for audio generation. Sampling parameters are provided per-request (not from config.json).
Public Members
-
int32_t maxAudioLength = {4096}
Maximum number of audio codec tokens to generate.
-
float talkerTemperature = {0}
Talker temperature (0 = default 0.9)
-
int32_t talkerTopK = {0}
Talker top-K (0 = default 50)
-
float talkerTopP = {0}
Talker top-P (0 = default 1.0)
-
float repetitionPenalty = {1.05f}
Repetition penalty applied to seen codec tokens (1.0 = disabled)
-
std::string speakerName = {""}
Speaker name (e.g., “f245”, “m02”) - empty means use default.
-
int32_t speakerId = {-1}
Speaker ID - if >= 0, overrides speakerName.
-
std::vector<Message> messages
-
bool applyChatTemplate = {true}
Whether to apply chat template formatting.
-
bool addGenerationPrompt = {true}
Whether to add generation prompt at the end.
-
bool enableThinking = {false}
Whether to enable thinking mode.
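The zero-means-default convention and the speaker precedence described in the members above can be sketched as follows. `SamplingRequestSketch` mirrors only a subset of the request fields, and `resolveSamplingSketch`, `EffectiveSampling`, and `speakerSourceSketch` are hypothetical helpers; the runtime's actual resolution code is not shown in this reference.

```cpp
#include <cstdint>
#include <string>

// Subset of the documented TalkerGenerationRequest fields, with the documented defaults.
struct SamplingRequestSketch {
    float talkerTemperature{0};
    int32_t talkerTopK{0};
    float talkerTopP{0};
    std::string speakerName{""};
    int32_t speakerId{-1};
};

struct EffectiveSampling {
    float temperature;
    int32_t topK;
    float topP;
};

// A value of 0 selects the documented default: 0.9, 50, and 1.0 respectively.
EffectiveSampling resolveSamplingSketch(SamplingRequestSketch const& request) {
    return {request.talkerTemperature == 0.0f ? 0.9f : request.talkerTemperature,
            request.talkerTopK == 0 ? 50 : request.talkerTopK,
            request.talkerTopP == 0.0f ? 1.0f : request.talkerTopP};
}

enum class SpeakerSource { ExplicitId, Name, Default };

// Documented precedence: speakerId >= 0 overrides speakerName; an empty name
// means the default speaker.
SpeakerSource speakerSourceSketch(SamplingRequestSketch const& request) {
    if (request.speakerId >= 0) {
        return SpeakerSource::ExplicitId;
    }
    if (!request.speakerName.empty()) {
        return SpeakerSource::Name;
    }
    return SpeakerSource::Default;
}
```

One consequence of this convention is that a temperature of exactly 0 cannot be requested through these fields, since 0 is reserved to mean "use the default".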
-
struct TalkerGenerationResponse
Talker audio generation response structure.
Contains generated RVQ codes and metadata.
Public Members
-
std::vector<std::vector<int32_t>> rvqCodes
-
int32_t numFrames = {0}
Number of audio frames generated.
-
bool success = {false}
Whether generation succeeded.
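Before handing a response to a vocoder, a caller may want a quick sanity check. The sketch below uses a local mirror of the response struct and a hypothetical `summarizeResponseSketch` helper. It deliberately reports only the outer and inner sizes of `rvqCodes`, because this reference does not say which dimension indexes frames and which indexes codebook layers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Local mirror of the documented response fields, for illustration only.
struct TalkerGenerationResponse {
    std::vector<std::vector<int32_t>> rvqCodes;
    int32_t numFrames{0};
    bool success{false};
};

struct ResponseSummarySketch {
    bool usable;           // success flag set, at least one frame, codes present
    std::size_t outerSize; // rvqCodes.size()
    std::size_t innerSize; // size of the first inner vector, or 0 if there is none
};

// Hypothetical helper: no assumption is made about the [frame][layer] ordering.
ResponseSummarySketch summarizeResponseSketch(TalkerGenerationResponse const& response) {
    bool const usable =
        response.success && response.numFrames > 0 && !response.rvqCodes.empty();
    std::size_t const inner =
        response.rvqCodes.empty() ? 0 : response.rvqCodes.front().size();
    return {usable, response.rvqCodes.size(), inner};
}
```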