TTS data processing pipeline

This pipeline processes YouTube Commons (YTC) data for text-to-speech (TTS) training.

The pipeline performs the following steps:

1. Creates the initial manifest by resampling audio to 16 kHz mono WAV format
2. Runs speaker diarization and overlap detection using pyannote
3. Splits long audio segments
4. Aligns text and audio using NeMo ASR models
5. Joins split audio metadata back together
6. Merges alignment and diarization information
7. Performs inverse text normalization
8. Calculates audio quality metrics using TorchSQUIM
9. Estimates audio bandwidth
10. Prepares TTS segments
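
As a rough illustration of step 1, the resampling could look like the torchaudio sketch below; the pipeline's actual implementation may use a different audio backend, and the helper name here is hypothetical:

```python
import torchaudio

def resample_to_16k_mono(src_path: str, dst_path: str) -> None:
    """Convert an arbitrary audio file to 16 kHz mono WAV (step 1)."""
    waveform, sample_rate = torchaudio.load(src_path)  # (channels, samples)
    if waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    torchaudio.save(dst_path, waveform, 16000)
```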

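Step 2 relies on pyannote, which is why a HuggingFace token is required. Below is a minimal standalone sketch of diarization and overlap detection; the checkpoint names are assumptions and may differ from what the pipeline actually loads:

```python
from pyannote.audio import Pipeline

hf_token = "hf_..."  # placeholder; pass your real HuggingFace token

# Speaker diarization (assumed checkpoint; the pipeline's may differ).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
)
annotation = diarizer("path/to/raw/audio/file.wav")
for turn, _, speaker in annotation.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.2f}s -> {turn.end:.2f}s")

# Overlapped-speech detection is a separate pyannote pipeline.
overlap_detector = Pipeline.from_pretrained(
    "pyannote/overlapped-speech-detection", use_auth_token=hf_token
)
overlaps = overlap_detector("path/to/raw/audio/file.wav")
```
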
Required inputs:

  • input_manifest_file: Path to the input manifest JSON file. The manifest must contain "audio_filepath" and "audio_item_id" fields, e.g. {"audio_filepath": "path/to/raw/audio/file.wav", "audio_item_id": "some_unique_id"} (see the sketch after this list)

  • hf_token: HuggingFace token for pyannote access

  • data_split: Data split name (train/dev/test)

  • workspace_dir: Directory for intermediate files

  • language_short: 2-letter language code

  • nemo_path: Path to NeMo installation

  • final_manifest: Path for final output manifest
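
For example, a minimal input manifest can be written as follows, assuming the common NeMo convention of one JSON object per line (the paths and IDs below are the placeholders from the example above):

```python
import json

entries = [
    {"audio_filepath": "path/to/raw/audio/file.wav", "audio_item_id": "some_unique_id"},
]

with open("input_manifest.json", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")  # one JSON object per line
```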

Config link: dataset_configs/tts/ytc/config.yaml
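
For step 8, torchaudio ships reference-free TorchSQUIM models that estimate STOI, PESQ, and SI-SDR directly from audio. A minimal sketch of computing such metrics on one segment (the segment path is a placeholder, and the pipeline's exact wiring may differ):

```python
import torch
import torchaudio
from torchaudio.pipelines import SQUIM_OBJECTIVE

model = SQUIM_OBJECTIVE.get_model()

# Placeholder path; earlier pipeline steps produce 16 kHz mono audio.
waveform, sr = torchaudio.load("path/to/segment.wav")
if sr != SQUIM_OBJECTIVE.sample_rate:  # SQUIM expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, SQUIM_OBJECTIVE.sample_rate)

with torch.inference_mode():
    stoi, pesq, si_sdr = model(waveform)  # reference-free quality estimates
print(f"STOI={stoi.item():.3f} PESQ={pesq.item():.3f} SI-SDR={si_sdr.item():.2f} dB")
```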