Unlabeled Data Processing Pipeline

This pipeline processes unlabeled data for iterative pseudo-labeling training.

The pipeline performs the following steps:

1. Creates an initial manifest by searching for all WAV files in the raw_data_dir folder.
2. Computes the duration of each WAV file.
3. Identifies the language using the langid_ambernet NeMo model.
4. Filters out audio files that are tagged with a different language.
5. Filters out audio files that are too long to be processed.
6. Applies the VAD algorithm from the NeMo repository.
7. Forms segments by joining adjacent segments up to a duration threshold.
8. Splits long audio files into shorter segments.
9. Removes empty files and extra fields from the manifest.
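The segment-joining step can be illustrated with a minimal sketch. It assumes VAD output is available as (start, end) pairs in seconds and that joining is greedy: a segment is merged into the current group as long as the merged span, including the silence in between, stays within the threshold. The function name `join_adjacent_segments` and the 20-second threshold are hypothetical and are not taken from the pipeline config; the actual processor may apply different rules.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def join_adjacent_segments(segments: List[Segment], max_duration: float) -> List[Segment]:
    """Greedily merge consecutive segments while the merged span fits in max_duration."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and end - merged[-1][0] <= max_duration:
            # Extend the current group so it also covers this segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Start a new group; a single segment longer than max_duration
            # is kept as-is here and would need a separate splitting step.
            merged.append((start, end))
    return merged


# Made-up VAD output, for illustration only.
vad_segments = [(0.0, 4.0), (4.5, 9.0), (9.5, 21.0), (30.0, 33.0)]
joined = join_adjacent_segments(vad_segments, max_duration=20.0)
print(joined)  # [(0.0, 9.0), (9.5, 21.0), (30.0, 33.0)]
```

With this greedy rule the first two segments are merged because their combined span is 9 seconds, while adding the third would exceed the assumed 20-second limit, so it starts a new group.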

Required inputs:

workspace_dir: Directory for intermediate files, containing the following subfolders:
${workspace_dir}/wavs/ - Folder with the long source audio files.
${workspace_dir}/sdp/ - Folder to store manifests.
${workspace_dir}/sdp/vad/ - Folder to store temporary files from the VAD algorithm.
${workspace_dir}/splited_wavs/ - Folder to store the short split audio files.
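The layout above can be prepared ahead of time with a short script. This is a sketch only: the text does not say whether the pipeline creates missing subfolders itself, so creating them up front is an assumption, and `prepare_workspace` is a hypothetical helper rather than part of the pipeline.

```python
import tempfile
from pathlib import Path

# Subfolder names as listed in the required inputs.
SUBFOLDERS = ("wavs", "sdp", "sdp/vad", "splited_wavs")


def prepare_workspace(workspace_dir):
    """Create the expected subfolders under workspace_dir and return their paths."""
    root = Path(workspace_dir)
    created = []
    for sub in SUBFOLDERS:
        path = root / sub
        # exist_ok=True makes the call safe to repeat on an existing workspace.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demo in a temporary location; replace with the real workspace_dir.
workspace = Path(tempfile.mkdtemp()) / "workspace"
created = prepare_workspace(workspace)
for path in created:
    print(path)
```

The source audio would then be copied into the wavs subfolder before the pipeline is run.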

language_short: Two-letter language code.
nemo_path: Path to NeMo installation.
final_manifest: Path to the final output manifest.
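Once the pipeline has finished, the file at final_manifest can be inspected with a few lines of Python. The sketch below assumes the usual NeMo manifest layout, one JSON object per line with `audio_filepath` and `duration` fields; which fields remain after the final cleanup step is an assumption, and `summarize_manifest` is a hypothetical helper. The demo entries are made up.

```python
import json
import tempfile
from pathlib import Path


def summarize_manifest(manifest_path):
    """Return (number of entries, total duration in seconds) for a JSON-lines manifest."""
    count = 0
    total_seconds = 0.0
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            count += 1
            total_seconds += float(entry["duration"])  # assumed field name
    return count, total_seconds


# Demo manifest with made-up entries, written to a temporary file.
demo_manifest = Path(tempfile.mkdtemp()) / "final_manifest.json"
demo_entries = [
    {"audio_filepath": "splited_wavs/example_0.wav", "duration": 12.5},
    {"audio_filepath": "splited_wavs/example_1.wav", "duration": 7.5},
]
demo_manifest.write_text(
    "\n".join(json.dumps(e) for e in demo_entries) + "\n", encoding="utf-8"
)

count, total_seconds = summarize_manifest(demo_manifest)
print(f"{count} segments, {total_seconds / 3600:.4f} hours")
```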

Config link: dataset_configs/portuguese/unlabeled/config.yaml