Granary Dataset Creation Pipeline#

Overview#

This configuration drives the Granary pseudo-labelling pipeline – an open-source workflow that transforms large, noisy speech corpora into high-quality Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) training data for 25 European languages.

The first public release of Granary (≈ 643 k h ASR / ≈ 351 k h AST) was built from three openly available corpora and is published as nvidia/Granary.

Note

Per-language runs. The pipeline is executed once per language pair: set

  • source_lang / source_lang_full – audio & transcript language

  • translation.target_lang / target_lang_full – translation language

For example, to obtain English audio with Italian translations, set source_lang: en and translation.target_lang: it. Separate runs are required for each additional language combination.
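For instance, an English → Italian run could be launched with Hydra-style overrides appended to the run command shown under Quick start below. This is only a sketch: it assumes the language keys sit under the params block (as described under Tunable parameters), so verify the exact dotted paths against config.yaml.

# Sketch: English audio -> Italian translations
# (params.* override paths are assumed; verify against config.yaml)
python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR} \
    params.source_lang=en params.source_lang_full=English \
    params.translation.target_lang=it params.translation.target_lang_full=Italian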

Note

GPU required. All Whisper, vLLM and Comet-QE stages expect at least one CUDA-capable GPU. Multi-GPU nodes are auto-detected when num_devices: -1 (default) is used.

Software prerequisites#

Install NeMo-speech-data-processor plus the extra wheels required by specific processors:

FasterWhisperInference:

pip install pytorch-lightning \
            "nvidia-cublas-cu12" \
            "nvidia-cudnn-cu12==9.*" \
            faster_whisper

# Point the dynamic linker at the pip-installed cuBLAS/cuDNN libraries
export LD_LIBRARY_PATH=$(python - <<'PY'
import os, nvidia.cublas.lib, nvidia.cudnn.lib
print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" +
      os.path.dirname(nvidia.cudnn.lib.__file__))
PY
)

vLLMInference:

pip install "optree>=0.13.0" vllm

CometoidWMTQualityEstimation:

pip install pymarian

FastTextLangIdClassifier:

pip install fasttext

ConvertToTarredAudioDataset (optional, only if tar-sharding is enabled):

pip install lhotse "nemo-toolkit[common]==2.2.1"
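If every optional stage will be used, the per-processor installs above can be collapsed into a single command. This is only a convenience sketch of the same packages; resolving them together may surface dependency conflicts that the separate installs avoid.

# One-shot install of all extras listed above (same version pins)
pip install pytorch-lightning \
            "nvidia-cublas-cu12" "nvidia-cudnn-cu12==9.*" faster_whisper \
            "optree>=0.13.0" vllm \
            pymarian fasttext \
            lhotse "nemo-toolkit[common]==2.2.1"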

Quick start#

  1. Hardware – Linux box with NVIDIA GPU(s) and ≥ 16 GB VRAM (reference runs used A100-80 GB; smaller cards work with reduced batch sizes).

  2. Install NeMo-speech-data-processor and the extras listed above.

  3. Prepare the input manifest and set three mandatory YAML keys:

    • input_manifest_file – manifest with raw audio paths

    • output_dir – working/output directory

    • sdp_dir – root of the SDP tree (for prompt/regex assets)

  4. Run the pipeline:

    # Path to your local clone of NeMo-speech-data-processor
    SDP_DIR=/path/to/NeMo-speech-data-processor
    
    python ${SDP_DIR}/main.py \
        --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
        --config-name  config.yaml \
        input_manifest_file=/path/to/input_manifest.json \
        output_dir=/path/to/output/dir \
        sdp_dir=${SDP_DIR}
    

Input and output formats#

Input manifest

Each line is a JSON object with the source-audio path:

{"source_audio_filepath": "/path/to/file.flac"}

Key outputs

  • ${output_dir}/${source_lang}/manifest_46.json – final bilingual manifest containing audio_filepath, offset, duration, text (source) and answer (translation), plus constant decoder flags.

  • ${output_dir}/${source_lang}/tarred_dataset/ – (optional) tarred-audio shards and shard_manifest.json, written when convert_to_audio_tarred_dataset.should_run: True.

  • All intermediate manifest_XX.json files are kept for audit/debug.
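Since every manifest is JSON-lines, a finished run can be sanity-checked by pretty-printing the first record of the final manifest (assuming output_dir and source_lang are set as shell variables matching your config):

# Peek at the first record of the final bilingual manifest
head -n 1 ${output_dir}/${source_lang}/manifest_46.json | python -m json.tool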

Pipeline stages#

The processors executed (indices match the config):

  1. FfmpegConvert (0) – re-encode audio to 16 kHz/mono FLAC.

  2. GetAudioDuration (1) – compute clip length.

  3. RemoveFiles (2) – optionally delete originals (params.save_disk_space).

  4. FasterWhisperInference (3) – first-pass language detection.

  5. LambdaExpression (4) – probability-based LID filtering.

  6. DropSpecifiedFields (5) – remove temporary fields.

  7. FasterWhisperInference (6, 14) – two-pass transcription (second run can slice by offset).

  8. Segmentation & grooming (7–13) – split Whisper segments into atomic utterances.

  9. Hallucination detection (18–20) – drop repeated n-grams, garbage tokens and common filler phrases.

  10. PnC restoration (21–23) – Qwen2.5-7B restores punctuation & capitalisation; optional regex clean-up.

  11. Length & charset filtering (27–36) – word-ratio, character histogram and FastText checks.

  12. Quality estimation (41–43) – keep pairs with Comet-QE score ≥ min_qe_score.

  13. Constant flags (44) – add decoder directives (<|emo:undefined|>, itn, pnc, etc.).

  14. Tarred dataset (46) – shard audio into num_shards tar files (optional).
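Because each stage writes its own manifest_XX.json (see Key outputs above), a per-file line count gives a quick picture of how many utterances survive each filtering stage. A small sketch using the same output layout:

# Count surviving records after each stage (one JSON object per line)
for m in ${output_dir}/${source_lang}/manifest_*.json; do
  printf '%-18s %10d records\n' "$(basename "$m")" "$(wc -l < "$m")"
done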

Tunable parameters#

All knobs live under the params block; an override sketch follows the list below.

  • Language

    • source_lang / source_lang_full

    • translation.target_lang / target_lang_full

  • Audio duration

    • min_audio_duration – drop very short clips (seconds)

    • max_audio_duration – drop very long clips (seconds)

  • Language-ID & text filtering

    • min_audio_lid_probability – Whisper LID threshold

    • translation.min_hist_token_ratio – charset-purity ratio

    • translation.min_text_lid_probability – FastText LID threshold

  • Length & quality

    • translation.max_len_diff_ratio – maximum allowed source/target word-count ratio

    • translation.min_qe_score – Comet-QE acceptance score

  • Tarred dataset

    • convert_to_audio_tarred_dataset.should_run (bool)

    • num_shards and buckets_num – shard layout

  • Misc.

    • use_regex – regex preset for text normalisation

    • save_disk_space – delete originals after conversion

    • use_dask – enable distributed execution (not recommended)
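As an illustration, several of these knobs can be overridden on the command line just like the mandatory keys. The values below are placeholders, and the dotted paths assume the params layout described above, so verify them against config.yaml before relying on them:

# Sketch: tightened duration/quality filters plus tarred output
# (override paths and values are illustrative placeholders)
python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR} \
    params.min_audio_duration=1.0 \
    params.max_audio_duration=40.0 \
    params.translation.min_qe_score=0.75 \
    params.convert_to_audio_tarred_dataset.should_run=True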

Advanced usage#

  • Selective execution – override processors_to_run with a range of indices, e.g. "0:25".

  • Model swapping – every inference processor exposes either model_size_or_path (Whisper) or an embedded model: block (vLLM).

  • Resource tuning – num_devices: -1 uses all visible GPUs; set an explicit integer to pin the number of workers per stage (see the sketch after this list).
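For example, a partial re-run of the early stages on a fixed number of GPUs might look like the sketch below. processors_to_run follows the range syntax mentioned above, while the params.num_devices path is an assumption to verify against config.yaml:

# Sketch: run only processors 0-25 with two workers/GPUs per stage
# (params.num_devices path is assumed; it may also be set per processor)
python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR} \
    processors_to_run="0:25" \
    params.num_devices=2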

References#

Config link: dataset_configs/multilingual/granary/config.yaml