Granary Dataset Creation Pipeline#
Overview#
This configuration drives the Granary pseudo-labelling pipeline – an open-source workflow that transforms large, noisy speech corpora into high-quality Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) training data for 25 European languages.
The first public release of Granary (≈ 643 k h ASR / ≈ 351 k h AST) was built from three openly available corpora and is published as nvidia/Granary.
Note

Per-language runs. The pipeline is executed once per language pair. Set:

source_lang / source_lang_full – audio & transcript language
translation.target_lang / target_lang_full – translation language

For example, to obtain English audio with Italian translations, choose source_lang: en and translation.target_lang: it (see the sketch below). Separate runs are required for each additional language combination.
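As a concrete illustration, the English-to-Italian run above could be launched as follows. This is a sketch: the params.-prefixed key paths assume the layout of the params block described under Tunable parameters, so verify them against your config.yaml before relying on them.

python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR} \
    params.source_lang=en \
    params.source_lang_full=English \
    params.translation.target_lang=it \
    params.translation.target_lang_full=Italian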
Note

GPU required. All Whisper, vLLM and Comet-QE stages expect at least one CUDA-capable GPU. Multi-GPU nodes are auto-detected when num_devices: -1 (the default) is used.
Software prerequisites#
Install NeMo-speech-data-processor plus the extra wheels required by specific processors:
FasterWhisperInference:

pip install pytorch-lightning \
    "nvidia-cublas-cu12" \
    "nvidia-cudnn-cu12==9.*" \
    faster_whisper

export LD_LIBRARY_PATH=$(python - <<'PY'
import os, nvidia.cublas.lib, nvidia.cudnn.lib
print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" +
      os.path.dirname(nvidia.cudnn.lib.__file__))
PY
)
vLLMInference:
pip install "optree>=0.13.0" vllm
CometoidWMTQualityEstimation:
pip install pymarian
FastTextLangIdClassifier:
pip install fasttext
ConvertToTarredAudioDataset (optional, only if tar-sharding is enabled):
pip install lhotse "nemo-toolkit[common]==2.2.1"
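Before launching a long run, it can be worth confirming that the extra wheels import cleanly. This quick sanity check is not part of the official setup, just a convenience:

python -c "import faster_whisper, vllm, pymarian, fasttext; print('all extras import OK')"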
Quick start#
Hardware – Linux box with NVIDIA GPU(s) and ≥ 16 GB VRAM (reference runs used A100-80 GB; smaller cards work with reduced batch sizes).
Install NeMo-speech-data-processor and the extras listed above.
Prepare the input manifest and set three mandatory YAML keys:

input_manifest_file – manifest with raw audio paths
output_dir – working/output directory
sdp_dir – root of the SDP tree (for prompt/regex assets)
Run the pipeline:
# Path to your local clone of NeMo-speech-data-processor
SDP_DIR=/path/to/NeMo-speech-data-processor

python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR}
Input and output formats#
Input manifest
Each line is a JSON object with the source-audio path:
{"source_audio_filepath": "/path/to/file.flac"}
Key outputs
${output_dir}/${source_lang}/manifest_46.json – final bilingual manifest containing audio_filepath, offset, duration, text (source) and answer (translation), plus constant decoder flags.

${output_dir}/${source_lang}/tarred_dataset/ – (optional) tarred-audio shards and shard_manifest.json when convert_to_audio_tarred_dataset.should_run: True.

All intermediate manifest_XX.json files are kept for audit/debug.
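For orientation, a single line of the final manifest might look like the following. All values are invented for illustration, and the constant decoder-flag fields added in stage 44 are omitted:

{"audio_filepath": "/path/to/output/dir/en/audio/file.flac", "offset": 12.48, "duration": 5.31, "text": "Good morning, everyone.", "answer": "Buongiorno a tutti."}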
Pipeline stages#
The processors executed (indices match the config):

FfmpegConvert (0) – re-encode audio to 16 kHz/mono FLAC.
GetAudioDuration (1) – compute clip length.
RemoveFiles (2) – optionally delete originals (params.save_disk_space).
FasterWhisperInference (3) – first-pass language detection.
LambdaExpression (4) – probability-based LID filtering.
DropSpecifiedFields (5) – remove temporary fields.
FasterWhisperInference (6, 14) – two-pass transcription (second run can slice by offset).
Segmentation & grooming (7–13) – split Whisper segments into atomic utterances.
Hallucination detection (18–20) – drop repeated n-grams, garbage tokens and common filler phrases.
PnC restoration (21–23) – Qwen-2.5-7B restores punctuation & capitalisation; optional regex clean-up.
Length & charset filtering (27–36) – word-ratio, character-histogram and FastText checks.
Quality estimation (41–43) – keep pairs with a Comet-QE score ≥ min_qe_score.
Constant flags (44) – add decoder directives (<|emo:undefined|>, itn, pnc, etc.).
Tarred dataset (46) – shard audio into num_shards tar files (optional).
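Because every intermediate manifest_XX.json is kept, you can audit how much audio survives each stage. The loop below is a sketch assuming an English run under the default output layout; manifests written before GetAudioDuration may lack a duration field, which the script treats as zero:

# Sum the hours retained in each intermediate manifest
for m in /path/to/output/dir/en/manifest_*.json; do
  python - "$m" <<'PY'
import json, sys
path = sys.argv[1]
total = sum(json.loads(line).get("duration", 0.0)
            for line in open(path) if line.strip())
print(f"{path}: {total / 3600:.1f} h")
PY
done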
Tunable parameters#
All knobs live under the params block.
Language

source_lang / source_lang_full
translation.target_lang / target_lang_full

Audio duration

min_audio_duration – drop very short clips (seconds)
max_audio_duration – drop very long clips (seconds)

Language-ID & text filtering

min_audio_lid_probability – Whisper LID threshold
translation.min_hist_token_ratio – charset-purity ratio
translation.min_text_lid_probability – FastText LID threshold

Length & quality

translation.max_len_diff_ratio – maximum src/tgt word-count ratio
translation.min_qe_score – Comet-QE acceptance score

Tarred dataset

convert_to_audio_tarred_dataset.should_run (bool)
num_shards and buckets_num – shard layout

Misc.

use_regex – regex preset for text normalisation
save_disk_space – delete originals after conversion
use_dask – enable distributed execution (not recommended)
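For example, tightening the duration and quality gates at launch time could look like this. The values are illustrative only, and the params.-prefixed key paths assume the block layout above; check them against your config.yaml:

python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR} \
    params.min_audio_duration=1.0 \
    params.max_audio_duration=40.0 \
    params.translation.min_qe_score=0.75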
Advanced usage#
Selective execution – override processors_to_run with a range of indices, e.g. "0:25".
Model swapping – every inference processor exposes either model_size_or_path (Whisper) or an embedded model: block (vLLM).
Resource tuning – num_devices = -1 uses all visible GPUs; set an integer to pin workers per stage (see the sketch below).
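Putting these together, re-running only the first half of the pipeline on two GPUs might look like the following sketch. processors_to_run and num_devices are the config keys named above; where exactly they sit in config.yaml may differ, so treat the override paths as assumptions:

python ${SDP_DIR}/main.py \
    --config-path ${SDP_DIR}/dataset_configs/multilingual/granary/ \
    --config-name config.yaml \
    input_manifest_file=/path/to/input_manifest.json \
    output_dir=/path/to/output/dir \
    sdp_dir=${SDP_DIR} \
    processors_to_run="0:25" \
    num_devices=2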
References#
Koluguri et al. (2025). Granary: Speech Recognition and Translation Dataset in 25 European Languages (preprint). arXiv: 2505.13404.
Granary dataset on Hugging Face: nvidia/Granary.
NeMo-SDP source code: NVIDIA/NeMo-speech-data-processor.
Config link: dataset_configs/multilingual/granary/config.yaml