Supported datasets#
If a dataset you need is not supported, feel free to raise an issue or add the new processing yourself. Contributions from the community are always welcome and encouraged!
The following datasets are already supported by SDP.
Mozilla Common Voice (MCV)#
Dataset link: https://commonvoice.mozilla.org/
Required manual steps: MCV requires agreeing to certain conditions, so you need to manually download the data archive and specify its location with the raw_data_dir parameter of the sdp.processors.CreateInitialManifestMCV class.
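As a rough illustration of the manual step, the sketch below checks that the manually downloaded MCV data is where you expect it and then builds a processor entry that passes that location through raw_data_dir. Only raw_data_dir and sdp.processors.CreateInitialManifestMCV come from the text above. The helper name mcv_initial_manifest_entry and the Hydra-style `_target_` key are assumptions made for illustration, and the other arguments the class may require are deliberately left out.

```python
from pathlib import Path


def mcv_initial_manifest_entry(raw_data_dir: str) -> dict:
    """Build a processor entry pointing at manually downloaded MCV data.

    Hypothetical helper, not part of SDP: it only validates the path and
    returns a plain dict that mirrors how a config entry might look.
    """
    raw_dir = Path(raw_data_dir).expanduser()
    # Fail early: the MCV archive has to be downloaded by hand beforehand,
    # so a missing directory almost always means that step was skipped.
    if not raw_dir.is_dir():
        raise FileNotFoundError(f"MCV raw data directory not found: {raw_dir}")
    return {
        # Assumption: the config selects the processor class via `_target_`.
        "_target_": "sdp.processors.CreateInitialManifestMCV",
        "raw_data_dir": str(raw_dir),
        # Any further arguments of the class are intentionally omitted here;
        # consult the class documentation for the full list.
    }
```

Calling mcv_initial_manifest_entry with the folder that holds your downloaded archive returns the entry, while a wrong path raises FileNotFoundError before any processing starts.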
Supported configs.
Italian: config | documentation
Spanish: config | documentation
Portuguese: config | documentation
Kazakh: config | documentation
Georgian: config | documentation
Uzbek: config | documentation
Multilingual LibriSpeech (MLS)#
Dataset link: https://www.openslr.org/94/
Supported configs.
Italian (with punctuation and capitalization): config | documentation
Italian (no punctuation and capitalization): config | documentation
Spanish (with punctuation and capitalization): config | documentation
Spanish (no punctuation and capitalization): config | documentation
Portuguese (with punctuation and capitalization): config | documentation
VoxPopuli#
Dataset link: facebookresearch/voxpopuli
Supported configs.
Italian: config | documentation
Spanish: config | documentation
Fisher#
Dataset link: https://catalog.ldc.upenn.edu/LDC2004T19
Required manual steps: You need to manually download the data from the above link.
Supported configs.
Spanish: config | documentation
UK and Ireland English Dialect (SLR83)#
Dataset link: https://openslr.org/83/
Corpus of Regional African American Language (CORAAL)#
Dataset link: https://oraal.uoregon.edu/coraal
Corpus of Armenian Text to Upload into Common Voice (MCV)#
Dataset link: https://commonvoice.mozilla.org/
Corpus based on Armenian audiobooks#
Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS)#
Dataset link: https://huggingface.co/datasets/google/fleurs
Armenian:
Uzbek:
LibriSpeech#
Dataset links: https://openslr.org/12 (regular), https://openslr.org/31 (Mini LibriSpeech)
Supported configs.
- config (for processing one specific subset at a time):
- mini:
- all (for obtaining all subsets in one go):
Coraa Brazilian Portuguese dataset#
Dataset link: nilc-nlp/CORAA
MTEDx#
Dataset link: https://www.openslr.org/100/
Supported configs.
Portuguese: config | documentation
Kazakh Speech Dataset (SLR140)#
Dataset link: https://www.openslr.org/140/
Kazakh Speech Corpus (SLR102)#
Dataset link: https://www.openslr.org/102/
Kazakh Speech Corpus 2 (KSC2)#
Dataset link: https://issai.nu.edu.kz/kz-speech-corpus/
Required manual steps: You need to request the dataset from the website and, after getting approval, download it manually from Dropbox.
UzbekVoice#
Dataset link: https://corpus.uzbekvoice.ai/en-US
Required manual steps: You need to download the dataset from the Google Drive folder provided on the website.