Supported datasets#

If a dataset you need is not supported, feel free to raise an issue or add the processing yourself. Contributions from the community are always welcome and encouraged!

The following datasets are already supported by SDP.

Mozilla Common Voice (MCV)#

Dataset link: https://commonvoice.mozilla.org/

Required manual steps: MCV requires agreeing to its terms of use, so you need to download the data archive manually and specify its location with the raw_data_dir parameter of the sdp.processors.CreateInitialManifestMCV class.
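As a rough illustration of the manual step above, the sketch below builds a config entry that points the processor at a manually downloaded archive. Only the class path and the raw_data_dir parameter come from this page. The helper name mcv_manifest_step, the Hydra-style "_target_" key, and the directory check are assumptions made for the example, and any other arguments the processor may require are left out.

```python
from pathlib import Path


def mcv_manifest_step(raw_data_dir: str) -> dict:
    """Build a config entry for the MCV initial-manifest step (hypothetical helper).

    Fails early if the directory holding the manually downloaded
    archive does not exist, since SDP cannot fetch MCV data itself.
    """
    data_dir = Path(raw_data_dir).expanduser()
    if not data_dir.is_dir():
        raise FileNotFoundError(
            f"Download the MCV archive manually and place it in {data_dir}"
        )
    return {
        # Assumption: processors are referenced Hydra-style via "_target_".
        "_target_": "sdp.processors.CreateInitialManifestMCV",
        # The only parameter named on this page.
        "raw_data_dir": str(data_dir),
        # Any further processor arguments are intentionally omitted here.
    }
```

Checking the directory before launching a pipeline surfaces a missing download immediately instead of partway through processing.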

Supported configs.

Multilingual LibriSpeech (MLS)#

Dataset link: https://www.openslr.org/94/

Supported configs.

VoxPopuli#

Dataset link: https://github.com/facebookresearch/voxpopuli

Supported configs.

Fisher#

Dataset link: https://catalog.ldc.upenn.edu/LDC2004T19

Required manual steps: You need to manually download the data from the above link.

Supported configs.

UK and Ireland English Dialect (SLR83)#

Dataset link: https://openslr.org/83/

config | documentation

Corpus of Regional African American Language (CORAAL)#

Dataset link: https://oraal.uoregon.edu/coraal

config | documentation

Corpus of Armenian Text to Upload into Common Voice (MCV)#

Dataset link: https://commonvoice.mozilla.org/

config | documentation

Corpus based on Armenian audiobooks#

config | documentation

Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS)#

Dataset link: https://huggingface.co/datasets/google/fleurs

  • Armenian:

config | documentation

  • Uzbek:

config | documentation

LibriSpeech#

Dataset links: https://openslr.org/12 (regular), https://openslr.org/31 (Mini LibriSpeech)

Supported configs.

CORAA Brazilian Portuguese dataset#

Dataset link: https://github.com/nilc-nlp/CORAA

config | documentation

MTEDx#

Dataset link: https://www.openslr.org/100/

Supported configs.

Kazakh Speech Dataset (SLR140)#

Dataset link: https://www.openslr.org/140/

config | documentation

Kazakh Speech Corpus (SLR102)#

Dataset link: https://www.openslr.org/102/

config | documentation

Kazakh Speech Corpus 2 (KSC2)#

Dataset link: https://issai.nu.edu.kz/kz-speech-corpus/

Required manual steps: You need to request the dataset on the website and, once your request is approved, download it manually from Dropbox.

config | documentation

UzbekVoice#

Dataset link: https://corpus.uzbekvoice.ai/en-US

Required manual steps: You need to download the dataset manually from the Google Drive folder linked on the website.

config | documentation