Supported datasets#

If a dataset you need is not supported, feel free to raise an issue or add the processing for it yourself. Contributions from the community are always welcome and encouraged!

The following datasets are already supported by SDP.

Mozilla Common Voice (MCV)#

Dataset link: https://commonvoice.mozilla.org/

Required manual steps: MCV requires agreeing to certain conditions, so you need to download the data archive manually and specify its location with the raw_data_dir parameter of the sdp.processors.CreateInitialManifestMCV class.
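
As a rough illustration of the manual step above, the sketch below checks that the downloaded data directory exists before building a processor entry that points at it. The helper name mcv_manifest_step is hypothetical, and the Hydra-style `_target_` key is an assumption about how the processor is referenced in a config. Only the class name and the raw_data_dir parameter come from the text; any other parameters the class may require are not shown.

```python
from pathlib import Path


def mcv_manifest_step(raw_data_dir: str) -> dict:
    """Build a config entry for the MCV initial-manifest processor.

    Hypothetical helper: only the class name and `raw_data_dir` are taken
    from the documentation; the `_target_` layout is an assumption.
    """
    data_dir = Path(raw_data_dir).expanduser()
    # Fail early if the manually downloaded data is not where we expect it.
    if not data_dir.is_dir():
        raise FileNotFoundError(
            f"{data_dir} does not exist; download the MCV archive manually first"
        )
    return {
        "_target_": "sdp.processors.CreateInitialManifestMCV",
        "raw_data_dir": str(data_dir),
    }
```

The returned dictionary could then be merged into a list of processor entries. Check the supported configs for the actual set of parameters before relying on this layout.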

Supported configs.

Multilingual LibriSpeech (MLS)#

Dataset link: https://www.openslr.org/94/

Supported configs.

VoxPopuli#

Dataset link: https://github.com/facebookresearch/voxpopuli

Supported configs.

Fisher#

Dataset link: https://catalog.ldc.upenn.edu/LDC2004T19

Required manual steps: You need to manually download the data from the above link.

Supported configs.

UK and Ireland English Dialect (SLR83)#

Dataset link: https://openslr.org/83/

config | documentation

Corpus of Regional African American Language (CORAAL)#

Dataset link: https://oraal.uoregon.edu/coraal

config | documentation

Corpus of Armenian Text to Upload into Common Voice (MCV)#

Dataset link: https://commonvoice.mozilla.org/

config | documentation

Corpus based on Armenian audiobooks#

config | documentation

Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS)#

Dataset link: https://huggingface.co/datasets/google/fleurs

config | documentation

English LibriSpeech (ELS)#

Dataset link: https://openslr.org/12

config | documentation

Coraa Brazilian Portuguese dataset#

Dataset link: https://github.com/nilc-nlp/CORAA

config | documentation

MTEDx#

Dataset link: https://www.openslr.org/100/

Supported configs.