Supported datasets

If a dataset you need is not supported, feel free to raise an issue or add the processing yourself. Contributions from the community are always welcome and encouraged!

The following datasets are already supported by SDP.

Mozilla Common Voice (MCV)

Dataset link: https://commonvoice.mozilla.org/

Required manual steps: MCV requires agreeing to certain conditions before download, so you need to download the data archive manually and specify its location with the raw_data_dir parameter of the sdp.processors.CreateInitialManifestMCV class.
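As a rough illustration of the manual step above, the sketch below checks that the downloaded archive is in place before the path is handed to the processor. The helper name mcv_processor_entry is hypothetical and not part of SDP. The "_target_" key follows the Hydra convention and is an assumption here, as is the expectation that the archive is a .tar.gz file placed directly inside raw_data_dir. Only the class name and the raw_data_dir parameter come from the text above.

```python
from pathlib import Path


def mcv_processor_entry(raw_data_dir):
    """Return a config-style dict for the MCV initial-manifest step.

    Hypothetical helper: raises FileNotFoundError when the manually
    downloaded data cannot be found.
    """
    raw_data_dir = Path(raw_data_dir)
    if not raw_data_dir.is_dir():
        raise FileNotFoundError(f"raw_data_dir does not exist: {raw_data_dir}")

    # Assumption: the archive is a .tar.gz file sitting directly in raw_data_dir.
    archives = sorted(raw_data_dir.glob("*.tar.gz"))
    if not archives:
        raise FileNotFoundError(f"no .tar.gz archive found in {raw_data_dir}")

    return {
        # Class and parameter names are taken from the documentation text;
        # the "_target_" key is an assumed Hydra-style convention.
        "_target_": "sdp.processors.CreateInitialManifestMCV",
        "raw_data_dir": str(raw_data_dir),
    }
```

Failing early like this avoids starting a processing run that would stop at its first step because the archive was never downloaded.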

Supported configs.

Italian: config | documentation
Spanish: config | documentation
Portuguese: config | documentation
Kazakh: config | documentation
Georgian: config | documentation
Uzbek: config | documentation
Arabic: config | documentation

Multilingual LibriSpeech (MLS)

Dataset link: https://www.openslr.org/94/

Supported configs.

Italian (with punctuation and capitalization): config | documentation
Italian (no punctuation and capitalization): config | documentation
Spanish (with punctuation and capitalization): config | documentation
Spanish (no punctuation and capitalization): config | documentation
Portuguese (with punctuation and capitalization): config | documentation

VoxPopuli

Dataset link: https://github.com/facebookresearch/voxpopuli

Supported configs.

Italian: config | documentation
Spanish: config | documentation

Fisher

Dataset link: https://catalog.ldc.upenn.edu/LDC2004T19

Required manual steps: You need to manually download the data from the above link.

Supported configs.

Spanish: config | documentation

UK and Ireland English Dialect (SLR83)

Dataset link: https://openslr.org/83/

config | documentation

Corpus of Regional African American Language (CORAAL)

Dataset link: https://oraal.uoregon.edu/coraal

config | documentation

Corpus of Armenian Text to Upload into Common Voice (MCV)

Dataset link: https://commonvoice.mozilla.org/

config | documentation

Corpus based on Armenian audiobooks

config | documentation

Few-shot Learning Evaluation of Universal Representations of Speech (FLEURS)

Dataset link: https://huggingface.co/datasets/google/fleurs

Armenian: config | documentation
Uzbek: config | documentation
Arabic: config | documentation

LibriSpeech

Dataset links: https://openslr.org/12 (regular), https://openslr.org/31 (Mini LibriSpeech)

Supported configs.

config (for processing one specific subset at a time): config | documentation
mini: config | documentation
all (for obtaining all subsets in one go): config | documentation