Text MCV (Armenian)
This config can be used to prepare a text corpus for submission to Common Voice: https://common-voice.github.io/community-playbook/sub_pages/text.html
This config performs the following data processing steps:
Create an initial manifest by collecting all available files with the txt extension in the raw_data_dir folder.
Read text files line by line.
Normalize text lines using Regex.
Split lines into sentences.
Replace common transcription errors as well as “non-linguistic”, “unintelligible” and “redacted” flags.
Drop everything with non-Armenian characters.
Drop all utterances that are shorter than 3 words or longer than 15 words.
Extract the source book name.
Convert into the target csv format.
Take a random subsample.
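The normalization, sentence splitting and filtering steps above can be sketched in plain Python. This is only an illustration, not the processors the config actually uses: the function names are hypothetical, the allowed character set (Armenian letters, Armenian punctuation, spaces and a few common marks) is an assumption, and splitting on the Armenian full stop (U+0589) is a simplification.

```python
import re

# Assumed allowed set: Armenian letters (U+0531-U+0556, U+0561-U+0587),
# Armenian punctuation (U+055A-U+055F, U+0589, U+058A), spaces and a few
# common marks. The real config may allow a different set.
ALLOWED = re.compile(
    r"^[\u0531-\u0556\u0561-\u0587\u055A-\u055F\u0589\u058A ,.\-]+$"
)


def normalize(line):
    """Collapse runs of whitespace and trim the line."""
    return re.sub(r"\s+", " ", line).strip()


def split_sentences(line):
    """Split a line into chunks that each end at an Armenian full stop."""
    chunks = re.findall(r"[^\u0589]+\u0589?", line)
    return [c.strip() for c in chunks if c.strip()]


def keep(sentence, min_words=3, max_words=15):
    """Keep sentences with only allowed characters and 3 to 15 words."""
    n_words = len(sentence.split())
    return bool(ALLOWED.match(sentence)) and min_words <= n_words <= max_words


def process(lines):
    """Apply normalize -> split -> filter to an iterable of text lines."""
    kept = []
    for line in lines:
        for sentence in split_sentences(normalize(line)):
            if keep(sentence):
                kept.append(sentence)
    return kept
```

Calling process on the lines of a text file returns only the sentences that pass both the character filter and the word-count filter.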
Required arguments.
workspace_dir: specify the workspace folder where all audio files will be stored.
Note that you can customize any part of this config either directly or from the command line.
Here are some common customizations to consider:
Output format.
Output manifest final_manifest.json contains the following fields:
Sentence (str): text of sentence to vocalise.
Source (str): source book.
Output manifest manifest13.tsv contains the same data as final_manifest.json but in tsv format.
Output manifest manifest14.tsv contains a random subset of the data from manifest13.tsv.
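As a rough illustration of the last two steps (conversion to tsv and random subsampling), the sketch below uses only the Python standard library. It assumes the manifest is stored as JSON lines (one JSON object per line) with the Sentence and Source fields listed above; the function names, the header row and the seed are hypothetical and are not taken from the config.

```python
import csv
import json
import random

FIELDS = ["Sentence", "Source"]


def manifest_to_tsv(json_lines, out):
    """Write JSON-lines manifest entries to `out` as TSV; return the rows."""
    rows = [json.loads(line) for line in json_lines if line.strip()]
    writer = csv.DictWriter(
        out,
        fieldnames=FIELDS,
        delimiter="\t",
        extrasaction="ignore",  # skip any fields other than Sentence/Source
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return rows


def subsample(rows, n, seed=0):
    """Return up to `n` rows chosen at random, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))
```

Here `out` can be any writable text stream, for example a file opened with newline="" or an io.StringIO buffer.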
Config link: dataset_configs/armenian/text_mcv/config.yaml