Text MCV (Armenian)


This config can be used to prepare a text corpus for submission to Common Voice: https://common-voice.github.io/community-playbook/sub_pages/text.html

This config performs the following data processing.

  1. Create the initial manifest by collecting all available files with the txt extension in the raw_data_dir folder.

  2. Read text files line by line.

  3. Normalize text lines using Regex.

  4. Split lines into sentences.

  5. Replace common transcription errors as well as “non-linguistic”, “unintelligible” and “redacted” flags.

  6. Drop every sentence that contains non-Armenian characters.

  7. Drop all utterances that are shorter than 3 words or longer than 15 words.

  8. Extract source book name.

  9. Convert into target csv format.

  10. Get random subsample.
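The splitting and filtering steps (4, 6 and 7) can be illustrated with a minimal Python sketch. This is not the code the config runs: the character set (the Armenian Unicode block plus whitespace and a few ASCII marks), the sentence-splitting rule, and the function names `split_sentences`, `keep` and `filter_line` are all assumptions made for illustration.

```python
import re

# Assumption: the Armenian Unicode block (U+0530-U+058F) plus whitespace and a
# few ASCII punctuation marks approximates the allowed alphabet.
ALLOWED = re.compile(r"^[\u0530-\u058F\s,.\-]+$")

# Assumption: a sentence ends at an Armenian full stop (։) or an ASCII period
# that is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[։.])\s+")


def split_sentences(line):
    """Split one text line into sentences (step 4)."""
    return [s.strip() for s in SENTENCE_END.split(line) if s.strip()]


def keep(sentence, min_words=3, max_words=15):
    """Return True if the sentence passes steps 6 and 7."""
    if not ALLOWED.match(sentence):
        return False  # contains a character outside the assumed Armenian set
    n_words = len(sentence.split())
    return min_words <= n_words <= max_words


def filter_line(line):
    """Split a line and keep only the sentences that pass the filters."""
    return [s for s in split_sentences(line) if keep(s)]
```

For example, `filter_line("Ես սիրում եմ գրքեր։ Hello world again։ Այո։")` keeps only the first sentence: the second contains Latin characters and the third has fewer than 3 words.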

Required arguments.

  • workspace_dir: specify the workspace folder where all files will be stored.

Note that you can customize any part of this config either directly or from the command line.

Here are some common customizations to consider:

Output format.

Output manifest final_manifest.json contains the following fields:

  • Sentence (str): text of the sentence to vocalise.

  • Source (str): source book.

Output manifest manifest13.tsv contains the same data as final_manifest.json but in tsv format.

Output manifest manifest14.tsv contains a random subset of the data from manifest13.tsv.
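As a rough illustration of the output format, the sketch below writes rows with the Sentence and Source fields to tab-separated text and draws a random subset. It is not the pipeline's own code: the helper names `to_tsv` and `subsample`, the presence of a header row, and the `k` and `seed` parameters are hypothetical.

```python
import csv
import io
import random


def to_tsv(rows, out):
    """Write rows (dicts with Sentence and Source keys) as tab-separated text."""
    # Assumption: the file starts with a header row naming the two fields.
    writer = csv.DictWriter(out, fieldnames=["Sentence", "Source"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)


def subsample(rows, k, seed=0):
    """Return up to k rows chosen at random; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Usage with in-memory data; real rows would come from final_manifest.json.
rows = [{"Sentence": f"sentence {i}", "Source": "book"} for i in range(5)]
buffer = io.StringIO()
to_tsv(subsample(rows, 2), buffer)
```

Fixing the seed is a design choice for reproducibility; whether the actual config seeds its subsampling is not stated here.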

Config link: dataset_configs/armenian/text_mcv/config.yaml