Text MCV (Armenian)

This config can be used to prepare a text corpus for submission to Common Voice: https://common-voice.github.io/community-playbook/sub_pages/text.html

This config performs the following data processing steps:

Create the initial manifest by collecting all available files with the txt extension in the raw_data_dir folder.
Read text files line by line.
Normalize text lines using Regex.
Split lines into sentences.
Replace common transcription errors as well as “non-linguistic”, “unintelligible” and “redacted” flags.
Drop every sentence that contains non-Armenian characters.
Drop all utterances that are shorter than 3 words or longer than 15 words.
Extract source book name.
Convert into target csv format.
Get random subsample.
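
The normalization, sentence-splitting and filtering steps above can be sketched roughly as follows. This is only an illustration, not the processors the config actually uses: the helper names (`normalize`, `split_sentences`, `keep`, `process`) are hypothetical, the set of characters treated as "Armenian" is an assumption, and sentences are split only after the Armenian full stop (։).

```python
import re
from typing import Iterable, List

# Assumption: "Armenian characters" means Armenian letters and punctuation plus
# whitespace, commas, guillemets and hyphens; the real config may use another set.
ARMENIAN_ONLY = re.compile(
    r"^[\u0531-\u0556\u0561-\u0587\u055A-\u055F\u0589\u058A\s,«»\-]+$"
)
# Split on whitespace that follows the Armenian full stop (U+0589).
SENTENCE_END = re.compile(r"(?<=\u0589)\s+")


def normalize(line: str) -> str:
    """Collapse runs of whitespace into single spaces and trim both ends."""
    return re.sub(r"\s+", " ", line).strip()


def split_sentences(line: str) -> List[str]:
    """Split a normalized line into sentences, dropping empty pieces."""
    return [s for s in SENTENCE_END.split(line) if s]


def keep(sentence: str, min_words: int = 3, max_words: int = 15) -> bool:
    """Keep sentences made only of allowed characters with 3 to 15 words."""
    n_words = len(sentence.split())
    return bool(ARMENIAN_ONLY.match(sentence)) and min_words <= n_words <= max_words


def process(lines: Iterable[str]) -> List[str]:
    """Normalize each line, split it into sentences and apply the filters."""
    return [
        sentence
        for line in lines
        for sentence in split_sentences(normalize(line))
        if keep(sentence)
    ]
```

For example, `process(["Ես  գնում եմ տուն։ Hello world foo bar։", "Այո։"])` keeps only the first sentence: the second contains Latin characters and the third has fewer than 3 words.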

Required arguments.

workspace_dir: specify the workspace folder where all audio files will be stored.

Note that you can customize any part of this config either directly or from the command line.

Here are some common customizations to consider:

Output format.

Output manifest final_manifest.json contains the following fields:

Output manifest manifest13.tsv contains the same data as final_manifest.json but in TSV format.

Output manifest manifest14.tsv contains a random subset of the data from manifest13.tsv.
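
As a rough illustration of how the outputs relate, the sketch below converts manifest entries to TSV and draws a random subset. It assumes the JSON manifest stores one JSON object per line; the helper names and the field names `text` and `source` are hypothetical placeholders, not the actual manifest schema.

```python
import csv
import io
import json
import random


def manifest_to_tsv(json_lines, fieldnames):
    """Render JSON-lines manifest entries as TSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        delimiter="\t",
        extrasaction="ignore",  # silently skip fields not listed in fieldnames
        lineterminator="\n",
    )
    writer.writeheader()
    for line in json_lines:
        if line.strip():  # ignore blank lines
            writer.writerow(json.loads(line))
    return buf.getvalue()


def random_subset(rows, size, seed=0):
    """Return a reproducible random sample of at most `size` rows."""
    rows = list(rows)
    return random.Random(seed).sample(rows, min(size, len(rows)))


# Hypothetical entries; the real manifest fields may differ.
entries = [
    '{"text": "ա", "source": "book1"}',
    '{"text": "բ", "source": "book2"}',
]
tsv_text = manifest_to_tsv(entries, ["text", "source"])
subset = random_subset(entries, 1)
```

Fixing the seed keeps the subsample reproducible across runs, which helps when the same subset needs to be regenerated.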