Speech Synthesis¶
Models¶
Currently we support the following models:
Model | Config file | Audio samples | Checkpoint
---|---|---|---
Tacotron-2 | tacotron_float.py | here | link
Tacotron-2 GST | tacotron_gst.py | N/A | link
WaveNet | wavenet_float.py | N/A | N/A
Centaur | centaur_float.py | here | link
The model specification and training parameters can be found in the corresponding config file.
Getting started¶
The current Tacotron 2 implementation supports the LJSpeech dataset and the MAILABS dataset. For more details about the model, including hyperparameters and tips, see Tacotron-2. The current WaveNet implementation only supports LJSpeech.
It is recommended to start with the LJSpeech dataset to familiarize yourself with the data layer.
Get data¶
First, you need to download and extract the dataset into a directory of your choice. The extracted dataset should consist of a metadata.csv file and a directory of wav files. metadata.csv lists all the wav filenames and their corresponding transcripts, delimited by the ‘|’ character.
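As a quick sanity check, you can read the file yourself. The following is a minimal sketch; the path and the exact field order are assumptions based on the standard LJSpeech layout (wav file ID first, transcript last):

```python
# Minimal sketch: inspect metadata.csv from the extracted dataset.
# The path and field order are assumptions (standard LJSpeech layout).
import csv

with open("/data/LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
    for row in reader:
        wav_id, transcript = row[0], row[-1]
        print(wav_id, "->", transcript)
        break  # show only the first entry
```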
Training¶
Both WaveNet and Tacotron 2 can be trained on LJSpeech. To do so:

- Change dataset_location in the config file to point to the directory containing the metadata.csv file.
- Rename metadata.csv to train.csv, as sketched below.
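A couple of lines of Python are enough for the rename; the dataset path below is a hypothetical example:

```python
# Hedged sketch: rename metadata.csv to train.csv so the data layer can
# find the training split. The dataset directory is a hypothetical path.
import os

dataset_dir = "/data/LJSpeech-1.1"
os.rename(os.path.join(dataset_dir, "metadata.csv"),
          os.path.join(dataset_dir, "train.csv"))
```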
To start training Tacotron:
python run.py --config_file=example_configs/text2speech/tacotron_float.py --mode=train
Similarly, to start training WaveNet:
python run.py --config_file=example_configs/text2speech/wavenet_float.py --mode=train
If your GPU does not have enough memory, reduce the batch_size_per_gpu parameter.
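For reference, both settings live in the config file. The fragment below is only an illustrative sketch: the parameter names dataset_location and batch_size_per_gpu come from this guide, while the surrounding structure and values are assumptions.

```python
# Illustrative sketch of the relevant config entries; structure and
# values are assumptions, only the parameter names come from this guide.
dataset_location = "/data/LJSpeech-1.1"  # directory containing train.csv

base_params = {
    "batch_size_per_gpu": 32,  # lower this on GPUs with less memory
}
```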
Inference¶
Once training is done (this can take a while on a single GPU), you can run inference. To do so, first create a csv file named test.csv in the same location as train.csv, with lines in the following format:
UNUSED | UNUSED | This is an example sentence that I want to generate.
You can put as many lines inside the csv as you want. The model will produce one audio sample per line and save it inside your log_dir.
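If you prefer to generate test.csv programmatically, a minimal sketch (the sentence list is just an example) could look like this:

```python
# Minimal sketch: write a test.csv for inference. The first two fields
# are unused by the model, so placeholder text is fine.
sentences = [
    "This is an example sentence that I want to generate.",
    "Speech synthesis is fun.",
]
with open("test.csv", "w", encoding="utf-8") as f:
    for sentence in sentences:
        f.write("UNUSED | UNUSED | {}\n".format(sentence))
```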
Lastly, run
python run.py --config_file=example_configs/text2speech/tacotron_float.py --mode=infer --infer_output_file=unused
For WaveNet, only interactive infer is supported. First, start a jupyter notebook in the root directory and replace the contents of its first cell with the contents of tacotron_save_spec.py. This will save the spectrogram generated by Tacotron as a numpy array in spec.npy.
Next, replace the contents of the first cell with the contents of wavenet_naive_infer.py and re-run the notebook. The generated audio will be saved to result/sample_step0_infer.wav every 1000 steps. Note that this will take some time.
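Before running the WaveNet step, you can sanity-check the intermediate output; this sketch only assumes the spec.npy file name mentioned above:

```python
# Sketch: load and inspect the spectrogram saved by tacotron_save_spec.py.
# The file name spec.npy comes from this guide; the layout comment is an
# assumption (Tacotron mel outputs are typically [time, mel_channels]).
import numpy as np

spec = np.load("spec.npy")
print(spec.shape, spec.dtype)
```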