Speech Synthesis

Models

Currently we support the following models:

Model description    Config file          Audio Samples    Checkpoint
Tacotron-2           tacotron_float.py    here             link
Tacotron-2 GST       tacotron_gst.py      N/A              link
WaveNet              wavenet_float.py     N/A              N/A
Centaur              centaur_float.py     here             link

The model specification and training parameters can be found in the corresponding config file.

Getting started

The current Tacotron 2 implementation supports the LJSpeech dataset and the MAILABS dataset. For more details about the model, including hyperparameters and tips, see Tacotron-2. The current WaveNet implementation only supports LJSpeech.

It is recommended to start with the LJSpeech dataset to familiarize yourself with the data layer.

Get data

First, you need to download and extract the dataset into a directory of your choice. The extracted archive should contain a metadata.csv file and a directory of wav files. metadata.csv lists all the wav filenames and their corresponding transcripts, delimited by the ‘|’ character.
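
For a quick sanity check of this layout, you can print the first few filename/transcript pairs. The snippet below is only an illustrative sketch; the dataset path is a placeholder for wherever you extracted the data.

import os

dataset_dir = "/path/to/LJSpeech-1.1"  # placeholder; point this at the extracted dataset

with open(os.path.join(dataset_dir, "metadata.csv"), encoding="utf-8") as f:
    for i, line in enumerate(f):
        # Each line is '|'-delimited: wav filename (without extension), then the transcript(s).
        fields = line.rstrip("\n").split("|")
        wav_name, transcript = fields[0], fields[-1]
        print(wav_name, "->", transcript)
        if i == 2:  # only show the first few entries
            break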

Training

Both WaveNet and Tacotron 2 can be trained on LJSpeech. To do so:

  • change dataset_location in the config file to point to the directory containing the metadata.csv file (see the sketch after this list)
  • rename metadata.csv to train.csv
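
The data layer parameters live in the config file itself. As a minimal sketch of what the relevant section looks like (the key names below are assumptions; check example_configs/text2speech/tacotron_float.py for the actual structure):

# Illustrative sketch only -- verify the key names against the real config.
dataset_location = "/path/to/LJSpeech-1.1"  # directory containing train.csv

train_params = {
    "data_layer_params": {
        "dataset_location": dataset_location,
        "dataset_files": [dataset_location + "/train.csv"],  # the renamed metadata.csv
    },
}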

To start training Tacotron:

python run.py --config_file=example_configs/text2speech/tacotron_float.py --mode=train

Similarly, to start training WaveNet:

python run.py --config_file=example_configs/text2speech/wavenet_float.py --mode=train

If your GPU does not have enough memory, reduce the batch_size_per_gpu parameter.
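
The batch size is set in the config file as well. A minimal sketch, assuming the usual base_params dictionary (verify the exact location in the config you are using):

# In example_configs/text2speech/tacotron_float.py (sketch; other keys omitted).
base_params = {
    # ...
    "batch_size_per_gpu": 32,  # lower this (e.g. to 16 or 8) if you hit out-of-memory errors
}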

Inference

Once training is done (this can take a while on a single GPU), you can run inference. To do so, first create a csv file named test.csv in the same location as train.csv, with lines in the following format:

UNUSED | UNUSED | This is an example sentence that I want to generate.
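
If you prefer to generate test.csv programmatically, a minimal sketch (the output path and the sentences are placeholders) is:

# Write test.csv next to train.csv; only the last '|'-delimited field is used.
sentences = [
    "This is an example sentence that I want to generate.",
    "Another sentence to synthesize.",
]

with open("/path/to/LJSpeech-1.1/test.csv", "w", encoding="utf-8") as f:
    for s in sentences:
        f.write("UNUSED | UNUSED | {}\n".format(s))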

You can put as many lines in the csv as you want. The model will produce one audio sample per line and save it inside your log_dir. Lastly, run:

python run.py --config_file=example_configs/text2speech/tacotron_float.py --mode=infer --infer_output_file=unused

For WaveNet, only interactive infer is supported. First, start a jupyter notebook in the root directory and replace the contents of the first cell with the contents of tacotron_save_spec.py. This will save the spectrogram generated by Tacotron as a numpy array in spec.npy. Next, replace the contents of the first cell with wavenet_naive_infer.py and re-run the notebook. The generated audio will be saved to result/sample_step0_infer.wav every 1000 steps. Note that this will take some time.
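
Between these two steps, you can sanity-check the spectrogram that Tacotron saved; spec.npy is a plain numpy array:

import numpy as np

# Load the spectrogram written by the first notebook step.
spec = np.load("spec.npy")
print(spec.shape, spec.dtype)  # time steps x frequency bins; exact shape depends on the utterance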