.. _speech_synthesis:

Speech Synthesis
================

######
Models
######

Currently we support the following models:

.. list-table::
   :widths: 1 3 1 1
   :header-rows: 1

   * - Model description
     - Config file
     - Audio Samples
     - Checkpoint
   * - :doc:`Tacotron-2 <speech-synthesis/tacotron-2>`
     - ``tacotron_float.py``
     - here
     - link
   * - :doc:`Tacotron-2 GST <speech-synthesis/tacotron-2-gst>`
     - ``tacotron_gst.py``
     - N/A
     - link
   * - :doc:`WaveNet <speech-synthesis/wavenet>`
     - ``wavenet_float.py``
     - N/A
     - N/A
   * - :doc:`Centaur <speech-synthesis/centaur>`
     - ``centaur_float.py``
     - here
     - link

The model specification and training parameters can be found in the corresponding config file.

.. toctree::
   :hidden:
   :maxdepth: 1

   speech-synthesis/tacotron-2
   speech-synthesis/tacotron-2-gst
   speech-synthesis/wavenet
   speech-synthesis/centaur

###############
Getting started
###############

The current Tacotron 2 implementation supports the LJSpeech dataset and the MAILABS dataset. For more details about the model, including hyperparameters and tips, see :doc:`Tacotron-2 <speech-synthesis/tacotron-2>`.

The current WaveNet implementation supports only LJSpeech. It is recommended to start with the LJSpeech dataset to familiarize yourself with the data layer.

********
Get data
********

First, you need to download and extract the dataset into a directory of your choice. The extracted dataset should consist of a ``metadata.csv`` file and a directory of wav files. ``metadata.csv`` lists all the wav filenames and their corresponding transcripts, delimited by the ``|`` character.

********
Training
********

Both WaveNet and Tacotron 2 can be trained using LJSpeech. For this:

* change ``dataset_location`` in the config file to point to the directory containing the ``metadata.csv`` file
* rename ``metadata.csv`` to ``train.csv``

(A script covering these preparation steps, including the ``test.csv`` file used for inference below, is sketched at the end of this page.)

To start training Tacotron::

    python run.py --config_file=example_configs/text2speech/tacotron_float.py --mode=train

Similarly, to start training WaveNet::

    python run.py --config_file=example_configs/text2speech/wavenet_float.py --mode=train

If your GPU does not have enough memory, reduce the ``batch_size_per_gpu`` parameter.

*********
Inference
*********

Once training is done (this can take a while on a single GPU), you can run inference. To do so, first create a csv file named ``test.csv`` in the same location as ``train.csv``, with lines in the following format::

    UNUSED | UNUSED | This is an example sentence that I want to generate.

You can put as many lines inside the csv as you want. The model will produce one audio sample per line and save it inside your ``log_dir``. Lastly, run::

    python run.py --config_file=example_configs/text2speech/tacotron_float.py --mode=infer --infer_output_file=unused

For WaveNet, only interactive inference is supported. First, start a Jupyter notebook in the root directory and replace the contents of the first cell with the contents of ``tacotron_save_spec.py``. This will save the spectrogram generated by Tacotron as a numpy array in ``spec.npy``. Next, replace the contents of the first cell with the contents of ``wavenet_naive_infer.py`` and re-run the notebook. The generated audio will be saved to ``result/sample_step0_infer.wav`` every 1000 steps. Note that this will take some time.
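Between the two notebook runs, you can sanity-check the intermediate spectrogram before committing to the (slow) WaveNet step. The snippet below is a minimal sketch: the fact that ``spec.npy`` holds a numpy array comes from the steps above, while the axis interpretation in the comments is an assumption.

.. code-block:: python

    import numpy as np

    # tacotron_save_spec.py (run in the first notebook cell above) saves the
    # spectrogram generated by Tacotron as a numpy array in spec.npy.
    spec = np.load("spec.npy")

    # The exact axis layout (time steps vs. frequency bins) is assumed here;
    # the goal is only to confirm the array exists and is non-degenerate.
    print("spectrogram shape:", spec.shape)
    print("value range: [{}, {}]".format(spec.min(), spec.max()))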
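The preparation steps described under *Get data*, *Training* and *Inference* (renaming ``metadata.csv`` to ``train.csv`` and writing a ``test.csv``) can also be scripted. Below is a minimal sketch assuming the LJSpeech-1.1 layout, i.e. a ``wavs/`` subdirectory and ``<id>|<transcript>|<normalized transcript>`` rows; the dataset path and the sentence list are placeholders.

.. code-block:: python

    import csv
    import os

    dataset_dir = "/path/to/LJSpeech-1.1"  # placeholder: wherever you extracted the dataset

    # Rename metadata.csv to train.csv, as required by the data layer.
    src = os.path.join(dataset_dir, "metadata.csv")
    train_csv = os.path.join(dataset_dir, "train.csv")
    if os.path.exists(src):
        os.rename(src, train_csv)

    # Sanity-check train.csv: fields are '|'-delimited, and the first field names
    # a wav file (stored without extension under wavs/ -- an assumed layout).
    with open(train_csv, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if not row:
                continue
            wav_path = os.path.join(dataset_dir, "wavs", row[0] + ".wav")
            assert os.path.exists(wav_path), "missing wav file: " + wav_path

    # Write test.csv next to train.csv; only the third field is used at inference.
    sentences = ["This is an example sentence that I want to generate."]
    with open(os.path.join(dataset_dir, "test.csv"), "w", encoding="utf-8") as f:
        for s in sentences:
            f.write("UNUSED | UNUSED | " + s + "\n")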
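For the config edits mentioned under *Training*, the relevant entries look roughly like the excerpt below. Only the names ``dataset_location`` and ``batch_size_per_gpu`` come from this guide; the surrounding structure is an assumption and should be checked against the actual ``tacotron_float.py``.

.. code-block:: python

    # Hypothetical excerpt of example_configs/text2speech/tacotron_float.py;
    # only dataset_location and batch_size_per_gpu are named in this guide.
    dataset_location = "/path/to/LJSpeech-1.1/"  # directory containing train.csv

    base_params = {
        # ... model and training parameters from the original config ...
        "batch_size_per_gpu": 32,  # lower this value if your GPU runs out of memory
    }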