Currently we support the following models:
| Model description | Config file | Audio Samples | Checkpoint |
The model specification and training parameters can be found in the corresponding config file.
The current Tacotron 2 implementation supports the LJSpeech dataset and the MAILABS dataset. For more details about the model including hyperparameters and tips, see Tacotron-2. The current WaveNet implementation only supports LJSpeech.
It is recommended to start with the LJSpeech dataset to familiarize yourself with the data layer.
First, you need to download and extract the dataset into a directory of your choice. The extracted files should consist of a metadata.csv file and a directory of wav files. metadata.csv lists all the wav filenames and their corresponding transcripts, delimited by the ‘|’ character.
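To check that the dataset was extracted correctly, the metadata file can be read with a few lines of Python. This is a minimal sketch, not part of the toolkit: the helper name `read_metadata` is hypothetical, and it assumes the first field of each line is the wav filename and the second field is a transcript (any further fields are ignored).

```python
import csv
from pathlib import Path


def read_metadata(dataset_dir):
    """Return a list of (wav_filename, transcript) pairs from metadata.csv.

    Assumption: fields are separated by '|', the first field is the wav
    filename and the second field is a transcript.
    """
    rows = []
    path = Path(dataset_dir) / "metadata.csv"
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside transcripts untouched.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            rows.append((fields[0], fields[1]))
    return rows
```

Calling `read_metadata` on the directory you extracted the dataset into and printing the first few pairs is a quick way to confirm that the filenames and transcripts line up.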
Both WaveNet and Tacotron 2 can be trained using LJSpeech. To do so, set dataset_location in the corresponding config file to point to the directory containing the metadata.csv file.
To start training Tacotron:
python run.py --config_file=example_configs/text2speech/tacotron_float.py --mode=train
Similarly, to start training WaveNet:
python run.py --config_file=example_configs/text2speech/wavenet_float.py --mode=train
If your GPU does not have enough memory, reduce the
Once training is done (this can take a while on a single GPU), you can run
inference. To do so, first create a csv file named test.csv in the same
directory as train.csv with lines in the following format:
UNUSED | UNUSED | This is an example sentence that I want to generate.
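If you have many sentences, a file in this format can be generated with a short script. This is a minimal sketch: the helper name `write_test_csv` is hypothetical, and rejecting sentences that contain ‘|’ is an assumption made here because that character is the field delimiter.

```python
def write_test_csv(path, sentences):
    """Write one 'UNUSED | UNUSED | <sentence>' line per sentence."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            # Collapse newlines and repeated spaces so each sentence
            # stays on a single line of the csv.
            text = " ".join(sentence.split())
            if "|" in text:
                # Assumption: '|' would be read as a field separator.
                raise ValueError("sentence must not contain '|': %r" % text)
            f.write("UNUSED | UNUSED | " + text + "\n")
```

For example, `write_test_csv("test.csv", ["This is an example sentence that I want to generate."])` produces a file containing exactly the line shown above.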
You can put as many lines inside the csv as you want. The model will produce
one audio sample per line and save the audio sample inside your
python run.py --config_file=example_configs/text2speech/tacotron_float.py --mode=infer --infer_output_file=unused
For WaveNet, only interactive inference is supported. First, start a Jupyter notebook in the root
directory and replace the contents of the first box of with tacotron_save_spec.py.
This will save the spectrogram generated by Tacotron as a numpy array in
Next, replace the contents of the first box with wavenet_naive_infer.py
and re-run the notebook. The generated audio will be saved to
every 1000 steps. Note that this will take some time.