text2speech
class data.text2speech.text2speech.Text2SpeechDataLayer(params, model, num_workers=None, worker_id=None)
Bases: open_seq2seq.data.data_layer.DataLayer
Text-to-speech data layer class.
__init__(params, model, num_workers=None, worker_id=None)
Text-to-speech data layer constructor. See parent class for arguments description.
Config parameters (an illustrative config sketch follows this list):
- dataset (str) — The dataset to use. Currently ‘LJ’ for the LJSpeech 1.1 dataset is supported.
- num_audio_features (int) — number of audio features to extract.
- output_type (str) — could be either “magnitude”, or “mel”.
- vocab_file (str) — path to vocabulary file.
- dataset_files (list) — list with paths to all dataset .csv files. Files are assumed to use “|” as the field separator.
- dataset_location (string) — string with path to directory where wavs are stored.
- feature_normalize (bool) — whether to normalize the data with a preset mean and std.
- feature_normalize_mean (float) — mean used for feature normalization. Defaults to 0.
- feature_normalize_std (float) — standard deviation used for feature normalization. Defaults to 1.
- mag_power (int) — the power to which the magnitude spectrogram is raised: 1 for an energy spectrogram, 2 for a power spectrogram. Defaults to 2.
- pad_EOS (bool) — whether to apply EOS tokens to both the text and the speech signal. Will pad at least 1 token regardless of pad_to value. Defaults to True.
- pad_value (float) — the value the spectrogram is padded with. Defaults to np.log(data_min).
- pad_to (int) — pad such that the length of the resulting datapoint is a multiple of pad_to. Defaults to 8.
- trim (bool) — Whether to trim silence via librosa or not. Defaults to False.
- data_min (float) — min clip value prior to taking the log. Defaults to 1e-5. Please change to 1e-2 if using htk mels.
- duration_min (int) — Minimum duration in steps for speech signal. All signals less than this will be cut from the training set. Defaults to 0.
- duration_max (int) — Maximum duration in steps for speech signal. All signals greater than this will be cut from the training set. Defaults to 4000.
- mel_type (str) — One of [‘slaney’, ‘htk’]. Decides which algorithm to use to compute mel specs. Defaults to htk.
- style_input (str) — Can be either None or “wav”. Must be set to “wav” for GST. Defaults to None.
- n_samples_train (int) — number of the shortest examples to use for training.
- n_samples_eval (int) — number of the shortest examples to use for evaluation.
- n_fft (int) — FFT window size.
- fmax (float) — highest frequency to use.
- max_normalization (bool) — whether to divide the final audio signal by its absolute maximum.
- use_cache (bool) — whether to use cache.
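A minimal, hypothetical config sketch built only from the parameters documented above; every path and value below is a placeholder, not a recommendation, and a full model config requires more than this data layer section.

# Hypothetical Text2SpeechDataLayer config sketch; paths are placeholders.
data_layer_params = {
    "dataset": "LJ",
    "num_audio_features": 80,
    "output_type": "mel",
    "vocab_file": "path/to/vocab.txt",
    "dataset_files": ["path/to/train.csv"],   # "|"-separated csv files
    "dataset_location": "path/to/wavs",
    "feature_normalize": False,
    "mag_power": 2,
    "pad_EOS": True,
    "pad_to": 8,
    "trim": False,
    "data_min": 1e-2,    # 1e-2 is suggested above when using htk mels
    "mel_type": "htk",
}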
_parse_audio_transcript_element(element)
Parses tf.data element from TextLineDataset into audio and text.
Parameters: element – tf.data element from TextLineDataset.
Returns: text_input text as np.array of ids, text_input length, target audio features as np.array, stop token targets as np.array, length of target sequence.
Return type: tuple
_parse_transcript_element(transcript)
Parses text from file and returns array of text features.
Parameters: transcript – the string to parse.
Returns: target text as np.array of ids, target text length.
Return type: tuple
create_feed_dict(model_in)
Creates the feed dict for interactive infer.
Parameters: model_in (str) – the string to be spoken.
Returns: dictionary with values for the placeholders.
Return type: dict
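A hedged usage sketch of the interactive-infer path; data_layer is an already-constructed Text2SpeechDataLayer instance, and the session and fetches come from the surrounding model or runner code, which is not shown here.

# Hypothetical interactive-infer usage; names other than the two documented
# methods are placeholders.
data_layer.create_interactive_placeholders()
feed_dict = data_layer.create_feed_dict("Testing text to speech synthesis.")
# outputs = sess.run(fetches, feed_dict=feed_dict)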
create_interactive_placeholders()
A function that must be defined for data layers that support interactive infer. It creates the placeholders that are stored in self._input_tensors and later passed to the model.
get_magnitude_spec(spectrogram, is_mel=False)
Returns an energy magnitude spectrogram. The processing depends on the data layer params.
Parameters: spectrogram – output spec from model.
Returns: magnitude spectrogram (mag_spec).
static get_optional_params()
Static method with description of optional parameters.
Returns: dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type: dict
static get_required_params()
Static method with description of required parameters.
Returns: dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type: dict
input_tensors
Dictionary containing input tensors. This dictionary has to define the following keys: source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). When self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.
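An illustrative sketch of the dictionary shape this property is documented to return; the tensor names, dtypes, and shapes below are invented for readability and are not the layer's real tensors.

import tensorflow as tf

# Illustrative only: placeholders stand in for the tensors built in build_graph().
tf.compat.v1.disable_eager_execution()

text_ids = tf.compat.v1.placeholder(tf.int32, [None, None], name="text_ids")
text_len = tf.compat.v1.placeholder(tf.int32, [None], name="text_len")
spec = tf.compat.v1.placeholder(tf.float32, [None, None, 80], name="spec")
stop_tokens = tf.compat.v1.placeholder(tf.float32, [None, None], name="stop_tokens")
spec_len = tf.compat.v1.placeholder(tf.int32, [None], name="spec_len")

input_tensors = {
    "source_tensors": [text_ids, text_len],            # passed to the encoder
    "target_tensors": [spec, stop_tokens, spec_len],   # only when mode != "infer"
}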
iterator
tf.data.Dataset iterator. Should be created by self.build_graph().
n_fft
sampling_rate
text2speech_wavenet
class data.text2speech.text2speech_wavenet.WavenetDataLayer(params, model, num_workers=None, worker_id=None)
Bases: open_seq2seq.data.data_layer.DataLayer
Text-to-speech data layer class for Wavenet.
__init__(params, model, num_workers=None, worker_id=None)
Wavenet data layer constructor. See parent class for arguments description.
Config parameters (an illustrative config sketch follows this list):
- num_audio_features (int) — number of spectrogram audio features.
- dataset_files (list) — list with paths to all dataset .csv files.
- dataset_location (str) — path to the directory where wavs are stored.
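A minimal, hypothetical WavenetDataLayer config sketch using only the three parameters documented above; paths are placeholders.

# Hypothetical WavenetDataLayer config; paths are placeholders.
data_layer_params = {
    "num_audio_features": 80,
    "dataset_files": ["path/to/train.csv"],
    "dataset_location": "path/to/wavs",
}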
create_feed_dict(model_in)
Creates the feed dict for interactive infer using a spectrogram.
Parameters: model_in – tuple containing the source audio, the length of the source, the conditioning spectrogram, the length of the spectrogram, and the index of the receptive field window.
create_interactive_placeholders()
A function that must be defined for data layers that support interactive infer. It creates the placeholders that are stored in self._input_tensors and later passed to the model.
static get_optional_params()
Static method with description of optional parameters.
Returns: dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type: dict
static get_required_params()
Static method with description of required parameters.
Returns: dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type: dict
get_size_in_samples()
Should return the dataset size in samples, that is, the number of objects in the dataset. This method is used to calculate a valid epoch size. If this method is not defined, you will need to make sure that your dataset for evaluation is created only for one epoch. You will also not be able to use the num_epochs parameter in the base config.
Returns: dataset size in samples.
Return type: int
input_tensors
Dictionary containing input tensors. This dictionary has to define the following keys: source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). When self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.
iterator
tf.data.Dataset iterator. Should be created by self.build_graph().
speech_utils
data.text2speech.speech_utils.denormalize(features, mean, std)
Denormalizes features with the specified mean and std.
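A minimal sketch of what denormalize is expected to do, assuming it inverts the usual (features - mean) / std normalization; the real helper may differ in details.

import numpy as np

# Sketch only: invert (features - mean) / std.
def denormalize_sketch(features, mean, std):
    return np.asarray(features) * std + mean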
data.text2speech.speech_utils.get_mel(log_mag_spec, fs=22050, n_fft=1024, n_mels=80, power=2.0, feature_normalize=False, mean=0, std=1, mel_basis=None, data_min=1e-05, htk=True, norm=None)
Method to get mel spectrograms from magnitude spectrograms.
Parameters:
- log_mag_spec (np.array) – log of the magnitude spec
- fs (int) – sampling frequency in Hz
- n_fft (int) – size of fft window in samples
- n_mels (int) – number of mel features
- power (float) – power of the mag spectrogram
- feature_normalize (bool) – whether the mag spec was normalized
- mean (float) – normalization param of mag spec
- std (float) – normalization param of mag spec
- mel_basis (np.array) – optional pre-computed mel basis to save computational time if passed. If not passed, it will call librosa to construct one
- data_min (float) – min clip value prior to taking the log.
- htk (bool) – whether to compute the mel spec with the htk or slaney algorithm
- norm – Should be None for htk, and 1 for slaney
Returns: mel_spec with shape [time, n_mels]
Return type: np.array
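A hedged sketch of the conversion described above, assuming the input is a log magnitude spectrogram shaped [time, n_fft/2 + 1]; feature normalization is omitted and the real function may order its steps differently.

import librosa
import numpy as np

def get_mel_sketch(log_mag_spec, fs=22050, n_fft=1024, n_mels=80, power=2.0,
                   data_min=1e-5, htk=True, norm=None, mel_basis=None):
    # Build (or reuse) a mel basis of shape [n_mels, n_fft // 2 + 1].
    if mel_basis is None:
        mel_basis = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels,
                                        htk=htk, norm=norm)
    mag_spec = np.exp(log_mag_spec) ** power     # undo the log, apply the power
    mel_spec = np.dot(mag_spec, mel_basis.T)     # project onto mel bands: [time, n_mels]
    return np.log(np.clip(mel_spec, a_min=data_min, a_max=None))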
data.text2speech.speech_utils.get_speech_features(signal, fs, num_features, features_type='magnitude', n_fft=1024, hop_length=256, mag_power=2, feature_normalize=False, mean=0.0, std=1.0, data_min=1e-05, mel_basis=None)
Helper function to retrieve spectrograms from a loaded wav.
Parameters:
- signal – signal loaded with librosa.
- fs (int) – sampling frequency in Hz.
- num_features (int) – number of speech features in frequency domain.
- features_type (string) – ‘magnitude’ or ‘mel’.
- n_fft (int) – size of analysis window in samples.
- hop_length (int) – stride of analysis window in samples.
- mag_power (int) – power to raise magnitude spectrograms to (prior to dot product with mel basis): 1 for energy spectrograms, 2 for power spectrograms
- feature_normalize (bool) – whether to normalize the data with mean and std
- mean (float) – if normalize is enabled, the mean to normalize to
- std (float) – if normalize is enabled, the deviation to normalize to
- data_min (float) – min clip value prior to taking the log.
Returns: np.array of audio features with shape=[num_time_steps, num_features].
Return type: np.array
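A hedged sketch of the pipeline described above, built from librosa primitives: STFT magnitude raised to mag_power, optional projection onto a mel basis, then a clipped log. The actual helper handles more options (normalization, a precomputed mel basis), so treat this as an illustration only.

import librosa
import numpy as np

def speech_features_sketch(signal, fs, num_features, features_type="mel",
                           n_fft=1024, hop_length=256, mag_power=2, data_min=1e-5):
    # Magnitude spectrogram raised to mag_power: [n_fft // 2 + 1, time]
    mag = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)) ** mag_power
    if features_type == "mel":
        mel_basis = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=num_features)
        features = np.dot(mel_basis, mag)        # [num_features, time]
    else:  # "magnitude"
        features = mag[:num_features, :]
    return np.log(np.clip(features, a_min=data_min, a_max=None)).T  # [time, num_features]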
data.text2speech.speech_utils.get_speech_features_from_file(filename, num_features, features_type='magnitude', n_fft=1024, hop_length=None, mag_power=2, feature_normalize=False, mean=0.0, std=1.0, trim=False, data_min=1e-05, return_raw_audio=False, return_audio_duration=False, augmentation=None, mel_basis=None)
Helper function to retrieve spectrograms from wav files.
Parameters:
- filename (string) – WAVE filename.
- num_features (int) – number of speech features in frequency domain.
- features_type (string) – ‘magnitude’ or ‘mel’.
- n_fft (int) – size of analysis window in samples.
- hop_length (int) – stride of analysis window in samples.
- mag_power (int) – power to raise magnitude spectrograms to (prior to dot product with mel basis): 1 for energy spectrograms, 2 for power spectrograms
- feature_normalize (bool) – whether to normalize the data with mean and std
- mean (float) – if normalize is enabled, the mean to normalize to
- std (float) – if normalize is enabled, the deviation to normalize to
- trim (bool) – Whether to trim silence via librosa or not
- data_min (float) – min clip value prior to taking the log.
Returns: np.array of audio features with shape=[num_time_steps, num_features].
Return type: np.array
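A hedged usage example; it assumes the module is importable as open_seq2seq.data.text2speech.speech_utils, and the wav path is a placeholder.

from open_seq2seq.data.text2speech.speech_utils import get_speech_features_from_file

# Placeholder path; any LJSpeech-style wav is processed the same way.
mel = get_speech_features_from_file(
    "path/to/sample.wav",
    num_features=80,
    features_type="mel",
    n_fft=1024,
    mag_power=2,
    trim=False,
)
print(mel.shape)  # expected: (num_time_steps, 80)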
data.text2speech.speech_utils.inverse_mel(log_mel_spec, fs=22050, n_fft=1024, n_mels=80, power=2.0, feature_normalize=False, mean=0, std=1, mel_basis=None, htk=True, norm=None)
Reconstructs magnitude spectrogram from a mel spectrogram by multiplying it with the transposed mel basis.
Parameters:
- log_mel_spec (np.array) – log of the mel spec
- fs (int) – sampling frequency in Hz
- n_fft (int) – size of fft window in samples
- n_mels (int) – number of mel features
- power (float) – power of the mag spectrogram that was used to generate the mel spec
- feature_normalize (bool) – whether the mel spec was normalized
- mean (float) – normalization param of mel spec
- std (float) – normalization param of mel spec
- mel_basis (np.array) – optional pre-computed mel basis to save computational time if passed. If not passed, it will call librosa to construct one
- htk (bool) – whether to compute the mel spec with the htk or slaney algorithm
- norm – Should be None for htk, and 1 for slaney
Returns: mag_spec with shape [time, n_fft/2 + 1]
Return type: np.array
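A minimal sketch of the pseudo-inverse described above, assuming the input is a log mel spectrogram shaped [time, n_mels]; feature normalization and the power handling are omitted, so this is an illustration rather than the library's exact code.

import librosa
import numpy as np

def inverse_mel_sketch(log_mel_spec, fs=22050, n_fft=1024, n_mels=80,
                       htk=True, norm=None, mel_basis=None):
    if mel_basis is None:
        mel_basis = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels,
                                        htk=htk, norm=norm)
    mel_spec = np.exp(log_mel_spec)           # undo the log
    mag_spec = np.dot(mel_spec, mel_basis)    # [time, n_fft // 2 + 1]
    return mag_spec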