text2speech

class data.text2speech.text2speech.Text2SpeechDataLayer(params, model, num_workers=None, worker_id=None)[source]

Bases: open_seq2seq.data.data_layer.DataLayer

Text-to-speech data layer class

__init__(params, model, num_workers=None, worker_id=None)[source]

Text-to-speech data layer constructor.

See parent class for arguments description.

Config parameters (an illustrative config sketch follows this list):

  • dataset (str) — The dataset to use. Currently ‘LJ’ for the LJSpeech 1.1 dataset is supported.
  • num_audio_features (int) — number of audio features to extract.
  • output_type (str) — could be either “magnitude” or “mel”.
  • vocab_file (str) — path to vocabulary file.
  • dataset_files (list) — list with paths to all dataset .csv files. Files are assumed to use “|” as the field separator.
  • dataset_location (str) — path to the directory where the wavs are stored.
  • feature_normalize (bool) — whether to normalize the data with a preset mean and std.
  • feature_normalize_mean (float) — mean used for feature normalization. Defaults to 0.
  • feature_normalize_std (float) — std used for feature normalization. Defaults to 1.
  • mag_power (int) — the power to which the magnitude spectrogram is raised: 1 for an energy spectrogram, 2 for a power spectrogram. Defaults to 2.
  • pad_EOS (bool) — whether to apply EOS tokens to both the text and the speech signal. Will pad at least 1 token regardless of pad_to value. Defaults to True.
  • pad_value (float) — The value we pad the spectrogram with. Defaults to np.log(data_min).
  • pad_to (int) — pad the data such that its length is a multiple of pad_to. Defaults to 8.
  • trim (bool) — Whether to trim silence via librosa or not. Defaults to False.
  • data_min (float) — min clip value prior to taking the log. Defaults to 1e-5. Please change to 1e-2 if using htk mels.
  • duration_min (int) — Minimum duration in steps for speech signal. All signals less than this will be cut from the training set. Defaults to 0.
  • duration_max (int) — Maximum duration in steps for speech signal. All signals greater than this will be cut from the training set. Defaults to 4000.
  • mel_type (str) — One of [‘slaney’, ‘htk’]. Decides which algorithm to use to compute mel specs. Defaults to htk.
  • style_input (str) — Can be either None or “wav”. Must be set to “wav” for GST. Defaults to None.
  • n_samples_train (int) — number of the shortest examples to use for training.
  • n_samples_eval (int) — number of the shortest examples to use for evaluation.
  • n_fft (int) — FFT window size.
  • fmax (float) — highest frequency to use.
  • max_normalization (bool) — whether to divide the final audio signal by its absolute maximum.
  • use_cache (bool) — whether to use cache.
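
For orientation, here is a minimal sketch of how these parameters might be assembled into a params dict. The paths and values below are placeholders for illustration, not recommended settings.

    # Illustrative params for Text2SpeechDataLayer; paths/values are placeholders.
    data_layer_params = {
        "dataset": "LJ",                          # LJSpeech 1.1
        "num_audio_features": 80,
        "output_type": "mel",
        "vocab_file": "path/to/vocab.txt",        # placeholder path
        "dataset_files": ["path/to/train.csv"],   # '|'-separated csv files
        "dataset_location": "path/to/wavs",
        "feature_normalize": False,
        "mag_power": 2,                           # power spectrogram
        "pad_EOS": True,
        "pad_to": 8,                              # pad length to a multiple of 8
        "trim": False,
        "data_min": 1e-2,                         # 1e-2 suggested for htk mels
        "mel_type": "htk",
    }
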
_parse_audio_transcript_element(element)[source]

Parses tf.data element from TextLineDataset into audio and text.

Parameters:element – tf.data element from TextLineDataset.
Returns:text_input as np.array of ids, text_input length, target audio features as np.array, stop token targets as np.array, and length of the target sequence.
Return type:tuple
_parse_transcript_element(transcript)[source]

Parses text from file and returns array of text features.

Parameters:transcript – the string to parse.
Returns:target text as np.array of ids, target text length.
Return type:tuple
build_graph()[source]

Here all TensorFlow graph construction should happen.

create_feed_dict(model_in)[source]

Creates the feed dict for interactive infer

Parameters:model_in (str) – The string to be spoken.
Returns:Dictionary with values for the placeholders.
Return type:feed_dict (dict)
create_interactive_placeholders()[source]

A function that must be defined for data layers that support interactive infer. This function is intended to create the placeholders that populate self._input_tensors, which are then passed to the model.

get_magnitude_spec(spectrogram, is_mel=False)[source]

Returns an energy magnitude spectrogram. The processing depends on the data layer params.

Parameters:spectrogram – output spec from model
Returns:mag spec
Return type:mag_spec
static get_optional_params()[source]

Static method with description of optional parameters.

Returns:Dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type:dict
static get_required_params()[source]

Static method with description of required parameters.

Returns:Dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type:dict
get_size_in_samples()[source]

Returns the number of audio files.

input_tensors

Dictionary containing input tensors. This dictionary has to define the following keys: source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). When self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.
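
As a rough illustration of the structure described above (the tensor names and shapes here are hypothetical; only the dictionary keys are prescribed):

    import tensorflow as tf

    # Hypothetical tensor names and shapes; only the keys "source_tensors"
    # and "target_tensors" are prescribed by the data layer contract.
    text_ids = tf.zeros([8, 120], dtype=tf.int32)   # [batch, max_text_len]
    text_len = tf.fill([8], 120)
    spec = tf.zeros([8, 400, 80])                   # [batch, time, num_audio_features]
    stop_target = tf.zeros([8, 400])
    spec_len = tf.fill([8], 400)

    input_tensors = {
        "source_tensors": [text_ids, text_len],           # encoder inputs
        "target_tensors": [spec, stop_target, spec_len],  # omitted when mode == "infer"
    }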

iterator

tf.data.Dataset iterator. Should be created by self.build_graph().

n_fft
parse_text_output(text)[source]
sampling_rate
split_data(data)[source]

text2speech_wavenet

class data.text2speech.text2speech_wavenet.WavenetDataLayer(params, model, num_workers=None, worker_id=None)[source]

Bases: open_seq2seq.data.data_layer.DataLayer

Text-to-speech data layer class for WaveNet

__init__(params, model, num_workers=None, worker_id=None)[source]

Wavenet data layer constructor.

See parent class for arguments description.

Config parameters (an illustrative sketch follows this list):

  • num_audio_features (int) — number of spectrogram audio features
  • dataset_files (list) — list with paths to all dataset .csv files
  • dataset_location (str) — string with path to directory where wavs are stored
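
A minimal illustrative params dict; the paths below are placeholders.

    # Illustrative params for WavenetDataLayer; paths are placeholders.
    wavenet_data_params = {
        "num_audio_features": 80,                 # spectrogram features
        "dataset_files": ["path/to/train.csv"],
        "dataset_location": "path/to/wavs",
    }
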
_parse_audio_element(element)[source]

Parses tf.data element from TextLineDataset into audio.

build_graph()[source]

Builds the data reading graph.

create_feed_dict(model_in)[source]

Creates the feed dict for interactive infer using a spectrogram

Parameters:model_in – tuple containing the source audio, the length of the source, the conditioning spectrogram, the length of the spectrogram, and the index of the receptive field window
create_interactive_placeholders()[source]

A function that must be defined for data layers that support interactive infer. This function is intended to create the placeholders that populate self._input_tensors, which are then passed to the model.

static get_optional_params()[source]

Static method with description of optional parameters.

Returns:Dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type:dict
static get_required_params()[source]

Static method with description of required parameters.

Returns:Dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type:dict
get_size_in_samples()[source]

Should return the dataset size in samples. That is, the number of objects in the dataset. This method is used to calculate a valid epoch size. If this method is not defined, you will need to make sure that your dataset for evaluation is created only for one epoch. You will also not be able to use num_epochs parameter in the base config.

Returns:dataset size in samples.
Return type:int
input_tensors

Dictionary containing input tensors. This dictionary has to define the following keys: source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). When self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.

iterator

tf.data.Dataset iterator. Should be created by self.build_graph().

split_data(data)[source]

speech_utils

data.text2speech.speech_utils.denormalize(features, mean, std)[source]

Denormalizes features with the specified mean and std (the inverse of normalize).

data.text2speech.speech_utils.get_mel(log_mag_spec, fs=22050, n_fft=1024, n_mels=80, power=2.0, feature_normalize=False, mean=0, std=1, mel_basis=None, data_min=1e-05, htk=True, norm=None)[source]

Method to get mel spectrograms from magnitude spectrograms

Parameters:
  • log_mag_spec (np.array) – log of the magnitude spec
  • fs (int) – sampling frequency in Hz
  • n_fft (int) – size of fft window in samples
  • n_mels (int) – number of mel features
  • power (float) – power of the mag spectrogram
  • feature_normalize (bool) – whether the mag spec was normalized
  • mean (float) – normalization param of mag spec
  • std (float) – normalization param of mag spec
  • mel_basis (np.array) – optional pre-computed mel basis; saves computation if passed. If not passed, librosa is called to construct one
  • data_min (float) – min clip value prior to taking the log.
  • htk (bool) – whether to compute the mel spec with the htk or slaney algorithm
  • norm – should be None for htk, and 1 for slaney
Returns:

mel_spec with shape [time, n_mels]

Return type:

np.array
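
The transform described above amounts to exponentiating the log magnitude spectrogram, raising it to power, projecting onto a mel basis, and re-applying the log. A rough numpy sketch of that documented behavior (not the library source):

    import numpy as np
    import librosa

    # Rough sketch of the documented mag->mel transform; not the library source.
    def get_mel_sketch(log_mag_spec, fs=22050, n_fft=1024, n_mels=80,
                       power=2.0, data_min=1e-5, htk=True, norm=None):
        mel_basis = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels,
                                        htk=htk, norm=norm)  # [n_mels, 1 + n_fft/2]
        mag_spec = np.exp(log_mag_spec) ** power             # undo log, apply power
        mel_spec = np.dot(mag_spec, mel_basis.T)             # [time, n_mels]
        return np.log(np.clip(mel_spec, a_min=data_min, a_max=None))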

data.text2speech.speech_utils.get_speech_features(signal, fs, num_features, features_type='magnitude', n_fft=1024, hop_length=256, mag_power=2, feature_normalize=False, mean=0.0, std=1.0, data_min=1e-05, mel_basis=None)[source]

Helper function to retrieve spectrograms from a loaded wav signal

Parameters:
  • signal – signal loaded with librosa.
  • fs (int) – sampling frequency in Hz.
  • num_features (int) – number of speech features in frequency domain.
  • features_type (string) – ‘magnitude’ or ‘mel’.
  • n_fft (int) – size of analysis window in samples.
  • hop_length (int) – stride of analysis window in samples.
  • mag_power (int) – power to which magnitude spectrograms are raised (prior to the dot product with the mel basis): 1 for energy spectrograms, 2 for power spectrograms
  • feature_normalize (bool) – whether to normalize the data with mean and std
  • mean (float) – if normalize is enabled, the mean to normalize to
  • std (float) – if normalize is enabled, the deviation to normalize to
  • data_min (float) – min clip value prior to taking the log.
Returns:

np.array of audio features with shape=[num_time_steps, num_features].

Return type:

np.array
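
Hypothetical usage with a librosa-loaded signal; the import path is assumed from the module layout above and the wav path is a placeholder.

    import librosa
    from open_seq2seq.data.text2speech.speech_utils import get_speech_features

    # Placeholder wav path; resampled to 22050 Hz on load.
    signal, fs = librosa.load("sample.wav", sr=22050)
    mel = get_speech_features(signal, fs, num_features=80, features_type="mel",
                              n_fft=1024, hop_length=256, mag_power=2)
    print(mel.shape)  # (num_time_steps, 80)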

data.text2speech.speech_utils.get_speech_features_from_file(filename, num_features, features_type='magnitude', n_fft=1024, hop_length=None, mag_power=2, feature_normalize=False, mean=0.0, std=1.0, trim=False, data_min=1e-05, return_raw_audio=False, return_audio_duration=False, augmentation=None, mel_basis=None)[source]

Helper function to retrieve spectrograms from wav files

Parameters:
  • filename (string) – WAVE filename.
  • num_features (int) – number of speech features in frequency domain.
  • features_type (string) – ‘magnitude’ or ‘mel’.
  • n_fft (int) – size of analysis window in samples.
  • hop_length (int) – stride of analysis window in samples.
  • mag_power (int) – power to which magnitude spectrograms are raised (prior to the dot product with the mel basis): 1 for energy spectrograms, 2 for power spectrograms
  • feature_normalize (bool) – whether to normalize the data with mean and std
  • mean (float) – if normalize is enabled, the mean to normalize to
  • std (float) – if normalize is enabled, the deviation to normalize to
  • trim (bool) – Whether to trim silence via librosa or not
  • data_min (float) – min clip value prior to taking the log.
Returns:

np.array of audio features with shape=[num_time_steps, num_features].

Return type:

np.array
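
Hypothetical usage (import path assumed as above; the filename is a placeholder):

    from open_seq2seq.data.text2speech.speech_utils import get_speech_features_from_file

    # Placeholder filename; returns features shaped [num_time_steps, num_features].
    features = get_speech_features_from_file("sample.wav", num_features=80,
                                             features_type="mel", trim=False)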

data.text2speech.speech_utils.inverse_mel(log_mel_spec, fs=22050, n_fft=1024, n_mels=80, power=2.0, feature_normalize=False, mean=0, std=1, mel_basis=None, htk=True, norm=None)[source]

Reconstructs magnitude spectrogram from a mel spectrogram by multiplying it with the transposed mel basis.

Parameters:
  • log_mel_spec (np.array) – log of the mel spec
  • fs (int) – sampling frequency in Hz
  • n_fft (int) – size of fft window in samples
  • n_mels (int) – number of mel features
  • power (float) – power of the mag spectrogram that was used to generate the mel spec
  • feature_normalize (bool) – whether the mel spec was normalized
  • mean (float) – normalization param of mel spec
  • std (float) – normalization param of mel spec
  • mel_basis (np.array) – optional pre-computed mel basis; saves computation if passed. If not passed, librosa is called to construct one
  • htk (bool) – whether to compute the mel spec with the htk or slaney algorithm
  • norm – should be None for htk, and 1 for slaney
Returns:

mag_spec with shape [time, n_fft/2 + 1]

Return type:

np.array
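
The reconstruction described above is a projection back through the transposed mel basis; a rough numpy sketch of that documented behavior (not the library source):

    import numpy as np
    import librosa

    # Rough sketch of the documented mel->mag reconstruction; not the library source.
    def inverse_mel_sketch(log_mel_spec, fs=22050, n_fft=1024, n_mels=80,
                           htk=True, norm=None):
        mel_basis = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels,
                                        htk=htk, norm=norm)  # [n_mels, 1 + n_fft/2]
        mel_spec = np.exp(log_mel_spec)                      # [time, n_mels]
        return np.dot(mel_spec, mel_basis)                   # [time, n_fft/2 + 1]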

data.text2speech.speech_utils.normalize(features, mean, std)[source]

Normalizes features with the specified mean and std.
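
A minimal sketch of the normalize/denormalize pair, assuming the standard (x - mean) / std convention (the exact arithmetic is an assumption, not taken from the source):

    import numpy as np

    # Assumed convention: normalize and denormalize are exact inverses.
    def normalize_sketch(features, mean, std):
        return (features - mean) / std

    def denormalize_sketch(features, mean, std):
        return features * std + mean

    x = np.random.randn(10, 80)
    assert np.allclose(denormalize_sketch(normalize_sketch(x, 0.5, 2.0), 0.5, 2.0), x)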