text2speech
class data.text2speech.text2speech.Text2SpeechDataLayer(params, model, num_workers=None, worker_id=None)
Bases: open_seq2seq.data.data_layer.DataLayer
Text-to-speech data layer class.
__init__(params, model, num_workers=None, worker_id=None)
Text-to-speech data layer constructor. See parent class for arguments description.
Config parameters (an illustrative config sketch follows this list):
- dataset (str) — The dataset to use. Currently ‘LJ’ for the LJSpeech 1.1 dataset is supported.
- num_audio_features (int) — number of audio features to extract.
- output_type (str) — could be either “magnitude”, or “mel”.
- vocab_file (str) — path to vocabulary file.
- dataset_files (list) — list with paths to all dataset .csv files. Files are assumed to use “|” as the field separator.
- dataset_location (string) — string with path to directory where wavs are stored.
- feature_normalize (bool) — whether to normalize the data with a preset mean and std.
- feature_normalize_mean (float) — mean used for feature normalization. Defaults to 0.
- feature_normalize_std (float) — standard deviation used for feature normalization. Defaults to 1.
- mag_power (int) — the power to which the magnitude spectrogram is raised: 1 for an energy spectrogram, 2 for a power spectrogram. Defaults to 2.
- pad_EOS (bool) — whether to apply EOS tokens to both the text and the speech signal. Will pad at least 1 token regardless of pad_to value. Defaults to True.
- pad_value (float) — the value the spectrogram is padded with. Defaults to np.log(data_min).
- pad_to (int) — pad such that the length of the resulting datapoint is a multiple of pad_to. Defaults to 8.
- trim (bool) — Whether to trim silence via librosa or not. Defaults to False.
- data_min (float) — min clip value prior to taking the log. Defaults to 1e-5. Please change to 1e-2 if using htk mels.
- duration_min (int) — Minimum duration in steps for speech signal. All signals less than this will be cut from the training set. Defaults to 0.
- duration_max (int) — Maximum duration in steps for speech signal. All signals greater than this will be cut from the training set. Defaults to 4000.
- mel_type (str) — One of [‘slaney’, ‘htk’]. Decides which algorithm to use to compute mel specs. Defaults to htk.
- style_input (str) — Can be either None or “wav”. Must be set to “wav” for GST. Defaults to None.
- n_samples_train (int) — number of the shortest examples to use for training.
- n_samples_eval (int) — number of the shortest examples to use for evaluation.
- n_fft (int) — FFT window size.
- fmax (float) — highest frequency to use.
- max_normalization (bool) — whether to divide the final audio signal by its absolute maximum.
- use_cache (bool) — whether to use cache.
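A minimal, hypothetical config sketch built only from the parameters documented above; every path and value below is a placeholder, not a recommendation, and a full model config requires more than this data layer section.

# Hypothetical Text2SpeechDataLayer config sketch; paths are placeholders.
data_layer_params = {
    "dataset": "LJ",
    "num_audio_features": 80,
    "output_type": "mel",
    "vocab_file": "path/to/vocab.txt",
    "dataset_files": ["path/to/train.csv"],   # "|"-separated csv files
    "dataset_location": "path/to/wavs",
    "feature_normalize": False,
    "mag_power": 2,
    "pad_EOS": True,
    "pad_to": 8,
    "trim": False,
    "data_min": 1e-2,    # 1e-2 is suggested above when using htk mels
    "mel_type": "htk",
}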
_parse_audio_transcript_element(element)
Parses tf.data element from TextLineDataset into audio and text.
Parameters: element – tf.data element from TextLineDataset.
Returns: text_input text as np.array of ids, text_input length, target audio features as np.array, stop token targets as np.array, length of target sequence.
Return type: tuple
_parse_transcript_element(transcript)
Parses text from file and returns array of text features.
Parameters: transcript – the string to parse.
Returns: target text as np.array of ids, target text length.
Return type: tuple
create_feed_dict(model_in)
Creates the feed dict for interactive infer.
Parameters: model_in (str) – the string to be spoken.
Returns: dictionary with values for the placeholders.
Return type: dict
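A hedged usage sketch of the interactive-infer path; data_layer is an already-constructed Text2SpeechDataLayer instance, and the session and fetches come from the surrounding model or runner code, which is not shown here.

# Hypothetical interactive-infer usage; names other than the two documented
# methods are placeholders.
data_layer.create_interactive_placeholders()
feed_dict = data_layer.create_feed_dict("Testing text to speech synthesis.")
# outputs = sess.run(fetches, feed_dict=feed_dict)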
create_interactive_placeholders()
A function that must be defined for data layers that support interactive infer. It creates the placeholders that are stored in self._input_tensors and later passed to the model.
get_magnitude_spec(spectrogram, is_mel=False)
Returns an energy magnitude spectrogram. The processing depends on the data layer params.
Parameters: spectrogram – output spec from model.
Returns: magnitude spectrogram (mag_spec).
static get_optional_params()
Static method with description of optional parameters.
Returns: dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type: dict
static get_required_params()
Static method with description of required parameters.
Returns: dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type: dict
input_tensors
Dictionary containing input tensors. This dictionary has to define the following keys: source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). When self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.
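An illustrative sketch of the dictionary shape this property is documented to return; the tensor names, dtypes, and shapes below are invented for readability and are not the layer's real tensors.

import tensorflow as tf

# Illustrative only: placeholders stand in for the tensors built in build_graph().
tf.compat.v1.disable_eager_execution()

text_ids = tf.compat.v1.placeholder(tf.int32, [None, None], name="text_ids")
text_len = tf.compat.v1.placeholder(tf.int32, [None], name="text_len")
spec = tf.compat.v1.placeholder(tf.float32, [None, None, 80], name="spec")
stop_tokens = tf.compat.v1.placeholder(tf.float32, [None, None], name="stop_tokens")
spec_len = tf.compat.v1.placeholder(tf.int32, [None], name="spec_len")

input_tensors = {
    "source_tensors": [text_ids, text_len],            # passed to the encoder
    "target_tensors": [spec, stop_tokens, spec_len],   # only when mode != "infer"
}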
iterator
tf.data.Dataset iterator. Should be created by self.build_graph().
n_fft
sampling_rate
text2speech_wavenet
class data.text2speech.text2speech_wavenet.WavenetDataLayer(params, model, num_workers=None, worker_id=None)
Bases: open_seq2seq.data.data_layer.DataLayer
Text-to-speech data layer class for Wavenet.
__init__(params, model, num_workers=None, worker_id=None)
Wavenet data layer constructor. See parent class for arguments description.
Config parameters (an illustrative config sketch follows this list):
- num_audio_features (int) — number of spectrogram audio features.
- dataset_files (list) — list with paths to all dataset .csv files.
- dataset_location (str) — path to the directory where wavs are stored.
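A minimal, hypothetical WavenetDataLayer config sketch using only the three parameters documented above; paths are placeholders.

# Hypothetical WavenetDataLayer config; paths are placeholders.
data_layer_params = {
    "num_audio_features": 80,
    "dataset_files": ["path/to/train.csv"],
    "dataset_location": "path/to/wavs",
}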
create_feed_dict(model_in)
Creates the feed dict for interactive infer using a spectrogram.
Parameters: model_in – tuple containing the source audio, the length of the source, the conditioning spectrogram, the length of the spectrogram, and the index of the receptive field window.
create_interactive_placeholders()
A function that must be defined for data layers that support interactive infer. It creates the placeholders that are stored in self._input_tensors and later passed to the model.
static get_optional_params()
Static method with description of optional parameters.
Returns: dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type: dict
static get_required_params()
Static method with description of required parameters.
Returns: dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type: dict
get_size_in_samples()
Should return the dataset size in samples, that is, the number of objects in the dataset. This method is used to calculate a valid epoch size. If this method is not defined, you will need to make sure that your dataset for evaluation is created only for one epoch. You will also not be able to use the num_epochs parameter in the base config.
Returns: dataset size in samples.
Return type: int
input_tensors
Dictionary containing input tensors. This dictionary has to define the following keys: source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). When self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.
iterator
tf.data.Dataset iterator. Should be created by self.build_graph().
speech_utils
data.text2speech.speech_utils.denormalize(features, mean, std)
Denormalizes features with the specified mean and std.
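A minimal sketch of what denormalize is expected to do, assuming it inverts the usual (features - mean) / std normalization; the real helper may differ in details.

import numpy as np

# Sketch only: invert (features - mean) / std.
def denormalize_sketch(features, mean, std):
    return np.asarray(features) * std + mean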
data.text2speech.speech_utils.get_mel(log_mag_spec, fs=22050, n_fft=1024, n_mels=80, power=2.0, feature_normalize=False, mean=0, std=1, mel_basis=None, data_min=1e-05, htk=True, norm=None)
Method to get mel spectrograms from magnitude spectrograms.
Parameters:
- log_mag_spec (np.array) – log of the magnitude spec
- fs (int) – sampling frequency in Hz
- n_fft (int) – size of fft window in samples
- n_mels (int) – number of mel features
- power (float) – power of the mag spectrogram
- feature_normalize (bool) – whether the mag spec was normalized
- mean (float) – normalization param of mag spec
- std (float) – normalization param of mag spec
- mel_basis (np.array) – optional pre-computed mel basis to save computational time if passed. If not passed, it will call librosa to construct one
- data_min (float) – min clip value prior to taking the log.
- htk (bool) – whether to compute the mel spec with the htk or slaney algorithm
- norm – Should be None for htk, and 1 for slaney
Returns: mel_spec with shape [time, n_mels]
Return type: np.array
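A hedged sketch of the conversion described above, assuming the input is a log magnitude spectrogram shaped [time, n_fft/2 + 1]; feature normalization is omitted and the real function may order its steps differently.

import librosa
import numpy as np

def get_mel_sketch(log_mag_spec, fs=22050, n_fft=1024, n_mels=80, power=2.0,
                   data_min=1e-5, htk=True, norm=None, mel_basis=None):
    # Build (or reuse) a mel basis of shape [n_mels, n_fft // 2 + 1].
    if mel_basis is None:
        mel_basis = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels,
                                        htk=htk, norm=norm)
    mag_spec = np.exp(log_mag_spec) ** power     # undo the log, apply the power
    mel_spec = np.dot(mag_spec, mel_basis.T)     # project onto mel bands: [time, n_mels]
    return np.log(np.clip(mel_spec, a_min=data_min, a_max=None))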
data.text2speech.speech_utils.get_speech_features(signal, fs, num_features, features_type='magnitude', n_fft=1024, hop_length=256, mag_power=2, feature_normalize=False, mean=0.0, std=1.0, data_min=1e-05, mel_basis=None)
Helper function to retrieve spectrograms from a loaded wav.
Parameters:
- signal – signal loaded with librosa.
- fs (int) – sampling frequency in Hz.
- num_features (int) – number of speech features in frequency domain.
- features_type (string) – ‘magnitude’ or ‘mel’.
- n_fft (int) – size of analysis window in samples.
- hop_length (int) – stride of analysis window in samples.
- mag_power (int) – power to raise magnitude spectrograms to (prior to dot product with mel basis): 1 for energy spectrograms, 2 for power spectrograms
- feature_normalize (bool) – whether to normalize the data with mean and std
- mean (float) – if normalize is enabled, the mean to normalize to
- std (float) – if normalize is enabled, the deviation to normalize to
- data_min (float) – min clip value prior to taking the log.
Returns: np.array of audio features with shape=[num_time_steps, num_features].
Return type: np.array
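A hedged sketch of the pipeline described above, built from librosa primitives: STFT magnitude raised to mag_power, optional projection onto a mel basis, then a clipped log. The actual helper handles more options (normalization, a precomputed mel basis), so treat this as an illustration only.

import librosa
import numpy as np

def speech_features_sketch(signal, fs, num_features, features_type="mel",
                           n_fft=1024, hop_length=256, mag_power=2, data_min=1e-5):
    # Magnitude spectrogram raised to mag_power: [n_fft // 2 + 1, time]
    mag = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)) ** mag_power
    if features_type == "mel":
        mel_basis = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=num_features)
        features = np.dot(mel_basis, mag)        # [num_features, time]
    else:  # "magnitude"
        features = mag[:num_features, :]
    return np.log(np.clip(features, a_min=data_min, a_max=None)).T  # [time, num_features]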
data.text2speech.speech_utils.get_speech_features_from_file(filename, num_features, features_type='magnitude', n_fft=1024, hop_length=None, mag_power=2, feature_normalize=False, mean=0.0, std=1.0, trim=False, data_min=1e-05, return_raw_audio=False, return_audio_duration=False, augmentation=None, mel_basis=None)
Helper function to retrieve spectrograms from wav files.
Parameters:
- filename (string) – WAVE filename.
- num_features (int) – number of speech features in frequency domain.
- features_type (string) – ‘magnitude’ or ‘mel’.
- n_fft (int) – size of analysis window in samples.
- hop_length (int) – stride of analysis window in samples.
- mag_power (int) – power to raise magnitude spectrograms to (prior to dot product with mel basis): 1 for energy spectrograms, 2 for power spectrograms
- feature_normalize (bool) – whether to normalize the data with mean and std
- mean (float) – if normalize is enabled, the mean to normalize to
- std (float) – if normalize is enabled, the deviation to normalize to
- trim (bool) – Whether to trim silence via librosa or not
- data_min (float) – min clip value prior to taking the log.
Returns: np.array of audio features with shape=[num_time_steps, num_features].
Return type: np.array
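A hedged usage example; it assumes the module is importable as open_seq2seq.data.text2speech.speech_utils, and the wav path is a placeholder.

from open_seq2seq.data.text2speech.speech_utils import get_speech_features_from_file

# Placeholder path; any LJSpeech-style wav is processed the same way.
mel = get_speech_features_from_file(
    "path/to/sample.wav",
    num_features=80,
    features_type="mel",
    n_fft=1024,
    mag_power=2,
    trim=False,
)
print(mel.shape)  # expected: (num_time_steps, 80)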
data.text2speech.speech_utils.inverse_mel(log_mel_spec, fs=22050, n_fft=1024, n_mels=80, power=2.0, feature_normalize=False, mean=0, std=1, mel_basis=None, htk=True, norm=None)
Reconstructs magnitude spectrogram from a mel spectrogram by multiplying it with the transposed mel basis.
Parameters:
- log_mel_spec (np.array) – log of the mel spec
- fs (int) – sampling frequency in Hz
- n_fft (int) – size of fft window in samples
- n_mels (int) – number of mel features
- power (float) – power of the mag spectrogram that was used to generate the mel spec
- feature_normalize (bool) – whether the mel spec was normalized
- mean (float) – normalization param of mel spec
- std (float) – normalization param of mel spec
- mel_basis (np.array) – optional pre-computed mel basis to save computational time if passed. If not passed, it will call librosa to construct one
- htk (bool) – whether to compute the mel spec with the htk or slaney algorithm
- norm – Should be None for htk, and 1 for slaney
Returns: mag_spec with shape [time, n_fft/2 + 1]
Return type: np.array
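A minimal sketch of the pseudo-inverse described above, assuming the input is a log mel spectrogram shaped [time, n_mels]; feature normalization and the power handling are omitted, so this is an illustration rather than the library's exact code.

import librosa
import numpy as np

def inverse_mel_sketch(log_mel_spec, fs=22050, n_fft=1024, n_mels=80,
                       htk=True, norm=None, mel_basis=None):
    if mel_basis is None:
        mel_basis = librosa.filters.mel(sr=fs, n_fft=n_fft, n_mels=n_mels,
                                        htk=htk, norm=norm)
    mel_spec = np.exp(log_mel_spec)           # undo the log
    mag_spec = np.dot(mel_spec, mel_basis)    # [time, n_fft // 2 + 1]
    return mag_spec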