speech2text¶
Data Layer for Speech-to-Text models
class data.speech2text.speech2text.Speech2TextDataLayer(params, model, num_workers, worker_id)[source]¶
Bases: open_seq2seq.data.data_layer.DataLayer
Speech-to-text data layer class.
__init__(params, model, num_workers, worker_id)[source]¶
Speech-to-text data layer constructor. See parent class for arguments description.
Config parameters:
- backend (str) — audio pre-processing backend (‘psf’ [default] or ‘librosa’ [recommended]).
- num_audio_features (int) — number of audio features to extract.
- input_type (str) — could be either “spectrogram” or “mfcc”.
- vocab_file (str) — path to vocabulary file or sentencepiece model.
- dataset_files (list) — list with paths to all dataset .csv files.
- augmentation (dict) — optional dictionary with data augmentation parameters. Can contain “speed_perturbation_ratio”, “noise_level_min” and “noise_level_max” parameters, e.g.:
  { 'speed_perturbation_ratio': 0.05, 'noise_level_min': -90, 'noise_level_max': -60, }
  For additional details on these parameters see the data.speech2text.speech_utils.augment_audio_signal() function.
- pad_to (int) — align audio sequence length to pad_to value.
- max_duration (float) — drop all samples longer than max_duration (seconds).
- min_duration (float) — drop all samples shorter than min_duration (seconds).
- bpe (bool) — use BPE encodings.
- autoregressive (bool) — boolean indicating whether the model is autoregressive.
- syn_enable (bool) — boolean indicating whether the model is using synthetic data.
- syn_subdirs (list) — must be defined if using synthetic mode. Contains a list of subdirectories that hold the synthetic wav files.
- window_size (float) — window’s duration (in seconds).
- window_stride (float) — window’s stride (in seconds).
- dither (float) — weight of Gaussian noise to apply to input signal for dithering/preventing quantization noise.
- num_fft (int) — size of FFT window to use if features require FFT; defaults to smallest power of 2 larger than window size.
- norm_per_feature (bool) — if True, the output features will be normalized (whitened) individually; if False, a global mean/std over all features will be used for normalization.
- window (str) — window function to apply before FFT (‘hanning’, ‘hamming’, ‘none’).
- precompute_mel_basis (bool) — compute and store mel basis. If False, it will be computed for every get_speech_features call. Default: False.
- sample_freq (int) — required for precompute_mel_basis.
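For illustration, a params dictionary for this data layer might look like the following sketch. The keys are the config parameters listed above; the values are assumptions, not recommendations:

  # Hypothetical example config (values are illustrative assumptions)
  speech2text_params = {
      "backend": "librosa",            # recommended backend
      "num_audio_features": 64,
      "input_type": "mfcc",
      "vocab_file": "data/vocab.txt",
      "dataset_files": ["data/train.csv"],
      "window_size": 0.02,             # 20 ms analysis window
      "window_stride": 0.01,           # 10 ms stride
      "augmentation": {
          "speed_perturbation_ratio": 0.05,
          "noise_level_min": -90,
          "noise_level_max": -60,
      },
  }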
_get_audio(wav)[source]¶
Parses audio from wav and returns array of audio features.
Parameters: wav – numpy array containing the raw audio signal.
Returns: source audio features as np.array, length of source sequence, sample id.
Return type: tuple
_parse_audio_element(id_and_audio_filename)[source]¶
Parses audio from file and returns array of audio features.
Parameters: id_and_audio_filename – tuple of sample id and corresponding audio file name.
Returns: source audio features as np.array, length of source sequence, sample id.
Return type: tuple
_parse_audio_transcript_element(element)[source]¶
Parses tf.data element from TextLineDataset into audio and text.
Parameters: element – tf.data element from TextLineDataset.
Returns: source audio features as np.array, length of source sequence, target text as np.array of ids, target text length.
Return type: tuple
create_feed_dict(model_in)[source]¶
Creates the feed dict for interactive infer.
Parameters: model_in (str or np.array) – either a str containing the file path of the wav file, or a 1-d numpy array containing the raw audio signal.
Returns: dictionary with values for the placeholders.
Return type: feed_dict (dict)
create_interactive_placeholders()[source]¶
A function that must be defined for data layers that support interactive infer. This function is intended to create placeholders that are stored in self._input_tensors and subsequently passed to the model.
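A minimal sketch of how these two interactive-infer hooks fit together (the constructed data_layer object, the session usage, and the wav path are assumptions for illustration):

  # Sketch: interactive infer (data_layer, session usage, and path are assumed)
  data_layer.create_interactive_placeholders()
  feed_dict = data_layer.create_feed_dict("sample.wav")
  # feed_dict maps the placeholders to concrete values and can be
  # passed to session.run(fetches, feed_dict=feed_dict)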
static get_optional_params()[source]¶
Static method with description of optional parameters.
Returns: dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type: dict
static get_required_params()[source]¶
Static method with description of required parameters.
Returns: dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type: dict
input_tensors¶
Dictionary with input tensors.
input_tensors["source_tensors"] contains:
- source_sequence (shape=[batch_size x sequence length x num_audio_features])
- source_length (shape=[batch_size])
input_tensors["target_tensors"] contains:
- target_sequence (shape=[batch_size x sequence length])
- target_length (shape=[batch_size])
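As an illustration, downstream code might unpack this dictionary roughly as follows (the constructed data_layer object is an assumption):

  # Sketch: unpacking the data layer's input tensors (data_layer is assumed)
  src_sequence, src_length = data_layer.input_tensors["source_tensors"]
  # target tensors are only defined when the mode is not "infer"
  tgt_sequence, tgt_length = data_layer.input_tensors["target_tensors"]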
iterator¶
Underlying tf.data iterator.
speech_commands¶
class data.speech2text.speech_commands.SpeechCommandsDataLayer(params, model, num_workers=None, worker_id=None)[source]¶
Bases: open_seq2seq.data.data_layer.DataLayer
__init__(params, model, num_workers=None, worker_id=None)[source]¶
ResNet Speech Commands data layer constructor.
Config parameters:
- dataset_files (list) — list with paths to all dataset .csv files
- dataset_location (str) — string with path to directory where .wavs are stored
- num_audio_features (int) — number of spectrogram audio features and image length
- audio_length (int) — cropping length of spectrogram and image width
- num_labels (int) — number of classes in dataset
- model_format (str) — determines input format, should be one of “jasper” or “resnet”
- cache_data (bool) — cache the training data in the first epoch
- augment_data (bool) — add time stretch and noise to training data
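A hypothetical params dictionary for this layer might look like the following sketch (all values are illustrative assumptions):

  # Hypothetical example config (values are illustrative assumptions)
  speech_commands_params = {
      "dataset_files": ["train.csv"],
      "dataset_location": "data/speech_commands",
      "num_audio_features": 120,   # spectrogram features / image length
      "audio_length": 120,         # cropping length / image width
      "num_labels": 12,
      "model_format": "resnet",    # or "jasper"
      "cache_data": True,
      "augment_data": True,
  }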
static get_optional_params()[source]¶
Static method with description of optional parameters.
Returns: dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type: dict
static get_required_params()[source]¶
Static method with description of required parameters.
Returns: dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type: dict
get_size_in_samples()[source]¶
Should return the dataset size in samples, that is, the number of objects in the dataset. This method is used to calculate a valid epoch size. If this method is not defined, you will need to make sure that your dataset for evaluation is created only for one epoch. You will also not be able to use the num_epochs parameter in the base config.
Returns: dataset size in samples.
Return type: int
input_tensors¶
Dictionary containing input tensors. This dictionary has to define the following keys: source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). When self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.
iterator¶
tf.data.Dataset iterator. Should be created by self.build_graph().
speech_utils¶
exception data.speech2text.speech_utils.PreprocessOnTheFlyException[source]¶
Bases: Exception
Exception that is thrown to skip loading preprocessed features from disk and recompute them on the fly instead. This saves disk space (useful if you are experimenting with data input formats/preprocessing) but can be slower. The slowdown is especially apparent for small, fast NNs.
exception data.speech2text.speech_utils.RegenerateCacheException[source]¶
Bases: Exception
Exception that is thrown to force recomputation of (preprocessed) features.
data.speech2text.speech_utils.augment_audio_signal(signal, sample_freq, augmentation)[source]¶
Function that performs audio signal augmentation.
Parameters: - signal (np.array) – np.array containing raw audio signal.
- sample_freq (float) – frames per second.
- augmentation (dict, optional) – None or dictionary of augmentation parameters. If not None, it can contain ‘speed_perturbation_ratio’, ‘noise_level_min’, and ‘noise_level_max’ fields, e.g.:
  augmentation={ 'speed_perturbation_ratio': 0.2, 'noise_level_min': -90, 'noise_level_max': -46, }
  ‘speed_perturbation_ratio’ can either be a list of possible speed perturbation factors or a float. If a float, a random value is drawn from U[1-speed_perturbation_ratio, 1+speed_perturbation_ratio].
Returns: np.array with augmented audio signal.
Return type: np.array
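A minimal usage sketch (loading the wav via librosa and the file path are assumptions; any 1-d float audio signal works):

  import librosa
  from open_seq2seq.data.speech2text.speech_utils import augment_audio_signal

  # Load a mono float signal (librosa and the path are assumptions)
  signal, sample_freq = librosa.load("sample.wav", sr=16000)
  augmented = augment_audio_signal(
      signal,
      sample_freq,
      augmentation={
          'speed_perturbation_ratio': 0.2,
          'noise_level_min': -90,
          'noise_level_max': -46,
      },
  )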
data.speech2text.speech_utils.get_preprocessed_data_path(filename, params)[source]¶
Function to convert the audio path into the path to the preprocessed version of this audio.
Parameters: - filename – WAVE filename.
- params – dictionary containing preprocessing parameters.
Returns: path to the new file (without extension). The path is generated from the relevant preprocessing parameters.
data.speech2text.speech_utils.get_speech_features(signal, sample_freq, params)[source]¶
Get speech features using either librosa (recommended) or python_speech_features.
Parameters: - signal (np.array) – np.array containing raw audio signal.
- sample_freq (float) – sample rate of the signal.
- params (dict) – parameters of pre-processing.
Returns: np.array of audio features with shape=[num_time_steps, num_features], and audio_duration (float) – duration of the signal in seconds.
Return type: tuple
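A usage sketch (the params keys mirror the data layer config described above; all values are illustrative assumptions):

  import numpy as np
  from open_seq2seq.data.speech2text.speech_utils import get_speech_features

  # One second of silence at 16 kHz, purely for illustration
  signal = np.zeros(16000, dtype=np.float32)
  features, duration = get_speech_features(
      signal,
      sample_freq=16000,
      params={
          'num_audio_features': 64,
          'input_type': 'spectrogram',
          'window_size': 0.02,
          'window_stride': 0.01,
      },
  )
  # features has shape [num_time_steps, 64]; duration is about 1.0 s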
data.speech2text.speech_utils.get_speech_features_from_file(filename, params)[source]¶
Function to get a numpy array of features from an audio file. If params['cache_features'] == True, it tries to load preprocessed data from disk (and stores it there after preprocessing if it is not found); otherwise it performs preprocessing on the fly.
Parameters: - filename (string) – WAVE filename.
- params (dict) – the following parameters:
  num_features (int): number of speech features in frequency domain.
  features_type (string): ‘mfcc’ or ‘spectrogram’.
  window_size (float): size of analysis window in milli-seconds.
  window_stride (float): stride of analysis window in milli-seconds.
  augmentation (dict, optional): dictionary of augmentation parameters. See augment_audio_signal() for specification and example.
  window (str): window function to apply.
  dither (float): weight of Gaussian noise to apply to input signal for dithering/preventing quantization noise.
  num_fft (int): size of FFT window to use if features require FFT; defaults to smallest power of 2 larger than window size.
  norm_per_feature (bool): if True, the output features will be normalized (whitened) individually; if False, a global mean/std over all features will be used for normalization.
Returns: np.array of audio features with shape=[num_time_steps, num_features].
Return type: np.array
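A sketch of cached feature extraction (the file path is an assumption, and the only cache-related key shown is the one named in the description above):

  from open_seq2seq.data.speech2text.speech_utils import get_speech_features_from_file

  # Sketch: feature extraction with caching (path and values are assumptions)
  features = get_speech_features_from_file(
      "sample.wav",
      params={
          'num_audio_features': 64,
          'input_type': 'spectrogram',
          'cache_features': True,   # reuse preprocessed features on later runs
      },
  )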
data.speech2text.speech_utils.get_speech_features_librosa(signal, sample_freq, num_features, features_type='spectrogram', window_size=0.02, window_stride=0.01, augmentation=None, window_fn=<function hanning>, num_fft=None, dither=0.0, norm_per_feature=False, mel_basis=None)[source]¶
Function to convert raw audio signal to numpy array of features. Backend: librosa.
Parameters: - signal (np.array) – np.array containing raw audio signal.
- sample_freq (float) – frames per second.
- num_features (int) – number of speech features in frequency domain.
- features_type (string) – ‘mfcc’ or ‘spectrogram’.
- window_size (float) – size of analysis window in milli-seconds.
- window_stride (float) – stride of analysis window in milli-seconds.
- augmentation (dict, optional) – dictionary of augmentation parameters. See augment_audio_signal() for specification and example.
Returns: np.array of audio features with shape=[num_time_steps, num_features], and audio_duration (float) – duration of the signal in seconds.
Return type: tuple
data.speech2text.speech_utils.get_speech_features_psf(signal, sample_freq, num_features, pad_to=8, features_type='spectrogram', window_size=0.02, window_stride=0.01, augmentation=None)[source]¶
Function to convert raw audio signal to numpy array of features. Backend: python_speech_features.
Parameters: - signal (np.array) – np.array containing raw audio signal.
- sample_freq (float) – frames per second.
- num_features (int) – number of speech features in frequency domain.
- pad_to (int) – if specified, the length will be padded to become divisible by the pad_to parameter.
- features_type (string) – ‘mfcc’ or ‘spectrogram’.
- window_size (float) – size of analysis window in milli-seconds.
- window_stride (float) – stride of analysis window in milli-seconds.
- augmentation (dict, optional) – dictionary of augmentation parameters. See augment_audio_signal() for specification and example.
- apply_window (bool) – whether to apply Hann window for mfcc and logfbank. The python_speech_features version should accept winfunc if it is True.
Returns: np.array of audio features with shape=[num_time_steps, num_features], and audio_duration (float) – duration of the signal in seconds.
Return type: tuple
data.speech2text.speech_utils.load_features(path, data_format)[source]¶
Function to load (preprocessed) features from disk.
Parameters: - path – the path where the features are stored.
- data_format – the format in which the features are stored.
Returns: tuple of (features, duration).
data.speech2text.speech_utils.normalize_signal(signal)[source]¶
Normalize float32 signal to [-1, 1] range.
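A minimal sketch of what such a peak normalization typically does (an illustrative implementation, not necessarily the library’s exact one):

  import numpy as np

  def normalize_signal_sketch(signal):
      # Scale so the largest absolute sample maps to 1.0; the small
      # epsilon guards against division by zero on all-zero input.
      return signal / (np.max(np.abs(signal)) + 1e-5)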
data.speech2text.speech_utils.save_features(features, duration, path, data_format, verbose=False)[source]¶
Function to save (preprocessed) features to disk.
Parameters: - features – the features to store.
- duration – metadata: duration in seconds of the audio file.
- path – path to store the data.
- data_format – format to store the data in (‘npy’, ‘npz’, ‘hdf5’).
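A round-trip sketch using save_features and load_features together (the path, the extension handling, and the placeholder values are assumptions):

  import numpy as np
  from open_seq2seq.data.speech2text.speech_utils import save_features, load_features

  features = np.zeros((100, 64), dtype=np.float32)  # placeholder features
  save_features(features, 1.0, "cache/sample", data_format="npy")
  features, duration = load_features("cache/sample", data_format="npy")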