speech2text

speech2text

Data Layer for Speech-to-Text models

class data.speech2text.speech2text.Speech2TextDataLayer(params, model, num_workers, worker_id)[source]

Bases: open_seq2seq.data.data_layer.DataLayer

Speech-to-text data layer class.

__init__(params, model, num_workers, worker_id)[source]

Speech-to-text data layer constructor. See parent class for arguments description. Config parameters (an example configuration is sketched after this list):

  • backend (str) — audio pre-processing backend (‘psf’ [default] or ‘librosa’ [recommended]).

  • num_audio_features (int) — number of audio features to extract.

  • input_type (str) — could be either “spectrogram” or “mfcc”.

  • vocab_file (str) — path to vocabulary file or sentencepiece model.

  • dataset_files (list) — list with paths to all dataset .csv files.

  • augmentation (dict) — optional dictionary with data augmentation parameters. Can contain “speed_perturbation_ratio”, “noise_level_min” and “noise_level_max” parameters, e.g.:

    {
      'speed_perturbation_ratio': 0.05,
      'noise_level_min': -90,
      'noise_level_max': -60,
    }
    

    For additional details on these parameters, see the data.speech2text.speech_utils.augment_audio_signal() function.

  • pad_to (int) — align audio sequence length to pad_to value.

  • max_duration (float) — drop all samples longer than max_duration (seconds).

  • min_duration (float) — drop all samples shorter than min_duration (seconds).

  • bpe (bool) — use BPE encodings.

  • autoregressive (bool) — boolean indicating whether the model is autoregressive.

  • syn_enable (bool) — boolean indicating whether the model is using synthetic data.

  • syn_subdirs (list) — must be defined if using synthetic mode. Contains a list of subdirectories that hold the synthetic wav files.

  • window_size (float) — window’s duration (in seconds).

  • window_stride (float) — window’s stride (in seconds).

  • dither (float) — weight of Gaussian noise to apply to input signal for dithering/preventing quantization noise.

  • num_fft (int) — size of FFT window to use if features require FFT; defaults to smallest power of 2 larger than window size.

  • norm_per_feature (bool) — if True, the output features will be normalized (whitened) individually. If False, a global mean/std over all features will be used for normalization.

  • window (str) — window function to apply before FFT (‘hanning’, ‘hamming’, ‘none’).

  • precompute_mel_basis (bool) — compute and store the mel basis. If False, it will be recomputed for every get_speech_features call. Default: False.

  • sample_freq (int) — sample frequency of the audio; required if precompute_mel_basis is True.
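
For reference, a minimal config sketch (all paths and values below are illustrative placeholders, not recommendations):

    # illustrative data layer configuration; paths/values are placeholders
    data_layer_params = {
        'backend': 'librosa',        # recommended backend
        'num_audio_features': 64,
        'input_type': 'mfcc',
        'vocab_file': 'vocab/alphabet.txt',
        'dataset_files': ['data/train.csv'],
        'window_size': 20e-3,        # seconds
        'window_stride': 10e-3,      # seconds
        'augmentation': {
            'speed_perturbation_ratio': 0.05,
            'noise_level_min': -90,
            'noise_level_max': -60,
        },
    }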

_get_audio(wav)[source]

Parses audio from wav and returns an array of audio features.

Parameters:wav (np.array) – numpy array containing the wav signal.
Returns:source audio features as np.array, length of source sequence, sample id.
Return type:tuple
_parse_audio_element(id_and_audio_filename)[source]

Parses audio from file and returns an array of audio features.

Parameters:id_and_audio_filename (tuple) – tuple of sample id and corresponding audio file name.
Returns:source audio features as np.array, length of source sequence, sample id.
Return type:tuple
_parse_audio_transcript_element(element)[source]

Parses tf.data element from TextLineDataset into audio and text.

Parameters:element – tf.data element from TextLineDataset.
Returns:source audio features as np.array, length of source sequence, target text as np.array of ids, target text length.
Return type:tuple
build_graph()[source]

Here all TensorFlow graph construction should happen.

create_feed_dict(model_in)[source]

Creates the feed dict for interactive infer.

Parameters:model_in (str or np.array) – either a str containing the file path of a wav file, or a 1-d numpy array containing the raw audio signal.
Returns:Dictionary with values for the placeholders.
Return type:feed_dict (dict)
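
A hypothetical usage sketch (assumes an existing data_layer, model, and tf.Session sess; get_output_tensors() is assumed from the base model API):

    # sketch: interactive infer through a feed dict (names are assumptions)
    data_layer.create_interactive_placeholders()
    feed_dict = data_layer.create_feed_dict('sample.wav')  # or a 1-d np.array
    outputs = sess.run(model.get_output_tensors(), feed_dict=feed_dict)
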
create_interactive_placeholders()[source]

A function that must be defined for data layers that support interactive infer. This function is intended to create placeholders that are stored in self._input_tensors and then passed to the model.

static get_optional_params()[source]

Static method with description of optional parameters.

Returns:Dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type:dict
static get_required_params()[source]

Static method with description of required parameters.

Returns:Dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type:dict
get_size_in_samples()[source]

Returns the number of audio files.

input_tensors

Dictionary with input tensors. input_tensors["source_tensors"] contains:

  • source_sequence (shape=[batch_size x sequence_length x num_audio_features])
  • source_length (shape=[batch_size])
input_tensors["target_tensors"] contains:
  • target_sequence (shape=[batch_size x sequence_length])
  • target_length (shape=[batch_size])
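
A short sketch of how these tensors are typically unpacked (variable names are illustrative):

    # sketch: consuming the data layer's input tensors
    source_sequence, source_length = data_layer.input_tensors['source_tensors']
    if data_layer.params['mode'] != 'infer':
        target_sequence, target_length = data_layer.input_tensors['target_tensors']
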
iterator

Underlying tf.data iterator.

split_data(data)[source]

Splits the data between workers (based on num_workers and worker_id).

speech_commands

class data.speech2text.speech_commands.SpeechCommandsDataLayer(params, model, num_workers=None, worker_id=None)[source]

Bases: open_seq2seq.data.data_layer.DataLayer

__init__(params, model, num_workers=None, worker_id=None)[source]

ResNet Speech Commands data layer constructor.

Config parameters (an example configuration is sketched after this list):

  • dataset_files (list) — list with paths to all dataset .csv files
  • dataset_location (str) — string with path to directory where .wavs are stored
  • num_audio_features (int) — number of spectrogram audio features and image length
  • audio_length (int) — cropping length of spectrogram and image width
  • num_labels (int) — number of classes in dataset
  • model_format (str) — determines input format, should be one of “jasper” or “resnet”
  • cache_data (bool) — cache the training data in the first epoch
  • augment_data (bool) — add time stretch and noise to training data
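
An illustrative config sketch (paths and values are placeholders, not recommendations):

    # sketch of SpeechCommandsDataLayer params; paths/values are placeholders
    data_layer_params = {
        'dataset_files': ['data/speech_commands/train.csv'],
        'dataset_location': 'data/speech_commands/',
        'num_audio_features': 128,
        'audio_length': 128,
        'num_labels': 30,
        'model_format': 'resnet',
        'cache_data': True,
        'augment_data': True,
    }
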
build_graph()[source]

Here all TensorFlow graph construction should happen.

static get_optional_params()[source]

Static method with description of optional parameters.

Returns:Dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type:dict
static get_required_params()[source]

Static method with description of required parameters.

Returns:Dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type:dict
get_size_in_samples()[source]

Should return the dataset size in samples. That is, the number of objects in the dataset. This method is used to calculate a valid epoch size. If this method is not defined, you will need to make sure that your dataset for evaluation is created only for one epoch. You will also not be able to use num_epochs parameter in the base config.

Returns:dataset size in samples.
Return type:int
input_tensors

Dictionary containing input tensors. This dictionary has to define the following keys: source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). When self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.

iterator

tf.data.Dataset iterator. Should be created by self.build_graph().

parse_element(element)[source]

Reads an audio file and returns the augmented spectrogram image.

preprocess_image(image)[source]

Crops or pads a spectrogram into a fixed-dimension square image.
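
A minimal sketch of the crop-or-pad idea (not necessarily the library’s exact implementation):

    import numpy as np

    def crop_or_pad(image, target_length):
        # center-crop or zero-pad a [time, freq] spectrogram along the time axis
        length = image.shape[0]
        if length > target_length:
            start = (length - target_length) // 2
            return image[start:start + target_length, :]
        pad_before = (target_length - length) // 2
        pad_after = target_length - length - pad_before
        return np.pad(image, ((pad_before, pad_after), (0, 0)), mode='constant')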

split_data(data)[source]

Splits the data between workers (based on num_workers and worker_id).

speech_utils

exception data.speech2text.speech_utils.PreprocessOnTheFlyException[source]

Bases: Exception

Exception that is thrown to skip loading preprocessed features from disk and recompute them on the fly instead. This saves disk space (useful if you’re experimenting with data input formats/preprocessing) but can be slower. The slowdown is especially apparent for small, fast NNs.

exception data.speech2text.speech_utils.RegenerateCacheException[source]

Bases: Exception

Exception that is thrown to force recomputation of (preprocessed) features.

data.speech2text.speech_utils.augment_audio_signal(signal, sample_freq, augmentation)[source]

Function that performs audio signal augmentation.

Parameters:
  • signal (np.array) – np.array containing raw audio signal.
  • sample_freq (float) – frames per second.
  • augmentation (dict, optional) –

    None or dictionary of augmentation parameters. If not None, it can contain ‘speed_perturbation_ratio’, ‘noise_level_min’, and ‘noise_level_max’ fields, e.g.:

    augmentation={
      'speed_perturbation_ratio': 0.2,
      'noise_level_min': -90,
      'noise_level_max': -46,
    }
    

    ’speed_perturbation_ratio’ can either be a list of possible speed perturbation factors or a float. If a float, a random value is drawn from U[1-speed_perturbation_ratio, 1+speed_perturbation_ratio] and used as the speed factor.

Returns:np.array with augmented audio signal.
Return type:np.array
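
A usage sketch (assumes 16 kHz audio; the import path is assumed from this module’s layout):

    import numpy as np
    from data.speech2text.speech_utils import augment_audio_signal

    signal = np.random.uniform(-1, 1, 16000).astype(np.float32)  # 1 s of fake audio
    augmented = augment_audio_signal(
        signal,
        sample_freq=16000.0,
        augmentation={
            'speed_perturbation_ratio': 0.2,  # speed factor from U[0.8, 1.2]
            'noise_level_min': -90,
            'noise_level_max': -46,
        },
    )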

data.speech2text.speech_utils.get_preprocessed_data_path(filename, params)[source]

Function to convert the audio path into the path to the preprocessed version of this audio.

Parameters:
  • filename (string) – WAVE filename.
  • params (dict) – dictionary containing preprocessing parameters.
Returns:path to the new file (without extension). The path is generated from the relevant preprocessing parameters.

data.speech2text.speech_utils.get_speech_features(signal, sample_freq, params)[source]

Get speech features using either librosa (recommended) or python_speech_features.

Parameters:
  • signal (np.array) – np.array containing raw audio signal.
  • sample_freq (float) – sample rate of the signal.
  • params (dict) – parameters of pre-processing.
Returns:np.array of audio features with shape=[num_time_steps, num_features]; audio_duration (float): duration of the signal in seconds.
Return type:tuple
data.speech2text.speech_utils.get_speech_features_from_file(filename, params)[source]
Function to get a numpy array of features from an audio file.

If params[‘cache_features’] == True, it tries to load preprocessed data from disk (or stores it there after preprocessing); otherwise, preprocessing is performed on the fly.
Parameters:
  • filename (string) – WAVE filename.
  • params (dict) – dictionary with the following parameters:
    • num_features (int): number of speech features in frequency domain.
    • features_type (string): ‘mfcc’ or ‘spectrogram’.
    • window_size (float): size of analysis window in milli-seconds.
    • window_stride (float): stride of analysis window in milli-seconds.
    • augmentation (dict, optional): dictionary of augmentation parameters. See augment_audio_signal() for specification and example.
    • window (str): window function to apply.
    • dither (float): weight of Gaussian noise to apply to input signal for dithering/preventing quantization noise.
    • num_fft (int): size of FFT window to use if features require FFT; defaults to smallest power of 2 larger than window size.
    • norm_per_feature (bool): if True, the output features will be normalized (whitened) individually; if False, a global mean/std over all features will be used for normalization.
Returns:np.array of audio features with shape=[num_time_steps, num_features].
Return type:np.array
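
A usage sketch (file path and parameter values are placeholders; parameter keys follow the list above):

    from data.speech2text.speech_utils import get_speech_features_from_file

    features = get_speech_features_from_file(
        'sample.wav',
        params={
            'num_features': 64,
            'features_type': 'spectrogram',
            'window_size': 20.0,    # milli-seconds, per the docstring above
            'window_stride': 10.0,  # milli-seconds
        },
    )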

data.speech2text.speech_utils.get_speech_features_librosa(signal, sample_freq, num_features, features_type='spectrogram', window_size=0.02, window_stride=0.01, augmentation=None, window_fn=<function hanning>, num_fft=None, dither=0.0, norm_per_feature=False, mel_basis=None)[source]

Function to convert raw audio signal to numpy array of features. Backend: librosa.

Parameters:
  • signal (np.array) – np.array containing raw audio signal.
  • sample_freq (float) – frames per second.
  • num_features (int) – number of speech features in frequency domain.
  • pad_to (int) – if specified, the length will be padded to become divisible by the pad_to parameter.
  • features_type (string) – ‘mfcc’ or ‘spectrogram’.
  • window_size (float) – size of analysis window in milli-seconds.
  • window_stride (float) – stride of analysis window in milli-seconds.
  • augmentation (dict, optional) – dictionary of augmentation parameters. See augment_audio_signal() for specification and example.
Returns:np.array of audio features with shape=[num_time_steps, num_features]; audio_duration (float): duration of the signal in seconds.
Return type:tuple

data.speech2text.speech_utils.get_speech_features_psf(signal, sample_freq, num_features, pad_to=8, features_type='spectrogram', window_size=0.02, window_stride=0.01, augmentation=None)[source]

Function to convert raw audio signal to numpy array of features. Backend: python_speech_features.

Parameters:
  • signal (np.array) – np.array containing raw audio signal.
  • sample_freq (float) – frames per second.
  • num_features (int) – number of speech features in frequency domain.
  • pad_to (int) – if specified, the length will be padded to become divisible by the pad_to parameter.
  • features_type (string) – ‘mfcc’ or ‘spectrogram’.
  • window_size (float) – size of analysis window in milli-seconds.
  • window_stride (float) – stride of analysis window in milli-seconds.
  • augmentation (dict, optional) – dictionary of augmentation parameters. See augment_audio_signal() for specification and example.
  • apply_window (bool) – whether to apply a Hann window for mfcc and logfbank. The installed python_speech_features version should accept winfunc if this is True.
Returns:np.array of audio features with shape=[num_time_steps, num_features]; audio_duration (float): duration of the signal in seconds.
Return type:tuple

data.speech2text.speech_utils.load_features(path, data_format)[source]

Function to load (preprocessed) features from disk.

Parameters:
  • path – the path where the features are stored.
  • data_format – the format in which the features are stored.
Returns:tuple of (features, duration).
Return type:tuple

data.speech2text.speech_utils.normalize_signal(signal)[source]

Normalize float32 signal to [-1, 1] range.
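
A common way to implement this (a sketch, not necessarily the library’s exact code):

    import numpy as np

    def normalize_signal(signal):
        # scale so the loudest sample has magnitude ~1; epsilon avoids div by zero
        return signal / (np.max(np.abs(signal)) + 1e-5)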

data.speech2text.speech_utils.preemphasis(signal, coeff=0.97)[source]

Applies a first-order pre-emphasis filter to the signal.
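
Pre-emphasis computes y[t] = x[t] - coeff * x[t-1]; a minimal sketch:

    import numpy as np

    def preemphasis(signal, coeff=0.97):
        # first-order high-pass filter: y[t] = x[t] - coeff * x[t-1]
        return np.append(signal[0], signal[1:] - coeff * signal[:-1])
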
data.speech2text.speech_utils.save_features(features, duration, path, data_format, verbose=False)[source]

Function to save (preprocessed) features to disk.

Parameters:
  • features – features to save.
  • duration – metadata: duration in seconds of the audio file.
  • path – path to store the data.
  • data_format – format to store the data in (‘npy’, ‘npz’, or ‘hdf5’).
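
A round-trip sketch with save_features() and load_features() (the ‘npy’ format and path are assumptions; following get_preprocessed_data_path(), the path is given without extension):

    import numpy as np
    from data.speech2text.speech_utils import save_features, load_features

    features = np.random.rand(100, 64).astype(np.float32)
    save_features(features, duration=1.0, path='cache/sample', data_format='npy')
    restored, duration = load_features('cache/sample', data_format='npy')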