data

data_layer

Data layer classes

class data.data_layer.DataLayer(params, model, num_workers, worker_id)[source]

Bases: object

Abstract class from which all data layers must inherit.

__init__(params, model, num_workers, worker_id)[source]

Data layer constructor. The TensorFlow graph should not be created here, but rather in the self.build_graph() method.

Parameters:
  • params (dict) – parameters describing the data layer. All supported parameters are listed by the get_required_params() and get_optional_params() functions.
  • model (instance of a class derived from Model) – parent model that created this data layer. Can be None if the use case requires no model access.
  • num_workers (int) – number of Horovod processes or number of GPUs if Horovod is not used.
  • worker_id (int) – Horovod process id or GPU id if Horovod is not used.

Config parameters:

  • shuffle (bool) – whether to shuffle the dataset after each epoch. Typically True for training and False for inference and evaluation.
  • dtype – data dtype. Can be either tf.float16 or tf.float32.
build_graph()[source]

All TensorFlow graph construction should happen here.

create_feed_dict(model_in)[source]

Must be defined by data layers that support interactive inference. The input, model_in, is an abstract data element whose form is defined by the data layer. The intended use is for the user to build model_in in a Jupyter notebook and pass it to this method. Given model_in, the data layer must preprocess the raw data and create the feed dict that supplies values for the placeholders defined in create_interactive_placeholders().

create_interactive_placeholders()[source]

Must be defined by data layers that support interactive inference. This function is intended to create the placeholders that are stored in self._input_tensors and then passed to the model.

static get_optional_params()[source]

Static method with description of optional parameters.

Returns: Dictionary containing all the parameters that can be included in the params parameter of the class __init__() method.
Return type: dict
static get_required_params()[source]

Static method with description of required parameters.

Returns: Dictionary containing all the parameters that have to be included in the params parameter of the class __init__() method.
Return type: dict
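The two dictionaries above lend themselves to a simple validation pass over params. The following is a minimal sketch, not the library's own checker: it assumes the dictionaries map parameter names to expected Python types (this page does not specify the value format), and validate_params and the batch_size entry are hypothetical names used only for illustration.

```python
def validate_params(params, required, optional):
    """Check a params dict against required/optional descriptions.

    Assumption: `required` and `optional` map parameter names to the
    expected Python type of the value. This page does not specify that.
    """
    missing = [name for name in required if name not in params]
    if missing:
        raise ValueError("missing required parameters: %s" % sorted(missing))

    # Every supplied key must be described by one of the two dictionaries.
    allowed = dict(optional)
    allowed.update(required)
    for name, value in params.items():
        if name not in allowed:
            raise ValueError("unknown parameter: %s" % name)
        if not isinstance(value, allowed[name]):
            raise TypeError(
                "parameter %s should be of type %s" % (name, allowed[name].__name__)
            )
    return True


# Hypothetical descriptions; only `shuffle` is documented on this page.
required = {"batch_size": int}
optional = {"shuffle": bool}
print(validate_params({"batch_size": 32, "shuffle": True}, required, optional))
```

Failing early on unknown keys, rather than ignoring them, tends to catch misspelled config entries before any graph is built.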
get_size_in_samples()[source]

Should return the dataset size in samples, that is, the number of objects in the dataset. This method is used to calculate a valid epoch size. If this method is not defined, you will need to make sure that your dataset for evaluation is created for one epoch only. You will also not be able to use the num_epochs parameter in the base config.

Returns: dataset size in samples.
Return type: int
input_tensors

Dictionary containing input tensors. This dictionary has to define the key source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). When self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.
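The key requirements described above can be written down as a small check. This is only an illustrative sketch: check_input_tensors is a hypothetical helper, and plain strings stand in for the real tensors so that the example runs without TensorFlow.

```python
def check_input_tensors(input_tensors, mode):
    """Verify that the dictionary defines the keys required for `mode`."""
    if "source_tensors" not in input_tensors:
        raise KeyError("input_tensors must define 'source_tensors'")
    # Targets are only required when the layer is not used for inference.
    if mode != "infer" and "target_tensors" not in input_tensors:
        raise KeyError("input_tensors must define 'target_tensors' when mode != 'infer'")
    return True


# Strings are placeholders for real tensors created in build_graph().
train_inputs = {
    "source_tensors": ["input_sequence", "input_length"],
    "target_tensors": ["target_sequence", "target_length"],
}
infer_inputs = {"source_tensors": ["input_sequence", "input_length"]}

print(check_input_tensors(train_inputs, mode="train"))
print(check_input_tensors(infer_inputs, mode="infer"))
```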

iterator

tf.data.Dataset iterator. Should be created by self.build_graph().

params

Parameters used to construct the data layer (dictionary).

utils

data.utils.load_pre_existing_vocabulary(path, min_idx=0, read_chars=False)[source]

Loads pre-existing vocabulary into memory.

The vocabulary file should contain one token per line, optionally followed by a token count on the same line; the count is ignored. Example:

a 1234
b 4321
c 32342
d
e
word 234
Parameters:
  • path (str) – path to vocabulary.
  • min_idx (int, optional) – minimum id to assign for a token.
  • read_chars (bool, optional) – whether to read only the first symbol of the line.
Returns: vocabulary dictionary mapping tokens (chars/words) to int ids.
Return type: dict
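The file format described above can be parsed in a few lines. The sketch below is not the library's implementation: load_vocab_sketch is a hypothetical name, and it assumes that the token and its count are separated by whitespace and that ids are assigned sequentially, in file order, starting at min_idx.

```python
import io
import os
import tempfile


def load_vocab_sketch(path, min_idx=0, read_chars=False):
    """Map each token in the file to an int id, starting at min_idx."""
    vocab = {}
    idx = min_idx
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip empty lines
            if read_chars:
                token = line[0]  # only the first symbol of the line
            else:
                parts = line.split()  # assumption: whitespace-separated
                if not parts:
                    continue
                token = parts[0]  # anything after the token is ignored
            vocab[token] = idx
            idx += 1
    return vocab


# Write the example file from the text above and load it back.
example = "a 1234\nb 4321\nc 32342\nd\ne\nword 234\n"
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(example)
try:
    vocab = load_vocab_sketch(tmp.name)
    print(vocab)
finally:
    os.remove(tmp.name)
```

A non-zero min_idx leaves the lower ids free, for example for special tokens that the caller reserves separately.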

data.utils.pad_vocab_to_eight(vocab)[source]

Pads the vocabulary so that its size is divisible by 8.

Parameters: vocab (dict) – vocabulary in the form token->id
Returns: vocab with new tokens added if necessary, such that the total vocab size is divisible by 8.
Return type: dict
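As a rough illustration of the padding step, the sketch below adds filler tokens until the vocabulary size is a multiple of 8. It is written under assumptions: pad_vocab_sketch and the "<pad_N>" token names are hypothetical, new ids continue from the largest existing id, and a copy is returned instead of modifying the input (the library function may behave differently on all three points).

```python
def pad_vocab_sketch(vocab, multiple=8):
    """Return a copy of vocab padded so that len(vocab) % multiple == 0."""
    padded = dict(vocab)
    if len(padded) % multiple == 0:
        return padded  # already aligned (this includes the empty vocab)

    next_id = max(padded.values()) + 1
    counter = 0
    while len(padded) % multiple != 0:
        token = "<pad_%d>" % counter  # hypothetical filler token name
        counter += 1
        if token in padded:
            continue  # never overwrite a token that already exists
        padded[token] = next_id
        next_id += 1
    return padded


vocab = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4, "word": 5}
padded = pad_vocab_sketch(vocab)
print(len(vocab), "->", len(padded))
```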