text2text¶
t2t¶
Input pipeline for the transformer model to read, filter, and batch examples.
Two things to note in the pipeline:
Batching scheme
The examples encoded in the TFRecord files contain data in the format:
  {"inputs": [variable length array of integers],
   "targets": [variable length array of integers]}
where integers in the arrays refer to tokens in the English and German vocab file (named vocab.ende.32768).
Prior to batching, elements in the dataset are grouped by length (max between “inputs” and “targets” length). Each group is then batched such that:
group_batch_size * length <= batch_size.
Another way to view batch_size is the maximum number of tokens in each batch.
Once batched, each element in the dataset will have the shape:
  {"inputs": [group_batch_size, padded_input_length],
   "targets": [group_batch_size, padded_target_length]}
Lengths are padded to the longest "inputs" or "targets" sequence in the batch (padded_input_length and padded_target_length can be different).
This batching scheme decreases the fraction of padding tokens per training batch, thus improving the training speed significantly.
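A quick numeric illustration of the token-budget view of batch_size (a minimal sketch; the numbers are made up, only the inequality group_batch_size * length <= batch_size comes from the scheme above):

batch_size = 4096  # token budget per batch
for length in (32, 64, 256):  # padded sequence length of a bucket
    group_batch_size = batch_size // length
    assert group_batch_size * length <= batch_size
    print(length, group_batch_size)  # 32 -> 128, 64 -> 64, 256 -> 16 sentence pairs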
Shuffling
While training, the dataset is shuffled in two places in the code. The first is the list of training files. Second, while reading records using parallel_interleave, the sloppy argument is used to generate randomness in the order of the examples.
Modified slightly to fit OpenSeq2Seq needs
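A minimal sketch of the two shuffle points described above, written with current tf.data APIs (the file pattern is a placeholder; interleave with deterministic=False plays the role of parallel_interleave's sloppy argument):

import tensorflow as tf

files = tf.data.Dataset.list_files("wmt32k-train-*", shuffle=True)  # 1) shuffle the file list
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=False)  # 2) sloppy-style randomness in the order of examples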
-
data.text2text.t2t.
_batch_examples
(dataset, batch_size, max_length, pad_2_eight=True)[source]¶ Group examples by similar lengths, and return batched dataset.
Each batch of similar-length examples is padded to the same length, and different batches may contain different numbers of elements, such that:
group_batch_size * padded_length <= batch_size. This decreases the number of padding tokens per batch, which improves the training speed.
Parameters: - dataset – Dataset of unbatched examples.
- batch_size – Max number of tokens per batch of examples.
- max_length – Max number of tokens in an example input or target sequence.
Returns: Dataset of batched examples with similar lengths.
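A hedged sketch of the same bucketing idea, using tf.data.experimental.bucket_by_sequence_length; the real _batch_examples derives its own boundaries and uses a group_by_window-style reduction, so treat this only as an illustration (boundaries every 8 tokens are an arbitrary choice):

import tensorflow as tf

def token_bucketed_batches(dataset, batch_size, max_length):
    # Illustrative bucket boundaries; the library derives its own via
    # _create_min_max_boundaries.
    boundaries = list(range(8, max_length, 8))
    # One batch size per bucket so that examples_per_batch * bucket_length <= batch_size.
    batch_sizes = [max(1, batch_size // b) for b in boundaries + [max_length]]

    def element_length(example):
        return tf.maximum(tf.shape(example["inputs"])[0],
                          tf.shape(example["targets"])[0])

    return dataset.apply(tf.data.experimental.bucket_by_sequence_length(
        element_length_func=element_length,
        bucket_boundaries=boundaries,
        bucket_batch_sizes=batch_sizes,
        padded_shapes={"inputs": [None], "targets": [None]}))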
-
data.text2text.t2t.
_create_min_max_boundaries
(max_length, min_boundary=8, boundary_scale=1.1)[source]¶ Create min and max boundary lists up to max_length.
For example, when max_length=24, min_boundary=4 and boundary_scale=2, the returned values will be:
buckets_min = [0, 4, 8, 16, 24]
buckets_max = [4, 8, 16, 24, 25]
Parameters: - max_length – The maximum length of an example in the dataset.
- min_boundary – Minimum length in boundary.
- boundary_scale – Amount to scale consecutive boundaries in the list.
Returns: min and max boundary lists
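A minimal pure-Python sketch of one plausible way to build such boundary lists (growing each boundary by roughly boundary_scale and advancing by at least 1); this is an assumption about the implementation, not a copy of it, and its exact boundary placement may differ slightly from the library's:

def make_boundaries(max_length, min_boundary=8, boundary_scale=1.1):
    boundaries = []
    x = min_boundary
    while x < max_length:
        boundaries.append(x)
        x = max(x + 1, int(x * boundary_scale))  # grow by ~boundary_scale, at least +1
    buckets_min = [0] + boundaries               # lower edge of each bucket
    buckets_max = boundaries + [max_length + 1]  # exclusive upper edge of each bucket
    return buckets_min, buckets_max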
-
data.text2text.t2t.
_filter_max_length
(example, max_length=256)[source]¶ Indicates whether the example’s length is lower than the maximum length.
-
data.text2text.t2t.
_get_example_length
(example)[source]¶ Returns the maximum length between the example inputs and targets.
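Minimal sketches of the two helpers above, assuming an example is the {"inputs", "targets"} dict described at the top of this page and that "length" means the maximum of the two sequence lengths:

import tensorflow as tf

def example_length(example):
    return tf.maximum(tf.shape(example["inputs"])[0],
                      tf.shape(example["targets"])[0])

def keep_example(example, max_length=256):
    # Used with dataset.filter(...) to drop over-long examples.
    return tf.less_equal(example_length(example), max_length)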
-
data.text2text.t2t.
_parse_example
(serialized_example, pad_2_eight=False)[source]¶ Return inputs and targets Tensors from a serialized tf.Example.
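A hedged sketch of such a parser, assuming the TFRecords store "inputs" and "targets" as variable-length integer features (feature names follow the format described at the top of this page):

import tensorflow as tf

def parse_example_sketch(serialized_example):
    parsed = tf.io.parse_single_example(
        serialized_example,
        features={"inputs": tf.io.VarLenFeature(tf.int64),
                  "targets": tf.io.VarLenFeature(tf.int64)})
    return {"inputs": tf.sparse.to_dense(parsed["inputs"]),
            "targets": tf.sparse.to_dense(parsed["targets"])}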
-
data.text2text.t2t.
_read_and_batch_from_files
(file_pattern, batch_size, max_length, num_cpu_cores, shuffle, repeat, num_workers, worker_id, batch_in_tokens, pad2eight=True)[source]¶ Create dataset where each item is a dict of “inputs” and “targets”.
Parameters: - file_pattern – String used to match the input TFRecord files.
- batch_size – Maximum number of tokens per batch of examples
- max_length – Maximum number of tokens per example
- num_cpu_cores – Number of cpu cores for parallel input processing.
- shuffle – If true, randomizes order of elements.
- repeat – Number of times to repeat the dataset. If None, the dataset is repeated forever.
- num_workers – Number of workers or number of Horovod workers
- worker_id – Worker id or Horovod rank
- batch_in_tokens – whether batch_size refers to a number of tokens or a number of sentence pairs. Batching in tokens is more efficient, as it reduces the number of PAD tokens; batching in sentences should be used in inference mode, since the order of sentences is important.
- pad2eight – if True, it will pad both dimensions to be divisible by 8
Returns: tf.data.Dataset object containing examples loaded from the files.
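A hypothetical call, with placeholder paths and hyperparameters, just to show how the arguments above fit together:

dataset = _read_and_batch_from_files(
    file_pattern="wmt32k-train-*",  # placeholder TFRecord pattern
    batch_size=4096,                # max tokens per batch (batch_in_tokens=True)
    max_length=256,
    num_cpu_cores=4,
    shuffle=True,
    repeat=None,                    # repeat forever
    num_workers=1,
    worker_id=0,
    batch_in_tokens=True,
    pad2eight=True)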
text2text¶
-
class
data.text2text.text2text.
ParallelTextDataLayer
(params, model, num_workers=1, worker_id=0)[source]¶ Bases:
open_seq2seq.data.data_layer.DataLayer
-
create_feed_dict
(model_in)[source]¶ Creates the feed dict for interactive infer
Parameters: model_in (str) – the string to be translated. Should be in BPE format.
Returns: Dictionary with values for the placeholders.
Return type: feed_dict (dict)
-
create_interactive_placeholders
()[source]¶ A function that must be defined for data layers that support interactive infer. This function is intended to create the placeholders that will populate self._input_tensors and then be passed to the model.
-
static
get_optional_params
()[source]¶ Static method with description of optional parameters.
Returns: Dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type: dict
-
static
get_required_params
()[source]¶ Static method with description of required parameters.
Returns: Dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type: dict
-
get_size_in_samples
()[source]¶ Should return the dataset size in samples, that is, the number of objects in the dataset. This method is used to calculate a valid epoch size. If this method is not defined, you will need to make sure that your dataset for evaluation is created only for one epoch. You will also not be able to use the num_epochs parameter in the base config.
Returns: dataset size in samples.
Return type: int
-
input_tensors
¶ Dictionary containing input tensors. This dictionary has to define the following keys: source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). And when self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.
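An illustrative shape of that dictionary (placeholder tensors only; the real tensors are built in build_graph()):

import tensorflow as tf

src_ids = tf.zeros([8, 20], dtype=tf.int32)  # [batch, padded_src_len]
src_len = tf.fill([8], 20)
tgt_ids = tf.zeros([8, 22], dtype=tf.int32)  # [batch, padded_tgt_len]
tgt_len = tf.fill([8], 22)

input_tensors = {
    "source_tensors": [src_ids, src_len],  # consumed by the encoder
    "target_tensors": [tgt_ids, tgt_len],  # omitted when mode == "infer"
}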
-
iterator
¶ tf.data.Dataset iterator. Should be created by self.build_graph().
-
-
class
data.text2text.text2text.
SpecialTextTokens
[source]¶ Bases:
enum.Enum
An enumeration.
-
END_OF_CHOICE
= -100¶
-
EOS_ID
= 1¶
-
OUT_OF_BUCKET
= 1234567890¶
-
PAD_ID
= 0¶
-
S_ID
= 2¶
-
UNK_ID
= 3¶
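A small hedged illustration of how PAD_ID and EOS_ID are typically applied when preparing a sequence (plain Python, not library code):

PAD_ID, EOS_ID = 0, 1

def append_eos_and_pad(ids, length):
    ids = ids + [EOS_ID]                         # mark end of sentence
    return ids + [PAD_ID] * (length - len(ids))  # right-pad to a fixed length

print(append_eos_and_pad([17, 42, 5], 8))  # [17, 42, 5, 1, 0, 0, 0, 0]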
-
-
class
data.text2text.text2text.
TransformerDataLayer
(params, model, num_workers=1, worker_id=0)[source]¶ Bases:
open_seq2seq.data.data_layer.DataLayer
Wraps the Transformer data pipeline into the form used by OpenSeq2Seq.
-
static
get_optional_params
()[source]¶ Static method with description of optional parameters.
Returns: Dictionary containing all the parameters that can be included into the params parameter of the class __init__() method.
Return type: dict
-
static
get_required_params
()[source]¶ Static method with description of required parameters.
Returns: Dictionary containing all the parameters that have to be included into the params parameter of the class __init__() method.
Return type: dict
-
input_tensors
¶ Dictionary containing input tensors. This dictionary has to define the following keys: source_tensors, which should contain all tensors describing the input object (i.e. tensors that are passed to the encoder, e.g. input sequence and input length). And when self.params['mode'] != "infer", the data layer should also define target_tensors, which is the list of all tensors related to the corresponding target object (i.e. tensors that are passed to the decoder and loss, e.g. target sequence and target length). Note that all tensors have to be created inside the self.build_graph() method.
-
iterator
¶ tf.data.Dataset iterator. Should be created by self.build_graph().
-
tokenizer¶
Defines Subtokenizer class to encode and decode strings.
-
class
data.text2text.tokenizer.
Subtokenizer
(vocab_file, reserved_tokens=None)[source]¶ Bases:
object
Encodes and decodes strings to/from integer IDs.
-
__init__
(vocab_file, reserved_tokens=None)[source]¶ Initializes class, creating a vocab file if data_files is provided.
-
_subtoken_ids_to_tokens
(subtokens)[source]¶ Convert list of int subtoken ids to a list of string tokens.
-
static
init_from_files
(vocab_file, files, target_vocab_size, threshold, min_count=None, file_byte_limit=1000000.0, reserved_tokens=None)[source]¶ Create subtoken vocabulary based on files, and save vocab to file.
Parameters: - vocab_file – String name of vocab file to store subtoken vocabulary.
- files – List of file paths that will be used to generate vocabulary.
- target_vocab_size – target vocabulary size to generate.
- threshold – int threshold of vocabulary size to accept.
- min_count – int minimum count to use for generating the vocabulary. The min count is the minimum number of times a subtoken should appear in the files before it is added to the vocabulary. If set to None, this value is found using binary search.
- file_byte_limit – (Default 1e6) Maximum number of bytes of sample text that will be drawn from the files.
- reserved_tokens – List of string tokens that are guaranteed to be at the beginning of the subtoken vocabulary list.
Returns: Subtokenizer object
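A hypothetical usage sketch; the file names and sizes are placeholders, and encode()/decode() are assumed to be the round-trip methods implied by the class description:

subtokenizer = Subtokenizer.init_from_files(
    vocab_file="vocab.ende.32768",   # vocab file name from the pipeline description above
    files=["train.en", "train.de"],  # placeholder corpora
    target_vocab_size=32768,
    threshold=327)
ids = subtokenizer.encode("hello world")
print(subtokenizer.decode(ids))      # should round-trip to "hello world"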
-
-
data.text2text.tokenizer.
_count_and_gen_subtokens
(token_counts, alphabet, subtoken_dict, max_subtoken_length)[source]¶ Count number of times subtokens appear, and generate new subtokens.
Parameters: - token_counts – dict mapping tokens to the number of times they appear in the original files.
- alphabet – list of allowed characters. Used to escape the tokens, which guarantees that all tokens can be split into subtokens.
- subtoken_dict – dict mapping subtokens to ids.
- max_subtoken_length – maximum length of subtoken in subtoken_dict.
Returns: A defaultdict mapping subtokens to the number of times they appear in the tokens. The dict may contain new subtokens.
-
data.text2text.tokenizer.
_count_tokens
(files, file_byte_limit=1000000.0)[source]¶ Return token counts of words in the files.
Samples file_byte_limit bytes from each file, and counts the words that appear in the samples. The samples are semi-evenly distributed across the file.
Parameters: - files – List of filepaths
- file_byte_limit – Max number of bytes that will be read from each file.
Returns: Dictionary mapping tokens to the number of times they appear in the sampled lines from the files.
-
data.text2text.tokenizer.
_escape_token
(token, alphabet)[source]¶ Replace characters that aren’t in the alphabet and append “_” to token.
Apply three transformations to the token:
- Replace the underscore character "_" with "\u", and the backslash "\" with "\\".
- Replace characters outside of the alphabet with "\###;", where ### is the character's Unicode code point.
- Append "_" to mark the end of the token.
Parameters: - token – unicode string to be escaped
- alphabet – list of all known characters
Returns: escaped string
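A minimal sketch mirroring those three transformations (an assumption about the exact logic, not a copy of the library function):

def escape_token_sketch(token, alphabet):
    token = token.replace("\\", "\\\\").replace("_", "\\u")            # 1) escape "\" and "_"
    chars = [c if c in alphabet else "\\%d;" % ord(c) for c in token]  # 2) escape out-of-alphabet chars
    return "".join(chars) + "_"                                        # 3) mark end of token

print(escape_token_sketch("naïve_", set("naive_\\u")))  # na\239;ve\u_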
-
data.text2text.tokenizer.
_filter_and_bucket_subtokens
(subtoken_counts, min_count)[source]¶ Return a bucketed list of subtokens that are filtered by count.
Parameters: - subtoken_counts – defaultdict mapping subtokens to their counts
- min_count – int count used to filter subtokens
Returns: List of subtoken sets, where subtokens in set i have the same length=i.
-
data.text2text.tokenizer.
_gen_new_subtoken_list
(subtoken_counts, min_count, alphabet, reserved_tokens=None)[source]¶ Generate candidate subtokens ordered by count, and new max subtoken length.
Add subtokens to the candidate list in order of length (longest subtokens first). When a subtoken is added, the counts of each of its prefixes are decreased. Prefixes that don’t appear much outside the subtoken are not added to the candidate list.
For example, suppose the subtoken being added to the candidate list is 'translate', with subtoken_counts: {'translate': 10, 't': 40, 'tr': 16, 'tra': 12, ...} and min_count: 5.
When 'translate' is added, subtoken_counts is updated to:
{'translate': 0, 't': 30, 'tr': 6, 'tra': 2, ...}
The subtoken ‘tra’ will not be added to the candidate list, because it appears twice (less than min_count) outside of ‘translate’.
Parameters: - subtoken_counts – defaultdict mapping str subtokens to int counts
- min_count – int minimum count requirement for subtokens
- alphabet – set of characters. Each character is added to the subtoken list to guarantee that all tokens can be encoded.
- reserved_tokens – list of tokens that will be added to the beginning of the returned subtoken list.
Returns: List of candidate subtokens in decreasing count order, and maximum subtoken length
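A small sketch reproducing the 'translate' bookkeeping above (only the prefix-count update; candidate selection itself is omitted):

from collections import defaultdict

subtoken_counts = defaultdict(int, {"translate": 10, "t": 40, "tr": 16, "tra": 12})
new_subtoken = "translate"
count = subtoken_counts[new_subtoken]
for end in range(1, len(new_subtoken) + 1):
    prefix = new_subtoken[:end]
    if prefix in subtoken_counts:        # the real data holds counts for every prefix
        subtoken_counts[prefix] -= count
print({k: subtoken_counts[k] for k in ("translate", "t", "tr", "tra")})
# {'translate': 0, 't': 30, 'tr': 6, 'tra': 2} -> 'tra' now falls below min_count=5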
-
data.text2text.tokenizer.
_generate_alphabet_dict
(iterable, reserved_tokens=None)[source]¶ Create set of characters that appear in any element in the iterable.
-
data.text2text.tokenizer.
_generate_subtokens
(token_counts, alphabet, min_count, num_iterations=4, reserved_tokens=None)[source]¶ Create a list of subtokens in decreasing order of frequency.
Parameters: - token_counts – dict mapping str tokens -> int count
- alphabet – set of characters
- min_count – int minimum number of times a subtoken must appear before it is added to the vocabulary.
- num_iterations – int number of iterations to generate new tokens.
- reserved_tokens – list of tokens that will be added to the beginning of the returned subtoken list.
Returns: Sorted list of subtokens (most frequent first)
-
data.text2text.tokenizer.
_generate_subtokens_with_target_vocab_size
(token_counts, alphabet, target_size, threshold, min_count=None, reserved_tokens=None)[source]¶ Generate subtoken vocabulary close to the target size.
-
data.text2text.tokenizer.
_list_to_index_dict
(lst)[source]¶ Create dictionary mapping list items to their indices in the list.
-
data.text2text.tokenizer.
_load_vocab_file
(vocab_file, reserved_tokens=None)[source]¶ Load vocabulary while ensuring reserved tokens are at the top.
-
data.text2text.tokenizer.
_native_to_unicode
(s)[source]¶ Convert string to unicode (required in Python 2).
-
data.text2text.tokenizer.
_save_vocab_file
(vocab_file, subtoken_list)[source]¶ Save subtokens to file.
-
data.text2text.tokenizer.
_split_string_to_tokens
(text)[source]¶ Splits text into a list of string tokens.
-
data.text2text.tokenizer.
_split_token_to_subtokens
(token, subtoken_dict, max_subtoken_length)[source]¶ Splits a token into subtokens defined in the subtoken dict.
-
data.text2text.tokenizer.
_unicode_to_native
(s)[source]¶ Convert string from unicode to native format (required in Python 2).
-
data.text2text.tokenizer.
join_tokens_to_string
(tokens)[source]¶ Join a list of string tokens into a single string.
-
data.text2text.tokenizer.
unescape_token
(token)[source]¶ Replaces escaped characters in the token with their unescaped versions.
Applies the inverse transformations of _escape_token():
- Replace "\u" with "_", and "\\" with "\".
- Replace "\###;" with the Unicode character that ### refers to.
Parameters: token – escaped string
Returns: unescaped string
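A minimal sketch of the inverse mapping (the regex and its handling are assumptions consistent with the description above, not the library's code):

import re

_UNESCAPE_RE = re.compile(r"\\u|\\\\|\\([0-9]+);")

def unescape_token_sketch(token):
    def repl(m):
        if m.group(0) == r"\u":
            return "_"
        if m.group(0) == "\\\\":
            return "\\"
        return chr(int(m.group(1)))   # "\###;" -> character with that code point
    return _UNESCAPE_RE.sub(repl, token)

print(unescape_token_sketch(r"\99;afe\u"))  # cafe_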