transformer¶
attention_layer¶
Implementation of multiheaded attention and self-attention layers.
class parts.transformer.attention_layer.Attention(hidden_size, num_heads, attention_dropout, train, mode='loung', regularizer=None, window_size=None, back_step_size=None)[source]¶
Bases: tensorflow.python.layers.base.Layer
Multi-headed attention layer.
call(x, y, bias, cache=None, positions=None)[source]¶
Apply attention mechanism to x and y.
Parameters:
- x – a tensor with shape [batch_size, length_x, hidden_size]
- y – a tensor with shape [batch_size, length_y, hidden_size]
- bias – attention bias that will be added to the result of the dot product.
- cache – (used during prediction) dictionary with tensors containing results of previous attentions. The dictionary must have the items {"k": tensor with shape [batch_size, i, key_channels], "v": tensor with shape [batch_size, i, value_channels]}, where i is the current decoded length.
- positions – decoder-encoder alignment for previous steps, a tensor with shape [batch_size, n_heads, length_x]
Returns: Attention layer output with shape [batch_size, length_x, hidden_size]
combine_heads(x)[source]¶
Combine tensor that has been split.
Parameters: x – A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
Returns: A tensor with shape [batch_size, length, hidden_size]
split_heads(x)[source]¶
Split x into different heads, and transpose the resulting value.
The tensor is transposed to ensure the inner dimensions hold the correct values during the matrix multiplication.
Parameters: x – A tensor with shape [batch_size, length, hidden_size]
Returns: A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
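To make the head split and recombination concrete, here is a small standalone sketch of the two reshapes (plain TensorFlow with static shapes, not the layer's own code; helper names mirror the methods above):

    import tensorflow as tf

    def split_heads(x, num_heads):
        # [batch_size, length, hidden_size] -> [batch_size, num_heads, length, hidden_size/num_heads]
        batch_size, length, hidden_size = x.shape.as_list()
        depth = hidden_size // num_heads
        x = tf.reshape(x, [batch_size, length, num_heads, depth])
        return tf.transpose(x, [0, 2, 1, 3])

    def combine_heads(x):
        # [batch_size, num_heads, length, depth] -> [batch_size, length, num_heads * depth]
        batch_size, num_heads, length, depth = x.shape.as_list()
        x = tf.transpose(x, [0, 2, 1, 3])
        return tf.reshape(x, [batch_size, length, num_heads * depth])

    x = tf.ones([2, 5, 8])                # batch_size=2, length=5, hidden_size=8
    heads = split_heads(x, num_heads=4)   # shape [2, 4, 5, 2]
    restored = combine_heads(heads)       # shape [2, 5, 8], equal to x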
class parts.transformer.attention_layer.SelfAttention(hidden_size, num_heads, attention_dropout, train, mode='loung', regularizer=None, window_size=None, back_step_size=None)[source]¶
Bases: parts.transformer.attention_layer.Attention
Multi-headed self-attention layer.
call(x, bias, cache=None)[source]¶
Apply self-attention mechanism to x (queries, keys and values are all computed from x).
Parameters:
- x – a tensor with shape [batch_size, length_x, hidden_size]
- bias – attention bias that will be added to the result of the dot product.
- cache – (used during prediction) dictionary with tensors containing results of previous attentions. The dictionary must have the items {"k": tensor with shape [batch_size, i, key_channels], "v": tensor with shape [batch_size, i, value_channels]}, where i is the current decoded length.
Returns: Attention layer output with shape [batch_size, length_x, hidden_size]
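Conceptually the method reduces to the general attention call with y = x, as in the following sketch (it assumes the parent signature shown above and is not copied from the library source):

    from parts.transformer.attention_layer import Attention

    class MySelfAttention(Attention):
        def call(self, x, bias, cache=None):
            # Queries, keys and values all come from the same tensor x.
            return super(MySelfAttention, self).call(x, x, bias, cache)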
beam_search¶
Beam search to find the translated sequence with the highest probability.
Source implementation from Tensor2Tensor: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/beam_search.py
class parts.transformer.beam_search.SequenceBeamSearch(symbols_to_logits_fn, vocab_size, batch_size, beam_size, alpha, max_decode_length, eos_id)[source]¶
Bases: object
Implementation of beam search loop.
_continue_search(state)[source]¶
Return whether to continue the search loop.
The loop should terminate when:
- the maximum decode length has been reached, or
- the worst score in the finished sequences is better than the best score in the alive sequences (i.e. the finished sequences are provably unchanging).
Parameters: state – A dictionary with the current loop state.
Returns: Bool tensor with value True if the loop should continue, False if it should terminate.
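The second condition can be made concrete with a sketch of the test (the variable names and the GNMT-style length penalty are illustrative assumptions, not the exact loop-state layout):

    import tensorflow as tf

    def continue_search(cur_index, max_decode_length, alpha,
                        alive_log_probs, finished_scores, finished_flags):
        # Condition 1: the maximum decode length has not been reached yet.
        not_at_max_length = tf.less(cur_index, max_decode_length)

        # Best score any alive sequence could still achieve: its current log
        # probability divided by the largest possible length-normalization factor.
        max_length_norm = tf.pow((5.0 + tf.cast(max_decode_length, tf.float32)) / 6.0, alpha)
        best_alive_scores = alive_log_probs[:, 0] / max_length_norm          # [batch_size]

        # Worst score among the finished sequences (-1e9 if nothing has finished yet).
        finished_mask = tf.cast(finished_flags, tf.float32)
        lowest_finished_scores = tf.reduce_min(finished_scores * finished_mask, axis=1)
        none_finished = 1.0 - tf.cast(tf.reduce_any(finished_flags, axis=1), tf.float32)
        lowest_finished_scores += none_finished * -1.0e9

        # Condition 2: some alive sequence could still beat a finished one.
        finished_cannot_change = tf.reduce_all(
            tf.greater(lowest_finished_scores, best_alive_scores))
        return tf.logical_and(not_at_max_length, tf.logical_not(finished_cannot_change))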
_create_initial_state(initial_ids, initial_cache)[source]¶
Return initial state dictionary and its shape invariants.
Parameters:
- initial_ids – initial ids to pass into the symbols_to_logits_fn. int tensor with shape [batch_size, 1]
- initial_cache – dictionary storing values to be passed into the symbols_to_logits_fn.
Returns: state and shape invariant dictionaries with keys from _StateKeys
_get_new_alive_state(new_seq, new_log_probs, new_cache)[source]¶
Gather the top k sequences that are still alive.
Parameters:
- new_seq – New sequences generated by growing the current alive sequences; int32 tensor with shape [batch_size, 2 * beam_size, cur_index + 1]
- new_log_probs – Log probabilities of new sequences; float32 tensor with shape [batch_size, beam_size]
- new_cache – Dict of cached values for each sequence.
Returns: Dictionary with alive keys from _StateKeys, containing the top beam_size sequences that are still alive (don't end with eos_id), the log probabilities of the top alive sequences, and a dict cache storing decoder states for the top alive sequences.
_get_new_finished_state(state, new_seq, new_log_probs)[source]¶
Combine new and old finished sequences, and gather the top k sequences.
Parameters:
- state – A dictionary with the current loop state.
- new_seq – New sequences generated by growing the current alive sequences; int32 tensor with shape [batch_size, beam_size, i + 1]
- new_log_probs – Log probabilities of new sequences; float32 tensor with shape [batch_size, beam_size]
Returns: Dictionary with finished keys from _StateKeys, containing the top beam_size finished sequences based on score, the scores of the finished sequences, and the finished flags of the finished sequences.
_grow_alive_seq(state)[source]¶
Grow alive sequences by one token, and collect top 2*beam_size sequences.
2*beam_size sequences are collected because some sequences may have reached the EOS token. 2*beam_size ensures that at least beam_size sequences are still alive.
Parameters: state – A dictionary with the current loop state.
Returns: Tuple of (top 2*beam_size sequences [batch_size, 2 * beam_size, cur_index + 1], scores of the returned sequences [batch_size, 2 * beam_size], new alive cache for each of the 2 * beam_size sequences)
_search_step(state)[source]¶
Beam search loop body.
Grow alive sequences by a single ID. Sequences that have reached the EOS token are marked as finished. The alive and finished sequences with the highest log probabilities and scores are returned.
A sequence's finished score is calculated by dividing the log probability by the length normalization factor. Without length normalization, the search is more likely to return shorter sequences.
Parameters: state – A dictionary with the current loop state.
Returns: new state dictionary.
class parts.transformer.beam_search._StateKeys[source]¶
Bases: object
Keys to the dictionary storing the state of the beam search loop.
ALIVE_CACHE = 'ALIVE_CACHE'¶
ALIVE_LOG_PROBS = 'ALIVE_LOG_PROBS'¶
ALIVE_SEQ = 'ALIVE_SEQ'¶
CUR_INDEX = 'CUR_INDEX'¶
FINISHED_FLAGS = 'FINISHED_FLAGS'¶
FINISHED_SCORES = 'FINISHED_SCORES'¶
FINISHED_SEQ = 'FINISHED_SEQ'¶
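Shapes of the entries stored under these keys, as described by the method docstrings above (an illustration only; the ALIVE_CACHE entry mirrors the model-specific decoder cache):

    # Illustrative layout of the loop-state dictionary keyed by the constants above.
    state = {
        "CUR_INDEX": 0,           # scalar int32, current decoded length i
        "ALIVE_SEQ": ...,         # int32 [batch_size, beam_size, i + 1]
        "ALIVE_LOG_PROBS": ...,   # float32 [batch_size, beam_size]
        "ALIVE_CACHE": ...,       # nested dict of tensors [batch_size, beam_size, ...]
        "FINISHED_SEQ": ...,      # int32 [batch_size, beam_size, i + 1]
        "FINISHED_SCORES": ...,   # float32 [batch_size, beam_size]
        "FINISHED_FLAGS": ...,    # bool [batch_size, beam_size]
    }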
parts.transformer.beam_search._expand_to_beam_size(tensor, beam_size)[source]¶
Tiles a given tensor by beam_size.
Parameters:
- tensor – tensor to tile [batch_size, …]
- beam_size – How much to tile the tensor by.
Returns: Tiled tensor [batch_size, beam_size, …]
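A sketch of the tiling (not the module's code; it assumes a tensor with a known rank):

    import tensorflow as tf

    def expand_to_beam_size(tensor, beam_size):
        # [batch_size, ...] -> [batch_size, beam_size, ...] by repeating each batch item.
        tensor = tf.expand_dims(tensor, axis=1)
        tile_dims = [1] * tensor.shape.ndims
        tile_dims[1] = beam_size
        return tf.tile(tensor, tile_dims)

    ids = tf.zeros([3, 7], dtype=tf.int32)         # [batch_size, length]
    tiled = expand_to_beam_size(ids, beam_size=4)  # shape [3, 4, 7]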
parts.transformer.beam_search._flatten_beam_dim(tensor)[source]¶
Reshapes the first two dimensions into a single dimension.
Parameters: tensor – Tensor to reshape, of shape [A, B, …]
Returns: Reshaped tensor of shape [A*B, …]
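For example (a static-shape sketch, not the module's own code):

    import tensorflow as tf

    def flatten_beam_dim(tensor):
        # [A, B, ...] -> [A * B, ...]: merge the batch and beam dimensions.
        shape = tensor.shape.as_list()
        return tf.reshape(tensor, [shape[0] * shape[1]] + shape[2:])

    x = tf.zeros([3, 4, 7])
    flat = flatten_beam_dim(x)   # shape [12, 7]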
parts.transformer.beam_search._gather_beams(nested, beam_indices, batch_size, new_beam_size)[source]¶
Gather beams from nested structure of tensors.
Each tensor in nested represents a batch of beams, where a beam refers to a single search state (beam search involves searching through multiple states in parallel).
This function is used to gather the top beams, specified by beam_indices, from the nested tensors.
Parameters:
- nested – Nested structure (tensor, list, tuple or dict) containing tensors with shape [batch_size, beam_size, …].
- beam_indices – int32 tensor with shape [batch_size, new_beam_size]. Each value in beam_indices is in the range [0, beam_size), and values are not necessarily unique.
- batch_size – int size of batch
- new_beam_size – int number of beams to be pulled from the nested tensors.
Returns: Nested structure containing tensors with shape [batch_size, new_beam_size, …]
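For a single (non-nested) tensor the gather can be sketched as follows (the real helper also walks nested structures; names are illustrative):

    import tensorflow as tf

    def gather_beams(tensor, beam_indices, batch_size, new_beam_size):
        # Pair every selected beam index with its batch index, then gather:
        # coordinates[b, k] = [b, beam_indices[b, k]].
        batch_pos = tf.range(batch_size * new_beam_size) // new_beam_size
        batch_pos = tf.reshape(batch_pos, [batch_size, new_beam_size])
        coordinates = tf.stack([batch_pos, beam_indices], axis=2)
        return tf.gather_nd(tensor, coordinates)          # [batch_size, new_beam_size, ...]

    seq = tf.reshape(tf.range(2 * 3 * 4), [2, 3, 4])      # [batch_size=2, beam_size=3, length=4]
    top = gather_beams(seq, tf.constant([[2, 0], [1, 1]]), batch_size=2, new_beam_size=2)
    # top has shape [2, 2, 4]: beams 2 and 0 of batch item 0, beam 1 (twice) of item 1.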
parts.transformer.beam_search._gather_topk_beams(nested, score_or_log_prob, batch_size, beam_size)[source]¶
Gather top beams from nested structure.
parts.transformer.beam_search._length_normalization(alpha, length)[source]¶
Return length normalization factor.
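A sketch of the commonly used GNMT-style factor, assuming this module follows the same form as the Tensor2Tensor reference linked above:

    import tensorflow as tf

    def length_normalization(alpha, length):
        # Finished scores are log_prob / penalty, so a larger alpha favors longer sequences.
        return tf.pow((5.0 + tf.cast(length, tf.float32)) / 6.0, alpha)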
parts.transformer.beam_search._shape_list(tensor)[source]¶
Return a list of the tensor's shape, ensuring there are no None values in the list.
parts.transformer.beam_search._unflatten_beam_dim(tensor, batch_size, beam_size)[source]¶
Reshapes the first dimension back to [batch_size, beam_size].
Parameters:
- tensor – Tensor to reshape, of shape [batch_size*beam_size, …]
- batch_size – Tensor, original batch size.
- beam_size – int, original beam size.
Returns: Reshaped tensor of shape [batch_size, beam_size, …]
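The inverse of _flatten_beam_dim, as a static-shape sketch:

    import tensorflow as tf

    def unflatten_beam_dim(tensor, batch_size, beam_size):
        # [batch_size * beam_size, ...] -> [batch_size, beam_size, ...]
        shape = tensor.shape.as_list()
        return tf.reshape(tensor, [batch_size, beam_size] + shape[1:])

    flat = tf.zeros([12, 7])
    x = unflatten_beam_dim(flat, batch_size=3, beam_size=4)   # shape [3, 4, 7]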
parts.transformer.beam_search.sequence_beam_search(symbols_to_logits_fn, initial_ids, initial_cache, vocab_size, beam_size, alpha, max_decode_length, eos_id)[source]¶
Search for sequence of subtoken ids with the largest probability.
Parameters:
- symbols_to_logits_fn – A function that takes in ids, index, and cache as arguments. The passed-in arguments will have the shapes: ids -> [batch_size * beam_size, index], index -> [] (scalar), cache -> nested dictionary of tensors [batch_size * beam_size, …]. The function must return logits of shape [batch_size * beam_size, vocab_size] and a new cache with the same shape/structure as the input cache.
- initial_ids – Starting ids for each batch item. int32 tensor with shape [batch_size]
- initial_cache – dict containing starting decoder variables information
- vocab_size – int size of the vocabulary
- beam_size – int number of beams
- alpha – float defining the strength of length normalization
- max_decode_length – maximum length of the decoded sequence
- eos_id – int id of the eos token, used to determine when a sequence has finished
Returns: Top decoded sequences [batch_size, beam_size, max_decode_length] and sequence scores [batch_size, beam_size]
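An illustrative call with a toy symbols_to_logits_fn (uniform logits, empty cache); a real model would run its decoder step here and update the cache, and the return shapes follow the docstring above:

    import tensorflow as tf
    from parts.transformer import beam_search

    batch_size, vocab_size, eos_id = 2, 10, 1

    def symbols_to_logits_fn(ids, index, cache):
        # ids: [batch_size * beam_size, index]; return logits [batch_size * beam_size, vocab_size].
        logits = tf.zeros([tf.shape(ids)[0], vocab_size])
        return logits, cache

    decoded_ids, scores = beam_search.sequence_beam_search(
        symbols_to_logits_fn=symbols_to_logits_fn,
        initial_ids=tf.zeros([batch_size], dtype=tf.int32),
        initial_cache={},
        vocab_size=vocab_size,
        beam_size=4,
        alpha=0.6,
        max_decode_length=6,
        eos_id=eos_id)
    # decoded_ids: [batch_size, beam_size, max_decode_length], scores: [batch_size, beam_size]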
common¶
class parts.transformer.common.LayerNormalization(hidden_size, params={})[source]¶
Bases: tensorflow.python.layers.base.Layer
Layer normalization for BTC format: supports L2 (default) and L1 modes.
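The default (L2) mode corresponds to standard layer normalization over the channel axis. A minimal functional sketch for a [batch, time, channels] input (the class above also creates trainable scale and bias variables and additionally supports an L1 mode):

    import tensorflow as tf

    def layer_norm_l2(x, scale, bias, epsilon=1e-6):
        # Normalize each [batch, time] position over its channel vector.
        mean = tf.reduce_mean(x, axis=-1, keepdims=True)
        variance = tf.reduce_mean(tf.square(x - mean), axis=-1, keepdims=True)
        normalized = (x - mean) * tf.math.rsqrt(variance + epsilon)
        return normalized * scale + bias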
class parts.transformer.common.PrePostProcessingWrapper(layer, params, training)[source]¶
Bases: object
Wrapper around layer that applies pre-processing and post-processing.
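A sketch of the usual pre/post-processing pattern such a wrapper implements (layer normalization before the wrapped layer, dropout and a residual connection after); the exact behaviour here is driven by params and the training flag:

    import tensorflow as tf

    def pre_post_process(wrapped_layer, layer_norm, x, dropout_rate, training, *args, **kwargs):
        y = wrapped_layer(layer_norm(x), *args, **kwargs)   # pre-processing: normalize the input
        if training:
            y = tf.nn.dropout(y, rate=dropout_rate)         # post-processing: dropout ...
        return x + y                                        # ... and residual connection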
embedding_layer¶
Implementation of embedding layer with shared weights.
class parts.transformer.embedding_layer.EmbeddingSharedWeights[source]¶
Bases: tensorflow.python.layers.base.Layer
Calculates input embeddings and pre-softmax linear with shared weights. Its build method creates the variables of the layer.

call(x)[source]¶
Get token embeddings of x.
Parameters: x – An int64 tensor with shape [batch_size, length]
Returns: embeddings – float32 tensor with shape [batch_size, length, embedding_size]; padding – float32 tensor with shape [batch_size, length] indicating the locations of the padding tokens in x.

linear(x)[source]¶
Computes logits by running x through a linear layer.
Parameters: x – A float32 tensor with shape [batch_size, length, hidden_size]
Returns: float32 tensor with shape [batch_size, length, vocab_size].
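The weight sharing can be illustrated with a standalone sketch: one [vocab_size, hidden_size] matrix serves both as the embedding table and as the pre-softmax projection (the sqrt(hidden_size) scaling is the usual Transformer convention and is an assumption here, not taken from this layer's code):

    import tensorflow as tf

    vocab_size, hidden_size = 100, 16
    shared_weights = tf.Variable(tf.random.normal([vocab_size, hidden_size]))

    def embed(x):
        # x: int tensor [batch_size, length] -> embeddings [batch_size, length, hidden_size]
        return tf.gather(shared_weights, x) * hidden_size ** 0.5

    def linear(y):
        # y: [batch_size, length, hidden_size] -> logits [batch_size, length, vocab_size]
        batch_size, length, _ = y.shape.as_list()
        logits = tf.matmul(tf.reshape(y, [-1, hidden_size]), shared_weights, transpose_b=True)
        return tf.reshape(logits, [batch_size, length, vocab_size])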
ffn_layer¶
Implementation of a fully connected network.
utils¶
Transformer model helper methods.
parts.transformer.utils.get_decoder_self_attention_bias(length, dtype=tf.float32)[source]¶
Calculate bias for decoder that maintains model's autoregressive property.
Creates a tensor that masks out locations that correspond to illegal connections, so prediction at position i cannot draw information from future positions.
Parameters: length – int length of sequences in batch.
Returns: float tensor of shape [1, 1, length, length]
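The mask can be sketched with a lower-triangular matrix (a sketch, using -1e9 as the "negative infinity" value described under get_padding_bias below):

    import tensorflow as tf

    def decoder_self_attention_bias(length):
        # Lower-triangular ones mark legal (non-future) connections; everything else
        # gets a large negative bias so softmax assigns it ~zero attention weight.
        valid_locations = tf.linalg.band_part(tf.ones([length, length]), -1, 0)
        bias = -1e9 * (1.0 - valid_locations)
        return tf.reshape(bias, [1, 1, length, length])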
parts.transformer.utils.get_padding(x, padding_value=0, dtype=tf.float32)[source]¶
Return float tensor representing the padding values in x.
Parameters:
- x – int tensor with any shape
- padding_value – int value that represents padding (the padding symbol)
- dtype – type of the output
Returns: float tensor with same shape as x containing values 0 or 1 (0 -> non-padding, 1 -> padding)
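The check itself is a single comparison; a sketch:

    import tensorflow as tf

    def get_padding(x, padding_value=0, dtype=tf.float32):
        # 1.0 where x equals the padding symbol, 0.0 elsewhere.
        return tf.cast(tf.equal(x, padding_value), dtype)

    x = tf.constant([[7, 3, 0, 0], [5, 0, 0, 0]])
    padding = get_padding(x)   # [[0., 0., 1., 1.], [0., 1., 1., 1.]]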
parts.transformer.utils.get_padding_bias(x, res_rank=4, pad_sym=0, dtype=tf.float32)[source]¶
Calculate bias tensor from padding values in tensor.
Bias tensor that is added to the pre-softmax multi-headed attention logits, which has shape [batch_size, num_heads, length, length]. The tensor is zero at non-padding locations, and -1e9 (negative infinity) at padding locations.
Parameters:
- x – int tensor with shape [batch_size, length]
- res_rank – int indicating the rank of attention_bias.
- dtype – type of the output attention_bias
- pad_sym – int symbol used for padding
Returns: Attention bias tensor of shape [batch_size, 1, 1, length] if res_rank = 4 (for Transformer), or [batch_size, 1, length] if res_rank = 3 (for ConvS2S)
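How the bias follows from get_padding for the default res_rank = 4 case (a sketch, not the module's code):

    import tensorflow as tf

    def get_padding_bias(x, pad_sym=0):
        # [batch_size, length] -> [batch_size, 1, 1, length]; padding positions get -1e9.
        padding = tf.cast(tf.equal(x, pad_sym), tf.float32)
        bias = padding * -1e9
        return tf.expand_dims(tf.expand_dims(bias, axis=1), axis=1)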
parts.transformer.utils.get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=10000.0)[source]¶
Return positional encoding.
Calculates the position encoding as a mix of sine and cosine functions with geometrically increasing wavelengths. Defined and formulated in "Attention Is All You Need", section 3.5.
Parameters:
- length – Sequence length.
- hidden_size – Size of the hidden (embedding) dimension.
- min_timescale – Minimum scale that will be applied at each position
- max_timescale – Maximum scale that will be applied at each position
Returns: Tensor with shape [length, hidden_size]
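A sketch of the computation, mirroring the Tensor2Tensor formulation (assumes hidden_size is even):

    import math
    import tensorflow as tf

    def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
        position = tf.cast(tf.range(length), tf.float32)
        num_timescales = hidden_size // 2
        # Geometric sequence of inverse wavelengths between the two timescales.
        log_timescale_increment = (math.log(max_timescale / min_timescale) /
                                   max(num_timescales - 1, 1))
        inv_timescales = min_timescale * tf.exp(
            tf.cast(tf.range(num_timescales), tf.float32) * -log_timescale_increment)
        scaled_time = tf.expand_dims(position, 1) * tf.expand_dims(inv_timescales, 0)
        # First half of the channels uses sine, second half cosine.
        return tf.concat([tf.sin(scaled_time), tf.cos(scaled_time)], axis=1)  # [length, hidden_size]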