transformer¶
attention_layer¶
Implementation of multiheaded attention and self-attention layers.
-
class
parts.transformer.attention_layer.
Attention
(hidden_size, num_heads, attention_dropout, train, mode='loung', regularizer=None, window_size=None, back_step_size=None)[source]¶ Bases:
tensorflow.python.layers.base.Layer
Multi-headed attention layer.
-
call
(x, y, bias, cache=None, positions=None)[source]¶ Apply attention mechanism to x and y.
Parameters: - x – a tensor with shape [batch_size, length_x, hidden_size]
- y – a tensor with shape [batch_size, length_y, hidden_size]
- bias – attention bias that will be added to the result of the dot product.
- cache –
(Used during prediction) dictionary with tensors containing results of previous attentions. The dictionary must have the items:
- {“k”: tensor with shape [batch_size, i, key_channels],
- ”v”: tensor with shape [batch_size, i, value_channels]}
where i is the current decoded length.
- positions – decoder-encoder alignment for previous steps [batch_size, n_heads, length_x]
Returns: Attention layer output with shape [batch_size, length_x, hidden_size]
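The shapes above can be traced with a small NumPy sketch of the underlying scaled dot-product attention. This is an illustration only, not the layer's actual TensorFlow code; the helper names and the per-head depth scaling are assumptions following the standard Transformer formulation.
    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def dot_product_attention(q, k, v, bias):
        # q: [batch, heads, length_x, depth]; k, v: [batch, heads, length_y, depth]
        # bias: broadcastable to [batch, heads, length_x, length_y]
        depth = q.shape[-1]
        logits = q @ k.transpose(0, 1, 3, 2) / np.sqrt(depth) + bias
        weights = softmax(logits, axis=-1)   # attention distribution over length_y
        return weights @ v                   # [batch, heads, length_x, depth]

    # Shape check: batch=2, heads=4, length_x=3, length_y=5, depth=8
    q = np.random.randn(2, 4, 3, 8)
    k = np.random.randn(2, 4, 5, 8)
    v = np.random.randn(2, 4, 5, 8)
    bias = np.zeros((2, 1, 1, 5))            # e.g. a padding bias, broadcast over heads and queries
    print(dot_product_attention(q, k, v, bias).shape)   # (2, 4, 3, 8)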
-
combine_heads
(x)[source]¶ Combine a tensor that has been split into heads.
Parameters: x – A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
Returns: A tensor with shape [batch_size, length, hidden_size]
-
split_heads
(x)[source]¶ Split x into different heads, and transpose the resulting value.
The tensor is transposed to ensure that the inner dimensions hold the correct values during the matrix multiplication.
Parameters: x – A tensor with shape [batch_size, length, hidden_size]
Returns: A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
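A NumPy sketch of the reshape/transpose pattern these two methods describe (the helper names below are illustrative, not the layer's own code):
    import numpy as np

    def split_heads(x, num_heads):
        # [batch, length, hidden] -> [batch, num_heads, length, hidden // num_heads]
        batch, length, hidden = x.shape
        x = x.reshape(batch, length, num_heads, hidden // num_heads)
        return x.transpose(0, 2, 1, 3)

    def combine_heads(x):
        # [batch, num_heads, length, hidden // num_heads] -> [batch, length, hidden]
        batch, num_heads, length, depth = x.shape
        return x.transpose(0, 2, 1, 3).reshape(batch, length, num_heads * depth)

    x = np.random.randn(2, 7, 16)
    assert combine_heads(split_heads(x, num_heads=4)).shape == x.shape
    assert np.allclose(combine_heads(split_heads(x, num_heads=4)), x)   # round trip is lossless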
-
-
class
parts.transformer.attention_layer.
SelfAttention
(hidden_size, num_heads, attention_dropout, train, mode='loung', regularizer=None, window_size=None, back_step_size=None)[source]¶ Bases:
parts.transformer.attention_layer.Attention
Multiheaded self-attention layer.
-
call
(x, bias, cache=None)[source]¶ Apply the attention mechanism to x (self-attention: queries, keys, and values are all derived from x).
Parameters: - x – a tensor with shape [batch_size, length_x, hidden_size]
- bias – attention bias that will be added to the result of the dot product.
- cache –
(Used during prediction) dictionary with tensors containing results of previous attentions. The dictionary must have the items:
- {“k”: tensor with shape [batch_size, i, key_channels],
- ”v”: tensor with shape [batch_size, i, value_channels]}
where i is the current decoded length.
Returns: Attention layer output with shape [batch_size, length_x, hidden_size]
-
beam_search¶
Beam search to find the translated sequence with the highest probability.
Source implementation from Tensor2Tensor: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/beam_search.py
-
class
parts.transformer.beam_search.
SequenceBeamSearch
(symbols_to_logits_fn, vocab_size, batch_size, beam_size, alpha, max_decode_length, eos_id)[source]¶ Bases:
object
Implementation of beam search loop.
-
_continue_search
(state)[source]¶ Return whether to continue the search loop.
- The loop should terminate when:
- the maximum decode length has been reached, or
- the worst score among the finished sequences is better than the best score among the alive sequences (i.e. the finished sequences are provably unchanging)
Parameters: state – A dictionary with the current loop state. Returns: Bool tensor with value True if loop should continue, False if loop should terminate.
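A plain-Python sketch of the two stopping conditions described above (the argument names are illustrative, not the class's internal state keys):
    def should_continue(cur_index, max_decode_length,
                        best_alive_score, worst_finished_score):
        # Stop once the maximum decode length is reached...
        within_length = cur_index < max_decode_length
        # ...or once no alive sequence can still beat the worst finished one.
        can_improve = best_alive_score > worst_finished_score
        return within_length and can_improve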
-
_create_initial_state
(initial_ids, initial_cache)[source]¶ Return initial state dictionary and its shape invariants.
Parameters: - initial_ids – initial ids to pass into the symbols_to_logits_fn. int tensor with shape [batch_size, 1]
- initial_cache – dictionary storing values to be passed into the symbols_to_logits_fn.
Returns: state and shape invariant dictionaries with keys from _StateKeys
-
_get_new_alive_state
(new_seq, new_log_probs, new_cache)[source]¶ Gather the top k sequences that are still alive.
Parameters: - new_seq – New sequences generated by growing the current alive sequences int32 tensor with shape [batch_size, 2 * beam_size, cur_index + 1]
- new_log_probs – Log probabilities of new sequences float32 tensor with shape [batch_size, beam_size]
- new_cache – Dict of cached values for each sequence.
Returns: - Top beam_size sequences that are still alive (don’t end with eos_id),
log probabilities of the top alive sequences, and a dict cache storing decoder states for the top alive sequences
Return type: Dictionary with alive keys from _StateKeys
-
_get_new_finished_state
(state, new_seq, new_log_probs)[source]¶ Combine new and old finished sequences, and gather the top k sequences.
Parameters: - state – A dictionary with the current loop state.
- new_seq – New sequences generated by growing the current alive sequences int32 tensor with shape [batch_size, beam_size, i + 1]
- new_log_probs – Log probabilities of new sequences float32 tensor with shape [batch_size, beam_size]
Returns: - Top beam_size finished sequences based on score,
scores of the finished sequences, and finished flags of the finished sequences
Return type: Dictionary with finished keys from _StateKeys
-
_grow_alive_seq
(state)[source]¶ Grow alive sequences by one token, and collect top 2*beam_size sequences.
2*beam_size sequences are collected because some sequences may have reached the EOS token. 2*beam_size ensures that at least beam_size sequences are still alive.
Parameters: state – A dictionary with the current loop state.
Returns: Tuple of
- Top 2*beam_size sequences [batch_size, 2 * beam_size, cur_index + 1]
- Scores of returned sequences [batch_size, 2 * beam_size]
- New alive cache, for each of the 2 * beam_size sequences
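A NumPy sketch of why 2*beam_size candidates are kept: after expanding each alive beam over the vocabulary, the top 2*beam_size continuations are selected, so even if up to beam_size of them end in EOS, at least beam_size alive sequences remain. Shapes follow the docstring; the variable names are illustrative, not the class internals.
    import numpy as np

    batch_size, beam_size, vocab_size = 2, 3, 11
    # Log probabilities of every candidate continuation: [batch, beam, vocab]
    cand_log_probs = np.log(
        np.random.dirichlet(np.ones(vocab_size), size=(batch_size, beam_size)))

    flat = cand_log_probs.reshape(batch_size, beam_size * vocab_size)
    topk = np.argsort(flat, axis=1)[:, ::-1][:, :2 * beam_size]   # top 2*beam_size candidates

    beam_indices = topk // vocab_size   # which alive beam each candidate grew from
    token_ids = topk % vocab_size       # which token was appended
    print(beam_indices.shape, token_ids.shape)   # (2, 6) (2, 6)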
-
_search_step
(state)[source]¶ Beam search loop body.
Grow alive sequences by a single ID. Sequences that have reached the EOS token are marked as finished. The alive and finished sequences with the highest log probabilities and scores are returned.
A sequence’s finished score is calculated by dividing the log probability by the length normalization factor. Without length normalization, the search is more likely to return shorter sequences.
Parameters: state – A dictionary with the current loop state. Returns: new state dictionary.
-
-
class
parts.transformer.beam_search.
_StateKeys
[source]¶ Bases:
object
Keys to dictionary storing the state of the beam search loop.
-
ALIVE_CACHE
= 'ALIVE_CACHE'¶
-
ALIVE_LOG_PROBS
= 'ALIVE_LOG_PROBS'¶
-
ALIVE_SEQ
= 'ALIVE_SEQ'¶
-
CUR_INDEX
= 'CUR_INDEX'¶
-
FINISHED_FLAGS
= 'FINISHED_FLAGS'¶
-
FINISHED_SCORES
= 'FINISHED_SCORES'¶
-
FINISHED_SEQ
= 'FINISHED_SEQ'¶
-
-
parts.transformer.beam_search.
_expand_to_beam_size
(tensor, beam_size)[source]¶ Tiles a given tensor by beam_size.
Parameters: - tensor – tensor to tile [batch_size, …]
- beam_size – How much to tile the tensor by.
Returns: Tiled tensor [batch_size, beam_size, …]
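A NumPy equivalent of this tiling (illustrative sketch, not the module's TensorFlow code):
    import numpy as np

    def expand_to_beam_size(tensor, beam_size):
        # Insert a beam axis after the batch axis and tile along it.
        tiled = np.expand_dims(tensor, axis=1)
        reps = [1, beam_size] + [1] * (tensor.ndim - 1)
        return np.tile(tiled, reps)

    x = np.random.randn(2, 5, 4)                      # [batch_size, ...]
    print(expand_to_beam_size(x, beam_size=3).shape)  # (2, 3, 5, 4)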
-
parts.transformer.beam_search.
_flatten_beam_dim
(tensor)[source]¶ Reshapes the first two dimensions into a single dimension.
Parameters: tensor – Tensor to reshape of shape [A, B, …] Returns: Reshaped tensor of shape [A*B, …]
-
parts.transformer.beam_search.
_gather_beams
(nested, beam_indices, batch_size, new_beam_size)[source]¶ Gather beams from nested structure of tensors.
Each tensor in nested represents a batch of beams, where beam refers to a single search state (beam search involves searching through multiple states in parallel).
This function is used to gather the top beams, specified by beam_indices, from the nested tensors.
Parameters: - nested – Nested structure (tensor, list, tuple or dict) containing tensors with shape [batch_size, beam_size, …].
- beam_indices – int32 tensor with shape [batch_size, new_beam_size]. Each value in beam_indices must be in [0, beam_size); values are not necessarily unique.
- batch_size – int size of batch
- new_beam_size – int number of beams to be pulled from the nested tensors.
Returns: Nested structure containing tensors with shape [batch_size, new_beam_size, …]
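The gather can be pictured with NumPy fancy indexing on a single tensor; the real function additionally walks nested structures of tensors, so this is only a sketch of the per-tensor operation.
    import numpy as np

    def gather_beams(tensor, beam_indices):
        # tensor: [batch_size, beam_size, ...]; beam_indices: [batch_size, new_beam_size]
        batch_size = tensor.shape[0]
        return tensor[np.arange(batch_size)[:, None], beam_indices]

    t = np.random.randn(2, 4, 6)        # [batch, beam, hidden]
    idx = np.array([[3, 0], [1, 1]])    # top-2 beams per batch item (values may repeat)
    print(gather_beams(t, idx).shape)   # (2, 2, 6)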
-
parts.transformer.beam_search.
_gather_topk_beams
(nested, score_or_log_prob, batch_size, beam_size)[source]¶ Gather top beams from nested structure.
-
parts.transformer.beam_search.
_length_normalization
(alpha, length)[source]¶ Return length normalization factor.
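The Tensor2Tensor reference this module follows uses the GNMT-style factor ((5 + length) / 6) ** alpha, and a finished sequence's score is its log probability divided by this factor. A hedged sketch, assuming that formula holds here as well:
    def length_normalization(alpha, length):
        # GNMT-style length penalty; alpha = 0 disables normalization.
        return ((5.0 + length) / 6.0) ** alpha

    log_prob = -12.0
    score = log_prob / length_normalization(alpha=0.6, length=10)
    print(score)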
-
parts.transformer.beam_search.
_shape_list
(tensor)[source]¶ Return a list of the tensor’s shape, and ensure no None values in list.
-
parts.transformer.beam_search.
_unflatten_beam_dim
(tensor, batch_size, beam_size)[source]¶ Reshapes first dimension back to [batch_size, beam_size].
Parameters: - tensor – Tensor to reshape of shape [batch_size*beam_size, …]
- batch_size – Tensor, original batch size.
- beam_size – int, original beam size.
Returns: Reshaped tensor of shape [batch_size, beam_size, …]
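A NumPy sketch of the flatten/unflatten round trip used to run the model on [batch_size * beam_size, …] inputs and then restore the beam axis (illustrative helpers, not the module's code):
    import numpy as np

    def flatten_beam_dim(tensor):
        shape = tensor.shape
        return tensor.reshape((shape[0] * shape[1],) + shape[2:])

    def unflatten_beam_dim(tensor, batch_size, beam_size):
        return tensor.reshape((batch_size, beam_size) + tensor.shape[1:])

    x = np.random.randn(2, 3, 7)     # [batch, beam, ...]
    flat = flatten_beam_dim(x)       # shape (6, 7)
    assert np.array_equal(unflatten_beam_dim(flat, 2, 3), x)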
-
parts.transformer.beam_search.
sequence_beam_search
(symbols_to_logits_fn, initial_ids, initial_cache, vocab_size, beam_size, alpha, max_decode_length, eos_id)[source]¶ Search for sequence of subtoken ids with the largest probability.
Parameters: - symbols_to_logits_fn –
A function that takes in ids, index, and cache as arguments. The passed-in arguments will have shape:
- ids -> [batch_size * beam_size, index]
- index -> [] (scalar)
- cache -> nested dictionary of tensors [batch_size * beam_size, …]
The function must return logits and new cache:
- logits -> [batch_size * beam_size, vocab_size]
- new cache -> same shape/structure as the input cache
- initial_ids – Starting ids for each batch item. int32 tensor with shape [batch_size]
- initial_cache – dict containing starting decoder variables information
- vocab_size – int size of the vocabulary (number of tokens)
- beam_size – int number of beams
- alpha – float defining the strength of length normalization
- max_decode_length – maximum length of the decoded sequences
- eos_id – int id of eos token, used to determine when a sequence has finished
Returns: Top decoded sequences [batch_size, beam_size, max_decode_length] and sequence scores [batch_size, beam_size]
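The contract of symbols_to_logits_fn can be illustrated with a toy stand-in. This is a shape-only sketch using NumPy arrays; the real callable receives and returns TensorFlow tensors and a nested cache, and the vocabulary size here is an arbitrary assumption.
    import numpy as np

    VOCAB_SIZE = 8

    def toy_symbols_to_logits_fn(ids, index, cache):
        # ids: [batch_size * beam_size, index] per the docstring above
        # index: scalar decode step; cache: nested dict of [batch_size * beam_size, ...] arrays
        batch_times_beam = ids.shape[0]
        logits = np.random.randn(batch_times_beam, VOCAB_SIZE)   # [batch*beam, vocab]
        return logits, cache   # cache must keep the same structure/shapes

    ids = np.zeros((4 * 2, 1), dtype=np.int32)   # batch_size=4, beam_size=2
    logits, cache = toy_symbols_to_logits_fn(ids, index=1, cache={})
    print(logits.shape)                          # (8, 8)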
common¶
-
class
parts.transformer.common.
LayerNormalization
(hidden_size, params={})[source]¶ Bases:
tensorflow.python.layers.base.Layer
Layer normalization for BTC format: supports L2 (default) and L1 modes
-
class
parts.transformer.common.
PrePostProcessingWrapper
(layer, params, training)[source]¶ Bases:
object
Wrapper around a layer that applies pre-processing before the layer and post-processing after it.
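In the standard Transformer layout this wrapper applies layer normalization before the wrapped layer and dropout plus a residual connection after it. A minimal NumPy sketch of that pattern, assuming this is the pre/post-processing meant here; dropout and the learned scale/bias of layer normalization are omitted, and only the L2 variant is shown.
    import numpy as np

    def layer_norm(x, epsilon=1e-6):
        # Normalize over the channel (last) axis of a [batch, time, channels] tensor.
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + epsilon)

    def pre_post_process(x, layer_fn):
        # Pre-process: layer norm.  Post-process: residual connection (dropout omitted).
        return x + layer_fn(layer_norm(x))

    x = np.random.randn(2, 5, 16)
    out = pre_post_process(x, layer_fn=lambda y: y * 0.5)   # toy stand-in for a sublayer
    print(out.shape)                                        # (2, 5, 16)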
embedding_layer¶
Implementation of embedding layer with shared weights.
Bases:
tensorflow.python.layers.base.Layer
Calculates input embeddings and pre-softmax linear with shared weights.
Creates the variables of the layer.
Get token embeddings of x.
Parameters: x – An int64 tensor with shape [batch_size, length]
Returns: embeddings – float32 tensor with shape [batch_size, length, embedding_size]; padding – float32 tensor with shape [batch_size, length] indicating the locations of the padding tokens in x.
Computes logits by running x through a linear layer.
Parameters: x – A float32 tensor with shape [batch_size, length, hidden_size] Returns: float32 tensor with shape [batch_size, length, vocab_size].
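Weight sharing means the same embedding matrix is used both for the input lookup and, transposed, for the pre-softmax projection. An illustrative NumPy sketch (variable and function names are assumptions):
    import numpy as np

    vocab_size, hidden_size = 100, 16
    shared_weights = np.random.randn(vocab_size, hidden_size) * hidden_size ** -0.5

    def embed(x):
        # x: int [batch, length] -> [batch, length, hidden_size]
        return shared_weights[x]

    def linear(x):
        # x: [batch, length, hidden_size] -> logits [batch, length, vocab_size]
        return x @ shared_weights.T

    tokens = np.random.randint(0, vocab_size, size=(2, 7))
    print(linear(embed(tokens)).shape)   # (2, 7, 100)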
ffn_layer¶
Implementation of fully connected network.
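The Transformer feed-forward network is typically two dense layers with a ReLU in between, applied position-wise. A hedged NumPy sketch of that shape-preserving block, assuming the usual hidden/filter layout (sizes and names are illustrative):
    import numpy as np

    hidden_size, filter_size = 16, 64
    w1 = np.random.randn(hidden_size, filter_size) * 0.02
    w2 = np.random.randn(filter_size, hidden_size) * 0.02

    def ffn(x):
        # x: [batch, length, hidden_size] -> same shape
        return np.maximum(x @ w1, 0.0) @ w2   # dense -> ReLU -> dense

    print(ffn(np.random.randn(2, 5, hidden_size)).shape)   # (2, 5, 16)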
utils¶
Transformer model helper methods.
-
parts.transformer.utils.
get_decoder_self_attention_bias
(length, dtype=tf.float32)[source]¶ Calculate bias for decoder that maintains model’s autoregressive property.
Creates a tensor that masks out locations that correspond to illegal connections, so prediction at position i cannot draw information from future positions.
Parameters: length – int length of sequences in batch. Returns: float tensor of shape [1, 1, length, length]
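The effect can be reproduced with a strictly upper-triangular mask scaled by a large negative value, so position i cannot attend to positions greater than i. A NumPy sketch (the -1e9 constant follows the padding-bias convention used elsewhere in this module):
    import numpy as np

    def decoder_self_attention_bias(length, neg_inf=-1e9):
        # Strictly upper-triangular entries (future positions) get a large negative bias.
        future_mask = np.triu(np.ones((length, length)), k=1)
        return (future_mask * neg_inf).reshape(1, 1, length, length)

    print(decoder_self_attention_bias(4)[0, 0])
    # row i is 0 up to column i and -1e9 afterwards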
-
parts.transformer.utils.
get_padding
(x, padding_value=0, dtype=tf.float32)[source]¶ Return float tensor representing the padding values in x.
Parameters: - x – int tensor with any shape
- padding_value – int value that denotes padding in x
- dtype – type of the output
Returns: float tensor with the same shape as x containing values 0 or 1 (0 -> non-padding, 1 -> padding)
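An equivalent NumPy one-liner for the described behavior (illustrative sketch):
    import numpy as np

    def get_padding(x, padding_value=0):
        # 1.0 where x equals the padding symbol, 0.0 elsewhere.
        return (x == padding_value).astype(np.float32)

    x = np.array([[7, 3, 0, 0], [5, 0, 0, 0]])
    print(get_padding(x))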
-
parts.transformer.utils.
get_padding_bias
(x, res_rank=4, pad_sym=0, dtype=tf.float32)[source]¶ Calculate bias tensor from padding values in tensor.
Creates a bias tensor that is added to the pre-softmax multi-headed attention logits, which have shape [batch_size, num_heads, length, length]. The tensor is zero at non-padding locations and -1e9 (effectively negative infinity) at padding locations.
Parameters: - x – int tensor with shape [batch_size, length]
- res_rank – int indicates the rank of attention_bias.
- dtype – type of the output attention_bias
- pad_sym – int the symbol used for padding
Returns: Attention bias tensor of shape [batch_size, 1, 1, length] if res_rank = 4 (for Transformer), or [batch_size, 1, length] if res_rank = 3 (for ConvS2S)
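A sketch of the res_rank = 4 (Transformer) case, building on the padding indicator above; this is illustrative, and the module's own broadcasting may differ in detail:
    import numpy as np

    def get_padding_bias(x, pad_sym=0, neg_inf=-1e9):
        padding = (x == pad_sym).astype(np.float32)   # [batch, length]
        bias = padding * neg_inf
        return bias[:, np.newaxis, np.newaxis, :]     # [batch, 1, 1, length]

    x = np.array([[7, 3, 0, 0]])
    print(get_padding_bias(x).shape)   # (1, 1, 1, 4)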
-
parts.transformer.utils.
get_position_encoding
(length, hidden_size, min_timescale=1.0, max_timescale=10000.0)[source]¶ Return positional encoding.
Calculates the position encoding as a mix of sine and cosine functions with geometrically increasing wavelengths, as defined in “Attention Is All You Need”, section 3.5.
Parameters: - length – Sequence length.
- hidden_size – Size of the hidden dimension of the encoding
- min_timescale – Minimum scale that will be applied at each position
- max_timescale – Maximum scale that will be applied at each position
Returns: Tensor with shape [length, hidden_size]
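A NumPy version of the sinusoidal encoding described above, using geometric timescales between min_timescale and max_timescale; this mirrors the documented output shape and is a sketch rather than the module's TensorFlow code:
    import numpy as np

    def get_position_encoding(length, hidden_size,
                              min_timescale=1.0, max_timescale=1.0e4):
        position = np.arange(length, dtype=np.float32)
        num_timescales = hidden_size // 2
        log_timescale_increment = (
            np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1))
        inv_timescales = min_timescale * np.exp(
            np.arange(num_timescales, dtype=np.float32) * -log_timescale_increment)
        scaled_time = position[:, np.newaxis] * inv_timescales[np.newaxis, :]
        # First half sines, second half cosines -> [length, hidden_size]
        return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

    print(get_position_encoding(10, 16).shape)   # (10, 16)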