transformer¶
attention_layer¶
Implementation of multiheaded attention and self-attention layers.
-
class
parts.transformer.attention_layer.
Attention
(hidden_size, num_heads, attention_dropout, train, mode='loung', regularizer=None, window_size=None, back_step_size=None)[source]¶ Bases:
tensorflow.python.layers.base.Layer
Multi-headed attention layer.
-
call
(x, y, bias, cache=None, positions=None)[source]¶ Apply attention mechanism to x and y.
Parameters: - x – a tensor with shape [batch_size, length_x, hidden_size]
- y – a tensor with shape [batch_size, length_y, hidden_size]
- bias – attention bias that will be added to the result of the dot product.
- cache –
(Used during prediction) dictionary with tensors containing results of previous attentions. The dictionary must have the items:
- {“k”: tensor with shape [batch_size, i, key_channels],
- ”v”: tensor with shape [batch_size, i, value_channels]}
where i is the current decoded length.
- positions – decoder-encoder alignment for previous steps [batch_size, n_heads, length_x]
Returns: Attention layer output with shape [batch_size, length_x, hidden_size]
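The shapes above can be traced with a small NumPy sketch of the underlying scaled dot-product attention. This is an illustration only, not the layer's actual TensorFlow code; the helper names and the per-head depth scaling are assumptions following the standard Transformer formulation.
    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def dot_product_attention(q, k, v, bias):
        # q: [batch, heads, length_x, depth]; k, v: [batch, heads, length_y, depth]
        # bias: broadcastable to [batch, heads, length_x, length_y]
        depth = q.shape[-1]
        logits = q @ k.transpose(0, 1, 3, 2) / np.sqrt(depth) + bias
        weights = softmax(logits, axis=-1)   # attention distribution over length_y
        return weights @ v                   # [batch, heads, length_x, depth]

    # Shape check: batch=2, heads=4, length_x=3, length_y=5, depth=8
    q = np.random.randn(2, 4, 3, 8)
    k = np.random.randn(2, 4, 5, 8)
    v = np.random.randn(2, 4, 5, 8)
    bias = np.zeros((2, 1, 1, 5))            # e.g. a padding bias, broadcast over heads and queries
    print(dot_product_attention(q, k, v, bias).shape)   # (2, 4, 3, 8)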
-
combine_heads
(x)[source]¶ Combine a tensor that has been split into heads.
Parameters: x – A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
Returns: A tensor with shape [batch_size, length, hidden_size]
-
split_heads
(x)[source]¶ Split x into different heads, and transpose the resulting value.
The tensor is transposed to ensure that the inner dimensions hold the correct values during the matrix multiplication.
Parameters: x – A tensor with shape [batch_size, length, hidden_size]
Returns: A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
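A NumPy sketch of the reshape/transpose pattern these two methods describe (the helper names below are illustrative, not the layer's own code):
    import numpy as np

    def split_heads(x, num_heads):
        # [batch, length, hidden] -> [batch, num_heads, length, hidden // num_heads]
        batch, length, hidden = x.shape
        x = x.reshape(batch, length, num_heads, hidden // num_heads)
        return x.transpose(0, 2, 1, 3)

    def combine_heads(x):
        # [batch, num_heads, length, hidden // num_heads] -> [batch, length, hidden]
        batch, num_heads, length, depth = x.shape
        return x.transpose(0, 2, 1, 3).reshape(batch, length, num_heads * depth)

    x = np.random.randn(2, 7, 16)
    assert combine_heads(split_heads(x, num_heads=4)).shape == x.shape
    assert np.allclose(combine_heads(split_heads(x, num_heads=4)), x)   # round trip is lossless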
-
-
class
parts.transformer.attention_layer.
SelfAttention
(hidden_size, num_heads, attention_dropout, train, mode='loung', regularizer=None, window_size=None, back_step_size=None)[source]¶ Bases:
parts.transformer.attention_layer.Attention
Multiheaded self-attention layer.
-
call
(x, bias, cache=None)[source]¶ Apply the attention mechanism to x (self-attention: queries, keys, and values are all derived from x).
Parameters: - x – a tensor with shape [batch_size, length_x, hidden_size]
- bias – attention bias that will be added to the result of the dot product.
- cache –
(Used during prediction) dictionary with tensors containing results of previous attentions. The dictionary must have the items:
- {“k”: tensor with shape [batch_size, i, key_channels],
- ”v”: tensor with shape [batch_size, i, value_channels]}
where i is the current decoded length.
Returns: Attention layer output with shape [batch_size, length_x, hidden_size]
-
beam_search¶
Beam search to find the translated sequence with the highest probability.
Source implementation from Tensor2Tensor: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/beam_search.py
-
class
parts.transformer.beam_search.
SequenceBeamSearch
(symbols_to_logits_fn, vocab_size, batch_size, beam_size, alpha, max_decode_length, eos_id)[source]¶ Bases:
object
Implementation of beam search loop.
-
_continue_search
(state)[source]¶ Return whether to continue the search loop.
- The loop should terminate when:
- the maximum decode length has been reached, or
- the worst score among the finished sequences is better than the best score among the alive sequences (i.e. the finished sequences are provably unchanging)
Parameters: state – A dictionary with the current loop state. Returns: Bool tensor with value True if loop should continue, False if loop should terminate.
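A plain-Python sketch of the two stopping conditions described above (the argument names are illustrative, not the class's internal state keys):
    def should_continue(cur_index, max_decode_length,
                        best_alive_score, worst_finished_score):
        # Stop once the maximum decode length is reached...
        within_length = cur_index < max_decode_length
        # ...or once no alive sequence can still beat the worst finished one.
        can_improve = best_alive_score > worst_finished_score
        return within_length and can_improve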
-
_create_initial_state
(initial_ids, initial_cache)[source]¶ Return initial state dictionary and its shape invariants.
Parameters: - initial_ids – initial ids to pass into the symbols_to_logits_fn. int tensor with shape [batch_size, 1]
- initial_cache – dictionary storing values to be passed into the symbols_to_logits_fn.
Returns: state and shape invariant dictionaries with keys from _StateKeys
-
_get_new_alive_state
(new_seq, new_log_probs, new_cache)[source]¶ Gather the top k sequences that are still alive.
Parameters: - new_seq – New sequences generated by growing the current alive sequences int32 tensor with shape [batch_size, 2 * beam_size, cur_index + 1]
- new_log_probs – Log probabilities of new sequences float32 tensor with shape [batch_size, beam_size]
- new_cache – Dict of cached values for each sequence.
Returns: - Top beam_size sequences that are still alive (don’t end with eos_id),
log probabilities of the top alive sequences, and a dict cache storing decoder states for the top alive sequences
Return type: Dictionary with alive keys from _StateKeys
-
_get_new_finished_state
(state, new_seq, new_log_probs)[source]¶ Combine new and old finished sequences, and gather the top k sequences.
Parameters: - state – A dictionary with the current loop state.
- new_seq – New sequences generated by growing the current alive sequences int32 tensor with shape [batch_size, beam_size, i + 1]
- new_log_probs – Log probabilities of new sequences float32 tensor with shape [batch_size, beam_size]
Returns: - Top beam_size finished sequences based on score,
scores of the finished sequences, and finished flags of the finished sequences
Return type: Dictionary with finished keys from _StateKeys
-
_grow_alive_seq
(state)[source]¶ Grow alive sequences by one token, and collect top 2*beam_size sequences.
2*beam_size sequences are collected because some sequences may have reached the EOS token. 2*beam_size ensures that at least beam_size sequences are still alive.
Parameters: state – A dictionary with the current loop state.
Returns: Tuple of
- Top 2*beam_size sequences [batch_size, 2 * beam_size, cur_index + 1]
- Scores of returned sequences [batch_size, 2 * beam_size]
- New alive cache, for each of the 2 * beam_size sequences
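A NumPy sketch of why 2*beam_size candidates are kept: after expanding each alive beam over the vocabulary, the top 2*beam_size continuations are selected, so even if up to beam_size of them end in EOS, at least beam_size alive sequences remain. Shapes follow the docstring; the variable names are illustrative, not the class internals.
    import numpy as np

    batch_size, beam_size, vocab_size = 2, 3, 11
    # Log probabilities of every candidate continuation: [batch, beam, vocab]
    cand_log_probs = np.log(
        np.random.dirichlet(np.ones(vocab_size), size=(batch_size, beam_size)))

    flat = cand_log_probs.reshape(batch_size, beam_size * vocab_size)
    topk = np.argsort(flat, axis=1)[:, ::-1][:, :2 * beam_size]   # top 2*beam_size candidates

    beam_indices = topk // vocab_size   # which alive beam each candidate grew from
    token_ids = topk % vocab_size       # which token was appended
    print(beam_indices.shape, token_ids.shape)   # (2, 6) (2, 6)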
-
_search_step
(state)[source]¶ Beam search loop body.
Grow alive sequences by a single ID. Sequences that have reached the EOS token are marked as finished. The alive and finished sequences with the highest log probabilities and scores are returned.
A sequence’s finished score is calculated by dividing the log probability by the length normalization factor. Without length normalization, the search is more likely to return shorter sequences.
Parameters: state – A dictionary with the current loop state. Returns: new state dictionary.
-
-
class
parts.transformer.beam_search.
_StateKeys
[source]¶ Bases:
object
Keys to dictionary storing the state of the beam search loop.
-
ALIVE_CACHE
= 'ALIVE_CACHE'¶
-
ALIVE_LOG_PROBS
= 'ALIVE_LOG_PROBS'¶
-
ALIVE_SEQ
= 'ALIVE_SEQ'¶
-
CUR_INDEX
= 'CUR_INDEX'¶
-
FINISHED_FLAGS
= 'FINISHED_FLAGS'¶
-
FINISHED_SCORES
= 'FINISHED_SCORES'¶
-
FINISHED_SEQ
= 'FINISHED_SEQ'¶
-
-
parts.transformer.beam_search.
_expand_to_beam_size
(tensor, beam_size)[source]¶ Tiles a given tensor by beam_size.
Parameters: - tensor – tensor to tile [batch_size, …]
- beam_size – How much to tile the tensor by.
Returns: Tiled tensor [batch_size, beam_size, …]
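A NumPy equivalent of this tiling (illustrative sketch, not the module's TensorFlow code):
    import numpy as np

    def expand_to_beam_size(tensor, beam_size):
        # Insert a beam axis after the batch axis and tile along it.
        tiled = np.expand_dims(tensor, axis=1)
        reps = [1, beam_size] + [1] * (tensor.ndim - 1)
        return np.tile(tiled, reps)

    x = np.random.randn(2, 5, 4)                      # [batch_size, ...]
    print(expand_to_beam_size(x, beam_size=3).shape)  # (2, 3, 5, 4)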
-
parts.transformer.beam_search.
_flatten_beam_dim
(tensor)[source]¶ Reshapes the first two dimensions into a single dimension.
Parameters: tensor – Tensor to reshape of shape [A, B, …] Returns: Reshaped tensor of shape [A*B, …]
-
parts.transformer.beam_search.
_gather_beams
(nested, beam_indices, batch_size, new_beam_size)[source]¶ Gather beams from nested structure of tensors.
Each tensor in nested represents a batch of beams, where beam refers to a single search state (beam search involves searching through multiple states in parallel).
This function is used to gather the top beams, specified by beam_indices, from the nested tensors.
Parameters: - nested – Nested structure (tensor, list, tuple or dict) containing tensors with shape [batch_size, beam_size, …].
- beam_indices – int32 tensor with shape [batch_size, new_beam_size]. Each value in beam_indices must be in [0, beam_size); values are not necessarily unique.
- batch_size – int size of batch
- new_beam_size – int number of beams to be pulled from the nested tensors.
Returns: Nested structure containing tensors with shape [batch_size, new_beam_size, …]
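The gather can be pictured with NumPy fancy indexing on a single tensor; the real function additionally walks nested structures of tensors, so this is only a sketch of the per-tensor operation.
    import numpy as np

    def gather_beams(tensor, beam_indices):
        # tensor: [batch_size, beam_size, ...]; beam_indices: [batch_size, new_beam_size]
        batch_size = tensor.shape[0]
        return tensor[np.arange(batch_size)[:, None], beam_indices]

    t = np.random.randn(2, 4, 6)        # [batch, beam, hidden]
    idx = np.array([[3, 0], [1, 1]])    # top-2 beams per batch item (values may repeat)
    print(gather_beams(t, idx).shape)   # (2, 2, 6)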
-
parts.transformer.beam_search.
_gather_topk_beams
(nested, score_or_log_prob, batch_size, beam_size)[source]¶ Gather top beams from nested structure.
-
parts.transformer.beam_search.
_length_normalization
(alpha, length)[source]¶ Return length normalization factor.
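The Tensor2Tensor reference this module follows uses the GNMT-style factor ((5 + length) / 6) ** alpha, and a finished sequence's score is its log probability divided by this factor. A hedged sketch, assuming that formula holds here as well:
    def length_normalization(alpha, length):
        # GNMT-style length penalty; alpha = 0 disables normalization.
        return ((5.0 + length) / 6.0) ** alpha

    log_prob = -12.0
    score = log_prob / length_normalization(alpha=0.6, length=10)
    print(score)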
-
parts.transformer.beam_search.
_shape_list
(tensor)[source]¶ Return a list of the tensor’s shape, and ensure no None values in list.
-
parts.transformer.beam_search.
_unflatten_beam_dim
(tensor, batch_size, beam_size)[source]¶ Reshapes first dimension back to [batch_size, beam_size].
Parameters: - tensor – Tensor to reshape of shape [batch_size*beam_size, …]
- batch_size – Tensor, original batch size.
- beam_size – int, original beam size.
Returns: Reshaped tensor of shape [batch_size, beam_size, …]
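A NumPy sketch of the flatten/unflatten round trip used to run the model on [batch_size * beam_size, …] inputs and then restore the beam axis (illustrative helpers, not the module's code):
    import numpy as np

    def flatten_beam_dim(tensor):
        shape = tensor.shape
        return tensor.reshape((shape[0] * shape[1],) + shape[2:])

    def unflatten_beam_dim(tensor, batch_size, beam_size):
        return tensor.reshape((batch_size, beam_size) + tensor.shape[1:])

    x = np.random.randn(2, 3, 7)     # [batch, beam, ...]
    flat = flatten_beam_dim(x)       # shape (6, 7)
    assert np.array_equal(unflatten_beam_dim(flat, 2, 3), x)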
-
parts.transformer.beam_search.
sequence_beam_search
(symbols_to_logits_fn, initial_ids, initial_cache, vocab_size, beam_size, alpha, max_decode_length, eos_id)[source]¶ Search for sequence of subtoken ids with the largest probability.
Parameters: - symbols_to_logits_fn –
A function that takes in ids, index, and cache as arguments. The passed-in arguments will have shape:
- ids -> [batch_size * beam_size, index]
- index -> [] (scalar)
- cache -> nested dictionary of tensors [batch_size * beam_size, …]
The function must return logits and new cache:
- logits -> [batch_size * beam_size, vocab_size]
- new cache -> same shape/structure as the input cache
- initial_ids – Starting ids for each batch item. int32 tensor with shape [batch_size]
- initial_cache – dict containing starting decoder variables information
- vocab_size – int size of the vocabulary (number of tokens)
- beam_size – int number of beams
- alpha – float defining the strength of length normalization
- max_decode_length – maximum length of the decoded sequences
- eos_id – int id of eos token, used to determine when a sequence has finished
Returns: Top decoded sequences [batch_size, beam_size, max_decode_length] and sequence scores [batch_size, beam_size]
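The contract of symbols_to_logits_fn can be illustrated with a toy stand-in. This is a shape-only sketch using NumPy arrays; the real callable receives and returns TensorFlow tensors and a nested cache, and the vocabulary size here is an arbitrary assumption.
    import numpy as np

    VOCAB_SIZE = 8

    def toy_symbols_to_logits_fn(ids, index, cache):
        # ids: [batch_size * beam_size, index] per the docstring above
        # index: scalar decode step; cache: nested dict of [batch_size * beam_size, ...] arrays
        batch_times_beam = ids.shape[0]
        logits = np.random.randn(batch_times_beam, VOCAB_SIZE)   # [batch*beam, vocab]
        return logits, cache   # cache must keep the same structure/shapes

    ids = np.zeros((4 * 2, 1), dtype=np.int32)   # batch_size=4, beam_size=2
    logits, cache = toy_symbols_to_logits_fn(ids, index=1, cache={})
    print(logits.shape)                          # (8, 8)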
common¶
-
class
parts.transformer.common.
LayerNormalization
(hidden_size, params={})[source]¶ Bases:
tensorflow.python.layers.base.Layer
Layer normalization for BTC format: supports L2 (default) and L1 modes
-
class
parts.transformer.common.
PrePostProcessingWrapper
(layer, params, training)[source]¶ Bases:
object
Wrapper around a layer that applies pre-processing before the layer and post-processing after it.
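In the standard Transformer layout this wrapper applies layer normalization before the wrapped layer and dropout plus a residual connection after it. A minimal NumPy sketch of that pattern, assuming this is the pre/post-processing meant here; dropout and the learned scale/bias of layer normalization are omitted, and only the L2 variant is shown.
    import numpy as np

    def layer_norm(x, epsilon=1e-6):
        # Normalize over the channel (last) axis of a [batch, time, channels] tensor.
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mean) / np.sqrt(var + epsilon)

    def pre_post_process(x, layer_fn):
        # Pre-process: layer norm.  Post-process: residual connection (dropout omitted).
        return x + layer_fn(layer_norm(x))

    x = np.random.randn(2, 5, 16)
    out = pre_post_process(x, layer_fn=lambda y: y * 0.5)   # toy stand-in for a sublayer
    print(out.shape)                                        # (2, 5, 16)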
embedding_layer¶
Implementation of embedding layer with shared weights.
Bases:
tensorflow.python.layers.base.Layer
Calculates input embeddings and pre-softmax linear with shared weights.
Creates the variables of the layer.
Get token embeddings of x.
Parameters: x – An int64 tensor with shape [batch_size, length]
Returns: embeddings – float32 tensor with shape [batch_size, length, embedding_size]; padding – float32 tensor with shape [batch_size, length] indicating the locations of the padding tokens in x.
Computes logits by running x through a linear layer.
Parameters: x – A float32 tensor with shape [batch_size, length, hidden_size] Returns: float32 tensor with shape [batch_size, length, vocab_size].
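Weight sharing means the same embedding matrix is used both for the input lookup and, transposed, for the pre-softmax projection. An illustrative NumPy sketch (variable and function names are assumptions):
    import numpy as np

    vocab_size, hidden_size = 100, 16
    shared_weights = np.random.randn(vocab_size, hidden_size) * hidden_size ** -0.5

    def embed(x):
        # x: int [batch, length] -> [batch, length, hidden_size]
        return shared_weights[x]

    def linear(x):
        # x: [batch, length, hidden_size] -> logits [batch, length, vocab_size]
        return x @ shared_weights.T

    tokens = np.random.randint(0, vocab_size, size=(2, 7))
    print(linear(embed(tokens)).shape)   # (2, 7, 100)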
ffn_layer¶
Implementation of fully connected network.
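The Transformer feed-forward network is typically two dense layers with a ReLU in between, applied position-wise. A hedged NumPy sketch of that shape-preserving block, assuming the usual hidden/filter layout (sizes and names are illustrative):
    import numpy as np

    hidden_size, filter_size = 16, 64
    w1 = np.random.randn(hidden_size, filter_size) * 0.02
    w2 = np.random.randn(filter_size, hidden_size) * 0.02

    def ffn(x):
        # x: [batch, length, hidden_size] -> same shape
        return np.maximum(x @ w1, 0.0) @ w2   # dense -> ReLU -> dense

    print(ffn(np.random.randn(2, 5, hidden_size)).shape)   # (2, 5, 16)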
utils¶
Transformer model helper methods.
-
parts.transformer.utils.
get_decoder_self_attention_bias
(length, dtype=tf.float32)[source]¶ Calculate bias for decoder that maintains model’s autoregressive property.
Creates a tensor that masks out locations that correspond to illegal connections, so prediction at position i cannot draw information from future positions.
Parameters: length – int length of sequences in batch. Returns: float tensor of shape [1, 1, length, length]
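The effect can be reproduced with a strictly upper-triangular mask scaled by a large negative value, so position i cannot attend to positions greater than i. A NumPy sketch (the -1e9 constant follows the padding-bias convention used elsewhere in this module):
    import numpy as np

    def decoder_self_attention_bias(length, neg_inf=-1e9):
        # Strictly upper-triangular entries (future positions) get a large negative bias.
        future_mask = np.triu(np.ones((length, length)), k=1)
        return (future_mask * neg_inf).reshape(1, 1, length, length)

    print(decoder_self_attention_bias(4)[0, 0])
    # row i is 0 up to column i and -1e9 afterwards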
-
parts.transformer.utils.
get_padding
(x, padding_value=0, dtype=tf.float32)[source]¶ Return float tensor representing the padding values in x.
Parameters: - x – int tensor with any shape
- padding_value – int value that denotes padding in x
- dtype – type of the output
Returns: float tensor with the same shape as x containing values 0 or 1 (0 -> non-padding, 1 -> padding)
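An equivalent NumPy one-liner for the described behavior (illustrative sketch):
    import numpy as np

    def get_padding(x, padding_value=0):
        # 1.0 where x equals the padding symbol, 0.0 elsewhere.
        return (x == padding_value).astype(np.float32)

    x = np.array([[7, 3, 0, 0], [5, 0, 0, 0]])
    print(get_padding(x))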
-
parts.transformer.utils.
get_padding_bias
(x, res_rank=4, pad_sym=0, dtype=tf.float32)[source]¶ Calculate bias tensor from padding values in tensor.
Creates a bias tensor that is added to the pre-softmax multi-headed attention logits, which have shape [batch_size, num_heads, length, length]. The tensor is zero at non-padding locations and -1e9 (effectively negative infinity) at padding locations.
Parameters: - x – int tensor with shape [batch_size, length]
- res_rank – int indicates the rank of attention_bias.
- dtype – type of the output attention_bias
- pad_sym – int the symbol used for padding
Returns: Attention bias tensor of shape [batch_size, 1, 1, length] if res_rank = 4 (for Transformer), or [batch_size, 1, length] if res_rank = 3 (for ConvS2S)
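A sketch of the res_rank = 4 (Transformer) case, building on the padding indicator above; this is illustrative, and the module's own broadcasting may differ in detail:
    import numpy as np

    def get_padding_bias(x, pad_sym=0, neg_inf=-1e9):
        padding = (x == pad_sym).astype(np.float32)   # [batch, length]
        bias = padding * neg_inf
        return bias[:, np.newaxis, np.newaxis, :]     # [batch, 1, 1, length]

    x = np.array([[7, 3, 0, 0]])
    print(get_padding_bias(x).shape)   # (1, 1, 1, 4)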
-
parts.transformer.utils.
get_position_encoding
(length, hidden_size, min_timescale=1.0, max_timescale=10000.0)[source]¶ Return positional encoding.
Calculates the position encoding as a mix of sine and cosine functions with geometrically increasing wavelengths, as defined in “Attention Is All You Need”, section 3.5.
Parameters: - length – Sequence length.
- hidden_size – Size of the hidden dimension of the encoding
- min_timescale – Minimum scale that will be applied at each position
- max_timescale – Maximum scale that will be applied at each position
Returns: Tensor with shape [length, hidden_size]
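A NumPy version of the sinusoidal encoding described above, using geometric timescales between min_timescale and max_timescale; this mirrors the documented output shape and is a sketch rather than the module's TensorFlow code:
    import numpy as np

    def get_position_encoding(length, hidden_size,
                              min_timescale=1.0, max_timescale=1.0e4):
        position = np.arange(length, dtype=np.float32)
        num_timescales = hidden_size // 2
        log_timescale_increment = (
            np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1))
        inv_timescales = min_timescale * np.exp(
            np.arange(num_timescales, dtype=np.float32) * -log_timescale_increment)
        scaled_time = position[:, np.newaxis] * inv_timescales[np.newaxis, :]
        # First half sines, second half cosines -> [length, hidden_size]
        return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)

    print(get_position_encoding(10, 16).shape)   # (10, 16)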