transformer

attention_layer

Implementation of multiheaded attention and self-attention layers.

class parts.transformer.attention_layer.Attention(hidden_size, num_heads, attention_dropout, train, mode='loung', regularizer=None, window_size=None, back_step_size=None)[source]

Bases: tensorflow.python.layers.base.Layer

Multi-headed attention layer.

call(x, y, bias, cache=None, positions=None)[source]

Apply attention mechanism to x and y.

Parameters:
  • x – a tensor with shape [batch_size, length_x, hidden_size]
  • y – a tensor with shape [batch_size, length_y, hidden_size]
  • bias – attention bias that will be added to the result of the dot product.
  • cache

    (Used during prediction) dictionary with tensors containing results of previous attentions. The dictionary must have the items:

{"k": tensor with shape [batch_size, i, key_channels],
"v": tensor with shape [batch_size, i, value_channels]}

    where i is the current decoded length.

  • positions – decoder-encoder alignment for previous steps [batch_size, n_heads, length_x]
Returns:

Attention layer output with shape [batch_size, length_x, hidden_size]
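
At its core, call() computes scaled dot-product attention in which bias is added to the query-key logits before the softmax. The NumPy sketch below illustrates that computation on already-projected and already-split heads; it is an illustration of the mechanism with a hypothetical function name, not the layer's internal TensorFlow code.

    import numpy as np

    def scaled_dot_product_attention(q, k, v, bias):
        # q: [batch, heads, length_x, depth]; k, v: [batch, heads, length_y, depth]
        # bias is broadcast against the [batch, heads, length_x, length_y] logits.
        logits = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
        logits = logits + bias                                     # e.g. -1e9 at masked positions
        weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over length_y
        return weights @ v                                         # [batch, heads, length_x, depth]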

combine_heads(x)[source]

Combine a tensor that has been split into heads.

Parameters:x – A tensor [batch_size, num_heads, length, hidden_size/num_heads]
Returns:A tensor with shape [batch_size, length, hidden_size]
split_heads(x)[source]

Split x into different heads, and transpose the resulting value.

The tensor is transposed to ensure the inner dimensions hold the correct values during the matrix multiplication.

Parameters:x – A tensor with shape [batch_size, length, hidden_size]
Returns:A tensor with shape [batch_size, num_heads, length, hidden_size/num_heads]
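
The head manipulation itself is just a reshape plus a transpose, and the inverse to undo it. A NumPy sketch of the equivalent shape arithmetic (illustrative only; the layer operates on TensorFlow tensors):

    import numpy as np

    def split_heads(x, num_heads):
        # [batch_size, length, hidden_size] -> [batch_size, num_heads, length, hidden_size/num_heads]
        batch_size, length, hidden_size = x.shape
        x = x.reshape(batch_size, length, num_heads, hidden_size // num_heads)
        return x.transpose(0, 2, 1, 3)

    def combine_heads(x):
        # [batch_size, num_heads, length, depth] -> [batch_size, length, num_heads * depth]
        batch_size, num_heads, length, depth = x.shape
        return x.transpose(0, 2, 1, 3).reshape(batch_size, length, num_heads * depth)
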
class parts.transformer.attention_layer.SelfAttention(hidden_size, num_heads, attention_dropout, train, mode='loung', regularizer=None, window_size=None, back_step_size=None)[source]

Bases: parts.transformer.attention_layer.Attention

Multi-headed self-attention layer.

call(x, bias, cache=None)[source]

Apply self-attention to x: the attention mechanism is applied with x serving as both the queries and the keys/values (i.e. y = x).

Parameters:
  • x – a tensor with shape [batch_size, length_x, hidden_size]
  • bias – attention bias that will be added to the result of the dot product.
  • cache

    (Used during prediction) dictionary with tensors containing results of previous attentions. The dictionary must have the items:

    {"k": tensor with shape [batch_size, i, key_channels],
    "v": tensor with shape [batch_size, i, value_channels]}

    where i is the current decoded length.

Returns:

Attention layer output with shape [batch_size, length_x, hidden_size]
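
During prediction the cache holds the keys and values of all previously decoded steps, so each new step only projects its own input and concatenates the result onto cache["k"] and cache["v"]. Below is a minimal single-head NumPy sketch of that bookkeeping; the weight names w_q, w_k, w_v are hypothetical, and the real layer is multi-headed with dropout and bias handling omitted here.

    import numpy as np

    def cached_self_attention_step(x_t, cache, w_q, w_k, w_v):
        # x_t: [batch, 1, hidden]; cache["k"], cache["v"]: [batch, i-1, depth] from earlier steps.
        q = x_t @ w_q
        cache["k"] = np.concatenate([cache["k"], x_t @ w_k], axis=1)   # now [batch, i, depth]
        cache["v"] = np.concatenate([cache["v"], x_t @ w_v], axis=1)
        logits = q @ cache["k"].transpose(0, 2, 1) / np.sqrt(q.shape[-1])
        weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        return weights @ cache["v"]                                    # [batch, 1, depth]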

common

class parts.transformer.common.LayerNormalization(hidden_size, params={})[source]

Bases: tensorflow.python.layers.base.Layer

Layer normalization for BTC format: supports L2 (default) and L1 modes.

build(_)[source]

Creates the variables of the layer.

call(x)[source]

This is where the layer’s logic lives.

Parameters:
  • inputs – Input tensor, or list/tuple of input tensors.
  • **kwargs – Additional keyword arguments.
Returns:

A tensor or list/tuple of tensors.
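
In L2 mode this is ordinary layer normalization (subtract the mean over channels, divide by the standard deviation); L1 mode replaces the standard deviation with the mean absolute deviation. A NumPy sketch under those assumptions for a [batch, time, channels] tensor, with the usual learned scale gamma and offset beta (the layer's exact epsilon and variable handling may differ):

    import numpy as np

    def layer_norm_btc(x, gamma, beta, epsilon=1e-6, norm_type="L2"):
        # Normalizes each [batch, time] position over the channel axis of a BTC tensor.
        mean = x.mean(axis=-1, keepdims=True)
        if norm_type == "L2":   # standard layer norm: divide by the standard deviation
            scale = np.sqrt(((x - mean) ** 2).mean(axis=-1, keepdims=True) + epsilon)
        else:                   # "L1": divide by the mean absolute deviation instead
            scale = np.abs(x - mean).mean(axis=-1, keepdims=True) + epsilon
        return (x - mean) / scale * gamma + beta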

class parts.transformer.common.PrePostProcessingWrapper(layer, params, training)[source]

Bases: object

Wrapper around a layer that applies pre-processing and post-processing.
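
In the reference Transformer this wrapper normalizes the input before the wrapped layer and applies dropout plus a residual connection afterwards. The sketch below shows that common pattern; it is an assumption about this class, not verified from its code.

    def pre_post_process(sublayer, x, layer_norm, dropout):
        # Typical Transformer pattern: pre-process with normalization,
        # post-process with dropout and a residual connection.
        y = sublayer(layer_norm(x))
        return x + dropout(y)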

class parts.transformer.common.Transformer_BatchNorm(training, params={})[source]

Bases: tensorflow.python.layers.base.Layer

Transformer batch norm: supports BTC (default) and BCT formats.

call(x)[source]

This is where the layer’s logic lives.

Parameters:
  • inputs – Input tensor, or list/tuple of input tensors.
  • **kwargs – Additional keyword arguments.
Returns:

A tensor or list/tuple of tensors.
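
A plausible reading of this layer is standard batch normalization applied to sequence data: per-channel statistics computed over the batch and time axes of a BTC tensor. A training-mode NumPy sketch under that assumption, with moving averages and the BCT branch omitted:

    import numpy as np

    def batch_norm_btc(x, gamma, beta, epsilon=1e-3):
        # Per-channel statistics over the batch and time axes of a [batch, time, channels] tensor.
        mean = x.mean(axis=(0, 1), keepdims=True)
        var = x.var(axis=(0, 1), keepdims=True)
        return (x - mean) / np.sqrt(var + epsilon) * gamma + beta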

embedding_layer

Implementation of embedding layer with shared weights.

class parts.transformer.embedding_layer.EmbeddingSharedWeights(vocab_size, hidden_size, pad_vocab_to_eight=False, init_var=None, embed_scale=True, pad_sym=0, mask_paddings=True, regularizer=None)[source]

Bases: tensorflow.python.layers.base.Layer

Calculates input embeddings and the pre-softmax linear projection with shared weights.

build(_)[source]

Creates the variables of the layer.

call(x)[source]

Get token embeddings of x.

Parameters:x – An int64 tensor with shape [batch_size, length]
Returns:
  • embeddings – float32 tensor with shape [batch_size, length, embedding_size]
  • padding – float32 tensor with shape [batch_size, length] indicating the locations of the padding tokens in x
linear(x)[source]

Computes logits by running x through a linear layer.

Parameters:x – A float32 tensor with shape [batch_size, length, hidden_size]
Returns:float32 tensor with shape [batch_size, length, vocab_size].
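
Weight sharing means one [vocab_size, hidden_size] matrix serves both the embedding lookup in call() and, transposed, the pre-softmax projection in linear(). A NumPy sketch of that relationship; the sqrt(hidden_size) scaling reflects the embed_scale option and is the conventional choice rather than something verified from this code:

    import numpy as np

    vocab_size, hidden_size = 100, 16
    shared_weights = np.random.randn(vocab_size, hidden_size).astype(np.float32)

    def embed(ids):
        # call(): token ids [batch, length] -> embeddings [batch, length, hidden_size];
        # with embed_scale=True the embeddings are commonly scaled by sqrt(hidden_size).
        return shared_weights[ids] * np.sqrt(hidden_size)

    def linear(x):
        # linear(): hidden states -> pre-softmax logits using the same (transposed) weight matrix.
        return x @ shared_weights.T   # [batch, length, vocab_size]

    logits = linear(embed(np.array([[3, 7, 0]])))   # -> shape (1, 3, 100)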

ffn_layer

Implementation of fully connected network.

class parts.transformer.ffn_layer.FeedFowardNetwork(hidden_size, filter_size, relu_dropout, train, regularizer=None)[source]

Bases: tensorflow.python.layers.base.Layer

Fully connected feedforward network.

call(x, padding=None)[source]

This is where the layer’s logic lives.

Parameters:
  • inputs – Input tensor, or list/tuple of input tensors.
  • **kwargs – Additional keyword arguments.
Returns:

A tensor or list/tuple of tensors.
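
The standard Transformer feed-forward block is two position-wise dense layers: an expansion to filter_size followed by a ReLU (with relu_dropout during training), then a projection back to hidden_size. A minimal NumPy sketch with hypothetical weight names and dropout omitted:

    import numpy as np

    def feed_forward(x, w1, b1, w2, b2):
        # Position-wise FFN: expand hidden_size -> filter_size with ReLU, then project back.
        h = np.maximum(x @ w1 + b1, 0.0)   # [batch, length, filter_size]
        return h @ w2 + b2                 # [batch, length, hidden_size]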

utils

Transformer model helper methods.

parts.transformer.utils.get_decoder_self_attention_bias(length, dtype=tf.float32)[source]

Calculate bias for decoder that maintains model’s autoregressive property.

Creates a tensor that masks out locations that correspond to illegal connections, so prediction at position i cannot draw information from future positions.

Parameters:length – int length of sequences in batch.
Returns:float tensor of shape [1, 1, length, length]
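
Equivalently, the bias is a strictly upper-triangular matrix of large negative values, broadcast over batch and heads, so that position i can only attend to positions j <= i. A NumPy sketch of that construction (illustrative; the function itself returns a TensorFlow tensor):

    import numpy as np

    def decoder_self_attention_bias(length, neg_inf=-1e9):
        # Strictly upper-triangular positions (j > i) are "future" connections and get a
        # large negative bias so the softmax assigns them near-zero weight.
        future_mask = np.triu(np.ones((length, length), dtype=np.float32), k=1)
        return (future_mask * neg_inf)[np.newaxis, np.newaxis, :, :]   # [1, 1, length, length]

    decoder_self_attention_bias(3)[0, 0]
    # [[ 0., -1e9, -1e9],
    #  [ 0.,   0., -1e9],
    #  [ 0.,   0.,   0.]]
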
parts.transformer.utils.get_padding(x, padding_value=0, dtype=tf.float32)[source]

Return float tensor representing the padding values in x.

Parameters:
  • x – int tensor with any shape
  • padding_value – int value that marks padding positions in x
  • dtype – type of the output
Returns:

float tensor with same shape as x containing values 0 or 1.

0 -> non-padding, 1 -> padding
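
In effect this is an elementwise equality test against the padding symbol, cast to the requested dtype. A NumPy sketch of the same behaviour:

    import numpy as np

    def get_padding(x, padding_value=0, dtype=np.float32):
        # 1 where x equals the padding symbol, 0 everywhere else.
        return (x == padding_value).astype(dtype)

    get_padding(np.array([[7, 4, 0, 0]]))   # -> [[0., 0., 1., 1.]]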

parts.transformer.utils.get_padding_bias(x, res_rank=4, pad_sym=0, dtype=tf.float32)[source]

Calculate bias tensor from padding values in tensor.

Computes a bias tensor that is added to the pre-softmax multi-headed attention logits (which have shape [batch_size, num_heads, length, length]). The bias is zero at non-padding locations and -1e9 (effectively negative infinity) at padding locations.

Parameters:
  • x – int tensor with shape [batch_size, length]
  • res_rank – int indicates the rank of attention_bias.
  • dtype – type of the output attention_bias
  • pad_sym – int the symbol used for padding
Returns:

Attention bias tensor of shape [batch_size, 1, 1, length] if res_rank = 4 (for Transformer), or of shape [batch_size, 1, length] if res_rank = 3 (for ConvS2S)
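
Conceptually this combines get_padding with the -1e9 scaling and reshapes the result so it broadcasts against the attention logits. A NumPy sketch under that reading (illustrative; the function itself returns a TensorFlow tensor):

    import numpy as np

    def get_padding_bias(x, res_rank=4, pad_sym=0, neg_inf=-1e9):
        padding = (x == pad_sym).astype(np.float32)    # [batch_size, length]
        bias = padding * neg_inf
        if res_rank == 4:
            return bias[:, np.newaxis, np.newaxis, :]  # [batch_size, 1, 1, length] (Transformer)
        return bias[:, np.newaxis, :]                  # [batch_size, 1, length]    (ConvS2S)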

parts.transformer.utils.get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=10000.0)[source]

Return positional encoding.

Calculates the position encoding as a mix of sine and cosine functions with geometrically increasing wavelengths. Defined and formalized in "Attention Is All You Need", Section 3.5.

Parameters:
  • length – Sequence length.
  • hidden_size – Size of the hidden (encoding) dimension; the last dimension of the returned tensor
  • min_timescale – Minimum scale that will be applied at each position
  • max_timescale – Maximum scale that will be applied at each position
Returns:

Tensor with shape [length, hidden_size]
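
The standard sinusoidal scheme assigns sines to the first half of the channels and cosines to the second half, with timescales spaced geometrically between min_timescale and max_timescale. A NumPy sketch of that construction (illustrative; the function itself returns a TensorFlow tensor):

    import numpy as np

    def get_position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
        position = np.arange(length, dtype=np.float32)
        num_timescales = hidden_size // 2
        log_timescale_increment = (np.log(max_timescale / min_timescale)
                                   / max(num_timescales - 1, 1))
        inv_timescales = min_timescale * np.exp(
            -np.arange(num_timescales, dtype=np.float32) * log_timescale_increment)
        scaled_time = position[:, np.newaxis] * inv_timescales[np.newaxis, :]
        # First half sine, second half cosine -> [length, hidden_size]
        return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)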