Neural Machine Translation with Transformers

Gal Hever
11 min read · Apr 17, 2020


Introduction

This blog-post reviews, step by step, the Transformer approach that was presented for the first time in the “Attention Is All You Need” paper. The paper was written by a Google research team, and its model became the new state of the art for machine translation quality. Let’s start with a few basic questions: “What are Transformers?”, “Why do they perform better than RNN-based architectures for machine translation problems?” and “What does Seq2Seq mean?”

Machine Translation & Sequence-to-Sequence

A sequence-to-sequence (Seq2Seq) task takes a certain sequence (e.g., words, genes, etc.) as input and produces another sequence as output. An example of such a problem is machine translation, which gets a sequence of words in English and translates it into a sequence of Hebrew words. Some other examples are question answering, part-of-speech tagging, etc. There are a few families of models that solve the translation problem, e.g., RNNs, ConvS2S and Transformer-based models.

Transformers rather than RNNs for Seq2Seq

In the last few years, RNN-based architectures have shown the best performance on machine translation problems, but they still have some issues that had to be solved. First, they have difficulty coping with long-range dependencies (LSTMs as well, when they have to deal with really long sentences). Second, each hidden state depends on the previous one, which makes the computation impossible to parallelize and therefore inefficient on GPUs.

Those models are composed of two parts: the encoder and the decoder. The encoder compresses all the semantics of the sequence into a single vector and passes it to the decoder. The decoder processes that information and makes the predictions. The problem is that the decoder needs different information at different time-steps, yet it gets all the information about the dependencies inside the sequence in one processed vector. This makes it really difficult for the decoder to crack and extract the hidden information, which is the core problem and the basic motivation behind the attention mechanism that was added later on. The attention mechanism enables the decoder to look back over the whole input sequence and selectively extract the information it needs during processing.

LSTMs helped to deal with long sequences, but they still failed to maintain the global information of the source sentence. For instance, in the sentence “I like dancing more than swimming”, the word “like” corresponds to the word “dancing”. This relation has to be kept in the long-term memory while reading the source sentence and used when generating the target sentence. In contrast, after the model translates the word “dancing”, it needs to know what “dancing” is being compared with, but no longer needs to remember the word “dancing” itself.

Another issue is that those networks (RNN, LSTM, GRU) process the sequence of symbol representations token by token, separately, and do not work on the entire sequence in one go.

What is a Transformer?

The new model that was proposed to solve the machine translation problem is the Transformer, which relies mostly on the attention mechanism to draw the dependencies. Compared to the previous architectures, the Transformer achieved the highest BLEU scores on the machine translation tasks tested by the Google research team, and it required significantly less computation and time to train.

In each step, the Transformer applies a self-attention mechanism that directly models relationships between all words in a sentence, regardless of their respective positions.

Like RNNs, the Transformer is an architecture for transforming one sequence into another using the encoder-decoder mechanism, but it differs from the previously existing Seq2Seq models because it does not employ any recurrent network (GRU, LSTM, etc.). Moreover, unlike RNNs, the Transformer handles the entire input sequence at once and does not iterate word by word.

Now, let’s dive deep into the architecture of the Transformer and understand how it works.

Below is a diagram of the Transformer architecture (Figure 1). The left blue box is the encoder and the right blue box is the decoder.

The Encoder

The encoder takes each word in the input sentence, processes it into an intermediate representation and compares it with all the other words in the input sentence. The result of those comparisons is an attention score that evaluates the contribution of each word in the sentence to the word currently being encoded. The attention scores are then used as weights for the words’ representations, which are fed into a fully-connected network that generates a new representation for that word. The encoder does so for all the words in the sentence and transfers the new representations to the decoder, which with this information has all the dependencies it needs to build its predictions.

The Encoder’s Structure

As you can see in the diagram above, the encoder is composed of 2 parts that are replicated 6 times (an arbitrary number that can be tuned):

  1. Multi-Head Self-Attention Mechanism
  2. Fully-Connected Feed-Forward Network

The encoder receives the embedding vectors as a list of vectors, each of dimension 512 (which can be tuned as a hyper-parameter). Both the encoder and the decoder add a positional encoding (explained later) to their input. Both also use a bypass called a residual connection: the original input of each sub-layer is added to its output, followed by a normalization layer (layer normalization).
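To make the structure concrete, here is a minimal PyTorch sketch of one encoder layer, assuming the dimensions mentioned above (512-dimensional vectors, 8 attention heads and an inner feed-forward size of 2048, as in the paper). The class and variable names (EncoderLayer, d_model, n_heads, d_ff) are my own illustrative choices, not code from the paper:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: multi-head self-attention and a position-wise
    feed-forward network, each wrapped with a residual connection and layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: self-attention, then "Add & Norm".
        attn_out, _ = self.self_attn(x, x, x)   # query, key and value are all x
        x = self.norm1(x + attn_out)
        # Sub-layer 2: feed-forward network, then "Add & Norm".
        x = self.norm2(x + self.ffn(x))
        return x

# The full encoder stacks 6 identical layers.
encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])
out = encoder(torch.randn(1, 10, 512))           # (batch, sequence length, d_model)
print(out.shape)                                 # torch.Size([1, 10, 512])
```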

The Decoder

The decoder has access to all of the encoder’s hidden states, which are used at each step to predict the next word. Not all of the hidden states are relevant at every step; hence, each hidden state is weighted differently for each decoder state, and the model learns to “focus” on the relevant parts of the input according to the target. In each iteration, the decoder receives the encoder’s output and the decoder’s last output and uses both for the next step.

The Decoder’s Structure

The decoder is composed of 3 parts that are replicated 6 times:

  1. Masked Multi-Head Self-Attention Mechanism
  2. Multi-Head Attention over the Encoder’s Output (Encoder-Decoder Attention)
  3. Fully-Connected Feed-Forward Network

Unlike the encoder, the decoder adds to the multi-head attention an operation that is called masking. This operation is intended to prevent exposing future information to the decoder. It means that during training the decoder does not get access to tokens in the target sentence that come after the current position, since they would reveal the correct answer and disrupt the learning procedure. It is a really important part of the decoder, because without the masking the model would not learn anything and would just repeat the target sentence.
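A minimal NumPy sketch of how such a mask can be applied (the array names here are illustrative, not from the paper): positions to the right of the current token receive a score of minus infinity before the softmax, so their attention weights become zero.

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)   # raw attention scores (query x key)

# Causal mask: position i may only attend to positions <= i.
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)
scores = np.where(mask, -np.inf, scores)

# Row-wise softmax: masked (future) positions get a weight of exactly 0.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))   # the upper triangle is all zeros
```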

Positional Encoding

The Transformer differs from the recurrent models by processing all the elements in one piece, which loses the information about the order of the elements in the sequence. This information is significant for the next steps; therefore, before the encoder, the model adds to each embedding vector another component that is called the Positional Encoding (PE). This encoding, based on sine and cosine functions at different frequencies, marks the precise location of each word in the sentence.
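The sinusoidal encoding can be written in a few lines of NumPy. This is a sketch of my own based on the formula in the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name is illustrative:

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """Sine on the even dimensions, cosine on the odd ones, one row per position."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2)
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

embeddings = np.random.randn(10, 512)            # 10 words, 512-dimensional embeddings
encoder_input = embeddings + positional_encoding(10)
```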

Residual Connections

Similar to the ResNet architecture, the Transformer adds to each sub-layer’s output the input that entered that sub-layer before processing (this is represented by the “Add & Norm” yellow boxes in Figure 1). The idea behind the residual connection is to make optimization easier. It preserves the information before each operation to enable faster learning in the back-propagation phase, when the model’s output is compared with the target we want to achieve.

Since we want to update the weights correctly along the way, passing the input of each sub-layer forward enables the back-propagation procedure to update the weights easily and efficiently. The network quickly learns how to update the last layers, which are closer to the actual output; but the signal for the layers in the earlier stages blurs over time due to the mathematical operations along the way. If something goes wrong at some point, the addition of the original input helps the network learn the weights instead of having to deal with the complex output alone. It acts as a kind of intermediate input-output comparison after each operation, which prevents the model from producing a monstrous output at the end of the decoding procedure. Without this operation it would be really hard for the model to find out how to change each sub-layer in the learning step to get better results.
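In code, the “Add & Norm” operation itself is very small. A minimal NumPy sketch, with a simplified layer normalization (no learned gain and bias, which a real implementation would have):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

x = np.random.randn(10, 512)              # input to a sub-layer
sublayer_out = np.random.randn(10, 512)   # e.g., the attention or feed-forward output

y = layer_norm(x + sublayer_out)          # "Add & Norm": the shortcut keeps x's information
```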

Fully Connected Feed-Forward Network in the Transformer

Both the encoder and the decoder contain a feed-forward network built from two fully-connected layers with a ReLU activation in between, applied to each position separately. The goal of those layers is to enrich the learning process and to enable the model to learn new dependencies in the sequence by itself.
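Written out, the network computes FFN(x) = max(0, xW1 + b1)W2 + b2. A minimal NumPy sketch, assuming the inner dimension of 2048 used in the paper (the weights here are random placeholders for what would be learned):

```python
import numpy as np

d_model, d_ff = 512, 2048
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

def feed_forward(x):
    # FFN(x) = max(0, x @ W1 + b1) @ W2 + b2, applied to each position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

out = feed_forward(np.random.randn(10, d_model))   # (10 positions, 512) -> same shape
```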

Multi-Head Self-Attention

The principle of self-attention is to check how each word in a sentence is related to the other words. To check the similarity between two words that are represented by vectors, the attention uses the dot product operation. For example, let’s take a look at the sentences below:

“The animal didn’t cross the street because it was too tired”

“The animal didn’t cross the street because it was too wide”

In the first sentence the word “it” refers to the animal, and in the second “it” refers to the street. When the model processes the word “it”, the self-attention finds the association of “it” with the word “animal” in the first case and with the word “street” in the second case.

A single attention head can capture just one aspect of the sentence, and as we know, sentences usually have more than one. For instance, when encoding the word “it”, one attention head will produce the relation with the word “animal”, while another will find the relation with the word “tired”; hence the name “Multi-Head”. The Transformer uses multiple attention heads, each dealing with a different aspect, and together they preserve all the relations in the sentence.

So how does the Multi-Head Attention actually work?

The Multi-Head Attention contains the Scaled Dot-Product Attention, as described in the diagram below:

Now, we will go over the formula of the Scaled Dot-Product Attention, which is computed according to the following equation:

Attention(Q, K, V) = softmax(QKᵀ / √dk) V

First step — Create Key, Value, Query

If we look carefully at Figure 1, we will see that there are three arrows entering the orange Multi-Head Self-Attention box. Those three arrows represent three input vectors that are called Query, Key and Value. Those vectors are created for each word by multiplying its embedding vector by three matrices that are learned during the training process.

The attention operation can be thought of as a retrieval process that applies the key/value/query concepts. For example, when typing a query to find some video on YouTube, the search engine will match it against a set of keys (video title, description, etc.) attached to the candidate videos. The retrieved videos are presented as a list of the best matches in descending order. This ranking is analogous to the weighted values calculated in the attention.

Query vector: represents the word we are currently computing attention for; it plays the role of the decoder’s hidden state in the classic encoder-decoder picture.

Key vector: represents each of the words in the sequence that the query is compared against; it plays the role of the encoder’s hidden states.

Value vector: represents each of the words in the sequence as the content that is actually retrieved; the attention weights are applied to the value vectors.

As you can see in Figure 1, the encoder and the decoder apply self-attention separately, to the source and the target sequences respectively. On top of that, only the decoder applies another attention, where Q is taken from the target sequence and K, V are taken from the source sequence.

Notice: in this architecture, each of the Q/K/V vectors has 64 dimensions, while the embedding and the encoder input/output vectors have 512 dimensions.

We obtain those vectors by multiplying the embedding vector of each word (e.g., “X1” represents the word “Thinking”) by the Wq, Wk and Wv weight matrices. This multiplication produces q1, k1 and v1 (the Query, Key and Value vectors) associated with the word “Thinking”, and q2, k2 and v2 associated with the word “Machines”.
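A small NumPy sketch of this step, with the 512/64 dimensions taken from the description above (the embedding values and projection matrices are random placeholders for what would be learned):

```python
import numpy as np

d_model, d_k = 512, 64
X = np.random.randn(2, d_model)   # embeddings of "Thinking" (X1) and "Machines" (X2)

# Trained projection matrices (random here, learned in practice).
Wq = np.random.randn(d_model, d_k)
Wk = np.random.randn(d_model, d_k)
Wv = np.random.randn(d_model, d_k)

Q = X @ Wq     # rows q1, q2: one query vector per word
K = X @ Wk     # rows k1, k2
V = X @ Wv     # rows v1, v2
print(Q.shape) # (2, 64)
```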

Second Step — Scores Calculation

The second step is calculating the scores. Say we are calculating the self-attention for the first word, in this example “Thinking”; a score will be calculated for each word of the input sentence against this word.

To calculate the score we take the dot product between the query vector q1 and the key vector of the word that we score. For instance, to score the word “Machines” while calculating the self-attention for “Thinking”, we take the dot product of q1 with k2 and get 96, as you can see in the diagram below.

Third Step — Normalization

In the self-attention equation above, dk represents the dimension of the queries and keys (64 dimensions). The scores are divided by √dk (here √64 = 8), which normalizes them and keeps the softmax gradients stable.

Fourth Step — Softmax

Then, the scaled scores are passed through a softmax function. This function turns the scores into probabilities: it forces them to be positive and to sum up to 1.

Fifth Step — Multiply by Value

Next, each softmax score is multiplied by the value vector of the corresponding word.

Sixth Step — Summary Vector

Finally, the output of the self-attention layer for the current word is obtained by summing up the weighted value vectors.
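Putting steps two through six together, here is a minimal NumPy implementation of the Scaled Dot-Product Attention; it reuses the random Q, K, V projections from the sketch above:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                 # step 2: dot products between queries and keys
    scores = scores / np.sqrt(d_k)   # step 3: scale by the square root of dk
    weights = softmax(scores)        # step 4: positive weights that sum to 1 per word
    return weights @ V               # steps 5-6: weight the value vectors and sum them

d_model, d_k = 512, 64
X = np.random.randn(2, d_model)      # embeddings of "Thinking" and "Machines"
Wq, Wk, Wv = [np.random.randn(d_model, d_k) for _ in range(3)]
Z = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(Z.shape)                       # (2, 64): one output vector per word
```

In the Multi-Head Attention this computation runs 8 times in parallel with different projection matrices, and the 8 outputs are concatenated and projected back to 512 dimensions.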

Summarizing the Transformer Concept

The animation below illustrates the processing that occurs inside the Transformer while working on a machine translation task. First, the Transformer generates initial representations (embeddings) for each word, represented by the unfilled circles. Then, the encoder uses the embeddings to generate the key, query and value vectors for each of the words, represented by the filled circles.

The decoder then generates the output sentence word by word, while considering the representations that were created by the encoder and the decoder’s previous outputs.

End Notes

This blog-post summarized the theoretical side of the Transformer. It reviewed the new architecture step by step and explained the meaning behind it. Soon, hopefully, I’ll publish another tutorial that will cover the practical side as well :)
