How Seq2Seq (Sequence-to-Sequence) Models Evolved into Transformers Using the Attention Mechanism

Natural Language Processing (NLP) is a branch of Artificial Intelligence that helps machines understand human natural language. Humans communicate with each other mostly through voice or text, but how does NLP work? One type of data that we can deal with in machine learning is sequential data, which is data arranged in a sequence (e.g., text, voice). NLP can build systems that take a sequence of data as input, process it, and produce another sequence of data. Such a system is called a sequential model, or seq2seq (sequence (input) to sequence (output)).

In this blog post we will explore what a sequential model is and the journey from seq2seq to the Transformer. I will assume the input is a text sentence throughout this post. Please note that I am not diving into the details of the architecture of the models we explore. However, I will explain the high-level idea and how Transformers were introduced.

Seq2Seq

Seq2seq is an encoder-decoder model that takes a sentence (a sequence of words) as input and produces another sentence as output. The encoder consists of multiple RNN (Recurrent Neural Network) cells that can be stacked together. The RNN reads the input sequentially, encoding the sentence word by word, and produces an output. This output is the final hidden state (also known as the encoder vector or context vector), which encapsulates the information in the input. This vector is the input to the decoder. The decoder is also a stack of RNN cells, but its job is to decode the hidden state and convert it into predicted output words.
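To make the encoder-decoder flow concrete, here is a minimal sketch in plain Python. It is not the architecture from any particular paper: the toy vocabulary, the sizes `EMBED` and `HIDDEN`, the random untrained weights, and the helper names (`rnn_step`, `encode`, `decode`) are all assumptions chosen for illustration. It only shows the encoder reading words one at a time and handing its final hidden state to the decoder.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

VOCAB = ["<start>", "i", "like", "tea", "very", "much"]  # hypothetical toy vocabulary
EMBED = 3    # word-vector size (assumed)
HIDDEN = 4   # hidden-state size (assumed)

def rand_matrix(rows, cols):
    """Small random weight matrix; a real model would learn these values."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Matrix-vector product on plain lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def new_rnn():
    """Weights of one vanilla RNN cell (biases left out for brevity)."""
    return {"w_x": rand_matrix(HIDDEN, EMBED), "w_h": rand_matrix(HIDDEN, HIDDEN)}

def rnn_step(params, x, h):
    """One RNN step: h_new = tanh(W_x x + W_h h)."""
    from_input = matvec(params["w_x"], x)
    from_state = matvec(params["w_h"], h)
    return [math.tanh(a + b) for a, b in zip(from_input, from_state)]

embeddings = {word: [random.uniform(-1, 1) for _ in range(EMBED)] for word in VOCAB}
encoder, decoder = new_rnn(), new_rnn()
w_out = rand_matrix(len(VOCAB), HIDDEN)  # maps a hidden state to one score per word

def encode(words):
    """Read the sentence word by word and return the final hidden state."""
    h = [0.0] * HIDDEN
    for word in words:
        h = rnn_step(encoder, embeddings[word], h)
    return h  # the context vector

def decode(context, max_len):
    """Start from the context vector and greedily emit one word per step."""
    h, word, output = context, "<start>", []
    for _ in range(max_len):
        h = rnn_step(decoder, embeddings[word], h)
        scores = matvec(w_out, h)
        word = VOCAB[scores.index(max(scores))]
        output.append(word)
    return output

short_context = encode(["i", "like", "tea"])
long_context = encode(["i", "like", "tea", "very", "much"])
print(len(short_context), len(long_context))  # both equal HIDDEN
print(decode(short_context, max_len=3))       # untrained, so the words are arbitrary
```

Notice that the context vector has the same length whether the input has three words or five: everything the decoder knows about the input has to fit into those `HIDDEN` numbers.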

If you want to dive deeper into seq2seq, “Sequence Learning” [1] is a research paper published by Google that is worth reading.

This type of model was used in tasks like machine translation, voice and entity recognition, sentiment classification, and more. However, the hidden state vector has a fixed size, so it is hard for the encoder to compress all the information from the sentence into it. If the input is long, the model can suffer from what is called the information bottleneck. This is where the attention mechanism was proposed as a solution to this limitation.

Attention Mechanism

In everyday English, attention means focusing on something, and the same idea applies in deep learning. The attention mechanism was introduced by Dzmitry Bahdanau et al. in their paper [2]. It is added to the encoder-decoder model to help the model, and specifically the decoder, focus on the relevant words or parts of the input sequence. In particular, the encoder passes all of the hidden states it produces to the decoder, instead of only the final one as in plain seq2seq, so the information from the input no longer has to be squeezed into a single vector. This technique showed better results.
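The idea of weighting every encoder hidden state can be sketched in a few lines of plain Python. This is a rough illustration of additive scoring in the spirit of [2], not a reproduction of the paper's model: the sizes `HIDDEN` and `ATTN`, the random untrained weights, and the function name `attend` are assumptions, and the encoder hidden states are random stand-ins rather than the output of a real RNN.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

HIDDEN = 4  # size of each encoder/decoder hidden state (assumed)
ATTN = 5    # size of the internal attention layer (assumed)

def rand_matrix(rows, cols):
    """Small random weight matrix; a real model would learn these values."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    """Matrix-vector product on plain lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states, w_dec, w_enc, v):
    """Additive attention: score_i = v . tanh(W_dec s + W_enc h_i)."""
    from_decoder = matvec(w_dec, decoder_state)
    scores = []
    for h in encoder_states:
        from_encoder = matvec(w_enc, h)
        mixed = [math.tanh(a + b) for a, b in zip(from_decoder, from_encoder)]
        scores.append(sum(vi * mi for vi, mi in zip(v, mixed)))
    weights = softmax(scores)
    # Weighted sum of ALL encoder hidden states, one weight per input word.
    context = [
        sum(w * h[j] for w, h in zip(weights, encoder_states))
        for j in range(HIDDEN)
    ]
    return weights, context

# Stand-ins for the hidden states an RNN encoder would produce for a 5-word input.
encoder_states = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(5)]
decoder_state = [random.uniform(-1, 1) for _ in range(HIDDEN)]
w_dec, w_enc = rand_matrix(ATTN, HIDDEN), rand_matrix(ATTN, HIDDEN)
v = [random.uniform(-0.5, 0.5) for _ in range(ATTN)]

weights, context = attend(decoder_state, encoder_states, w_dec, w_enc, v)
print([round(w, 3) for w in weights])  # one weight per input word
print(len(context))                    # same size as a single hidden state
```

The decoder would call `attend` again at every output step, so each output word gets its own set of weights over the input words, and the context it receives is rebuilt each time rather than fixed once.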

Now, what if we want to make the model faster? This is where the Transformer was introduced.

Transformer

The Transformer was proposed by Google in the paper “Attention Is All You Need” [3]. It is a deep learning model that also follows an encoder-decoder architecture and uses the attention mechanism, but it does not rely on RNNs, which speeds up the model. It consists of an encoder, which can be viewed as a stack of encoding layers (six layers in the paper), and a decoder made up of the same number of stacked decoding layers.

Each encoding layer is made up of two sub-layers, an attention layer and a Feed-Forward Neural Network (FFNN), while each decoding layer is made up of three sub-layers: two attention layers and one FFNN layer.

The Transformer — model architecture [3].
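The encoder side of this structure (an attention sub-layer followed by an FFNN sub-layer, stacked six times) can be sketched in plain Python. This is a heavily simplified illustration, not the model from [3]: it uses a single attention head and leaves out the residual connections, layer normalization, positional encodings, and biases that the paper uses, and it does not show the decoder. The tiny sizes `D_MODEL` and `D_FF`, the random untrained weights, and all function names are assumptions.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

D_MODEL = 4     # size of each word representation (assumed, tiny for illustration)
D_FF = 8        # inner size of the feed-forward network (assumed)
NUM_LAYERS = 6  # number of stacked encoding layers, as in the paper

def rand_matrix(rows, cols):
    """Small random weight matrix; a real model would learn these values."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Matrix product on lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention in which every word looks at every word."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(D_MODEL)
    # Row i holds the attention weights of word i over all words in the sentence.
    weights = [
        softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k])
        for q_i in q
    ]
    return matmul(weights, v)

def feed_forward(x, w_1, w_2):
    """FFNN applied to each word separately: linear, ReLU, linear."""
    hidden = [[max(0.0, value) for value in row] for row in matmul(x, w_1)]
    return matmul(hidden, w_2)

def new_layer():
    """Weights for one encoding layer."""
    return {
        "w_q": rand_matrix(D_MODEL, D_MODEL),
        "w_k": rand_matrix(D_MODEL, D_MODEL),
        "w_v": rand_matrix(D_MODEL, D_MODEL),
        "w_1": rand_matrix(D_MODEL, D_FF),
        "w_2": rand_matrix(D_FF, D_MODEL),
    }

def encoder_layer(x, p):
    """Sub-layer 1 (attention) followed by sub-layer 2 (FFNN)."""
    attended = self_attention(x, p["w_q"], p["w_k"], p["w_v"])
    return feed_forward(attended, p["w_1"], p["w_2"])

def encoder(x, layers):
    """Pass the whole sentence through each encoding layer in turn."""
    for p in layers:
        x = encoder_layer(x, p)
    return x

# Stand-in word vectors for a 3-word sentence.
sentence = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(3)]
layers = [new_layer() for _ in range(NUM_LAYERS)]
encoded = encoder(sentence, layers)
print(len(encoded), len(encoded[0]))  # one vector of size D_MODEL per input word
```

Unlike the RNN, nothing here loops over the words one after another with a carried-over hidden state: the attention weights for all positions come from matrix products over the whole sentence, which is what lets the computation be parallelized across words.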

For further reading, the paper “An Attentive Survey of Attention Models” [4] provides a structured and comprehensive overview of developments in modeling attention, including an attention taxonomy.

Also, “The Annotated Transformer” is an “annotated” version of the “Attention Is All You Need” paper, presented as a line-by-line implementation by the Harvard NLP team.