Recurrent Neural Networks were great in mid-2017. They were able to do everything a successful sequence model was supposed to do, albeit with some drawbacks. Then transformers (Attention Is All You Need) came along, and soon enough every state-of-the-art model in NLP was a transformer. More recently, with OpenAI’s ImageGPT and DALL-E, transformers have been applied as autoregressive models for images as well. It has become increasingly clear that transformers are extremely versatile models that can be applied to many tasks, not just text. Let’s first take a high-level look at how transformers work.
The easiest way of thinking about a transformer is as an encoder-decoder model that can manipulate pairwise connections within and between sequences. What this means is that a transformer can learn the relationship between any two sequences, as long as there are meaningful connections between the tokens on the encoder and decoder sides. For example, a transformer can easily translate between any two languages, because every language has relationships between the words in a sentence. Another example is how transformers can be used to predict the next pixels in an image autoregressively.
This has also been shown to work in captioning models such as the Meshed-Memory Transformer (Cornia et al.), where the transformer translates between the objects in an image and the English language. Since objects in an image have pairwise relationships, the transformer was well suited to the task. Later it will become clear why transformers are so good with sequences that have pairwise connections.
Let’s now get into the guts of the transformer. The transformer is characterized by its different training and inference procedures. During training, it computes all the outputs simultaneously, while during inference it computes the outputs sequentially, not unlike a traditional RNN. Where an RNN would have to step through the sequence one token at a time during training, the transformer’s architecture lets it process the whole sequence in parallel, making training far more efficient. In this article we will cover the encoder architecture in depth, while in the next one we will cover the decoder and the relationship between the encoder and decoder. The code below builds a transformer to translate between English and German.
All of the following code is taken from here and will be explained in depth.
We will start with the imports. I will explain each of these when they turn up.
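The original import cell isn’t reproduced here, but a cell covering everything used in this article would look roughly like the following. The torchtext.legacy path is an assumption that depends on your torchtext version; older releases expose the same classes under torchtext.data and torchtext.datasets.

```python
import random
import math

import numpy as np
import spacy
import torch
import torch.nn as nn

# In torchtext >= 0.9 the Field/BucketIterator API lives under torchtext.legacy;
# in older versions import from torchtext.data and torchtext.datasets instead.
from torchtext.legacy.data import Field, BucketIterator
from torchtext.legacy.datasets import Multi30k
```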
The next cell deals with random seeding. Seeding every random number generator at the start of the notebook makes sure the random calls are consistent from run to run.
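A typical seeding cell, assuming the usual SEED constant, might be:

```python
SEED = 1234  # any fixed value works; 1234 is just a common choice

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True  # trade a little speed for reproducibility
```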
Here we download and load the spaCy language models (you can change the languages if you want, but right now we are using German and English).
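Something along these lines, assuming the small German and English spaCy models (de_core_news_sm and en_core_web_sm):

```python
# Download the models once (from a notebook cell or the shell):
#   python -m spacy download de_core_news_sm
#   python -m spacy download en_core_web_sm

spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')
```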
In this cell, we define the tokenization functions, which take a sentence in either language and decompose it into a list of words.
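A minimal version of those two functions, reusing the spacy_de and spacy_en objects loaded above, could be:

```python
def tokenize_de(text):
    """Tokenize a German sentence into a list of word strings."""
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """Tokenize an English sentence into a list of word strings."""
    return [tok.text for tok in spacy_en.tokenizer(text)]
```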
The cell below has the torchtext Field objects. The Field object is a very versatile tool that can be used to build a vocabulary, convert words to integer token indices, and convert token indices back to words. Essentially it creates a lookup table for all the words you provide it, with each word corresponding to exactly one token index. In this way, instead of encoding each word as a one-hot vector, which is very sparse and would be a waste of memory, it is encoded as a single integer: the index of the one in its corresponding one-hot vector. The parameter batch_first means that the inputs to the transformer will be of the shape (N, S), where N is the batch size and S is the sequence length.
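A sketch of the two fields; the SRC/TRG names, the <sos>/<eos> tokens, and the lowercasing are assumptions about the notebook:

```python
SRC = Field(tokenize=tokenize_de,
            init_token='<sos>',   # start-of-sentence token
            eos_token='<eos>',    # end-of-sentence token
            lower=True,
            batch_first=True)     # batches come out as (N, S)

TRG = Field(tokenize=tokenize_en,
            init_token='<sos>',
            eos_token='<eos>',
            lower=True,
            batch_first=True)
```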
Now we can load the data. For this specific example, we will load the Multi30k dataset and set the fields to the ones we specified before. This way, when we create the train, validation, and test datasets, they will come prebuilt with our custom fields. After we have our data, we can build each field's vocabulary from the training set. The min_freq argument is just a filter: any word which appears only once in the dataset will not be included in the vocabulary. When the model comes across a word that is foreign to it, it will represent it with an <unk> token.
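With the fields defined, loading Multi30k and building the vocabularies might look like this; min_freq=2 is the assumption that matches the "appears only once" filter described above:

```python
# German is the source language ('.de'), English is the target ('.en').
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'),
                                                    fields=(SRC, TRG))

# Build the vocabularies from the training set only; words seen fewer than
# 2 times are mapped to <unk>.
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
```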
Below is the code for initializing the device, batch size, and iterators (in torchtext they are called iterators, while in torch they are called dataloaders). Unlike the datasets we initialized above, iterators cannot be indexed, so during training an enumerate() is usually called on them. The special type of iterator we are using is called a bucket iterator. Essentially, it splits the data (sentences) into batches such that the least amount of padding is needed. Why do we need padding? Simply because not all sentences are the same length, and splitting them into batches requires us to pad the shorter sentences.
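A plausible version of that cell, with BATCH_SIZE chosen arbitrarily here:

```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128  # an assumption; tune to your GPU memory

# BucketIterator groups sentences of similar length together so that each
# batch needs as little padding as possible.
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    device=device)
```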
Now we begin to get into the methods and classes that make up the transformer. We begin with the PositionwiseFeedForwardLayer class, which consists of two linear layers. Its objective is to transform the input dimension (hid_dim) into a substantially larger dimension (pf_dim), then transform it back into the input dimension for future layers. The paper does not explain the reasoning behind this, but one intuition is that temporarily expanding to more neurons gives the model extra capacity to transform each token's representation while keeping the output in the same dimensionality. A dropout layer is also applied to discourage overfitting. The input size is (N, S, H), where N is the batch size, S is the sequence length, and H is the hidden dimension size. Between the two fully connected layers, the hidden dimension changes to the pf_dim size, and reverts to H after the last fully connected layer.
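A sketch of the class as described; exactly where the dropout sits relative to the activation is an assumption:

```python
class PositionwiseFeedForwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        self.fc_1 = nn.Linear(hid_dim, pf_dim)   # expand hid_dim -> pf_dim
        self.fc_2 = nn.Linear(pf_dim, hid_dim)   # project back pf_dim -> hid_dim
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # x: [N, S, hid_dim]
        x = self.dropout(torch.relu(self.fc_1(x)))  # [N, S, pf_dim]
        x = self.fc_2(x)                            # [N, S, hid_dim]
        return x
```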
The DNA of the Transformer model, multi-head attention allows the Transformer to learn the relationships between tokens in the sequences it processes. To understand multi-head attention, scaled dot-product attention must be understood first.
In scaled dot-product attention, the input consists of a query matrix, a key matrix, and a value matrix. First, the query matrix and the transposed key matrix are multiplied together. This results in a size of (N, Q, H)•(N, H, K) = (N, Q, K), where Q and K are the lengths of the query sequence and key sequence respectively. Intuitively, this multiplication produces a 2D map between elements in the query sequence and elements in the key sequence, providing a representation of the relationships between all the words in the two sequences. This dot-product map is then run through a softmax over the key dimension, which squishes the values to between 0 and 1 (summing to 1 across the keys) to represent how strongly the model should attend to each pair of tokens. This attention matrix is then multiplied by the value matrix, which is usually the same size as the key matrix, giving a final size of (N, Q, H). We can use a condensed example to understand what is going on: if we use an English sentence as the query and a German sentence as the key, then the attention would represent how strongly each English word corresponds to each German word. When it is multiplied by the German sentence, it represents how much importance each English word should be given when compared with its German counterpart. Also notice that the input size and the output size are the same, allowing these attention layers to be stacked on top of one another. Above is a visual depiction of this type of attention.
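As a standalone illustration (the notebook folds this logic into the multi-head attention class discussed next), scaled dot-product attention can be written in a few lines; the sqrt(d) scaling it is named after is explained in the next section:

```python
def scaled_dot_product_attention(query, key, value, mask=None):
    # query: [N, Q, H], key: [N, K, H], value: [N, K, H]
    d = query.shape[-1]
    # (N, Q, H) @ (N, H, K) -> (N, Q, K), scaled by sqrt(d)
    energy = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d)
    if mask is not None:
        energy = energy.masked_fill(mask == 0, -1e10)  # never attend to masked positions
    attention = torch.softmax(energy, dim=-1)          # softmax over the key dimension
    return torch.matmul(attention, value)              # (N, Q, K) @ (N, K, H) -> (N, Q, H)
```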
A more evolved form of scaled dot-product attention, multi-head attention uses multiple heads, similar to the multiple filters in a CNN, to encourage the model to capture multiple types of pairwise connections between the query and key sequences (the value is usually the same as the key). First, the inputs are run through three separate linear layers; the layers all have the same size, but they are distinct layers with their own weights. The main highlight of this type of attention is that the hid_dim dimension of the input Q, K, and V is split into two dimensions, one being the number of heads (n_heads) and the other being the head_dim (shown in the code below). This essentially splits each hid_dim-sized vector into n_heads vectors of head_dim size each. The scaled dot-product attention inside each head also has one key difference: right before the softmax, the product of the query and the key is divided by the square root of head_dim. According to the paper's authors, this prevents the dot products from becoming too large, which would push the softmax into regions with vanishingly small gradients. After all the heads have been computed, they are concatenated, and a final linear layer is applied to this output. In the code below, there is a mask parameter which, for the encoder, just sets the scores at positions holding the pad token to essentially -inf, so that when the softmax is called those values become 0. This prevents the model from attending to the padding tokens, which would make training and inference inconsistent.
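A sketch of the multi-head attention layer as described above; details such as applying dropout to the attention weights and returning them alongside the output are assumptions:

```python
class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout):
        super().__init__()
        assert hid_dim % n_heads == 0
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads

        self.fc_q = nn.Linear(hid_dim, hid_dim)  # three separate linear layers
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        self.fc_o = nn.Linear(hid_dim, hid_dim)  # final output projection

        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(self.head_dim)

    def forward(self, query, key, value, mask=None):
        N = query.shape[0]

        # Project the inputs, split hid_dim into (n_heads, head_dim),
        # and move the heads next to the batch dimension.
        Q = self.fc_q(query).view(N, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        K = self.fc_k(key).view(N, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        V = self.fc_v(value).view(N, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)

        # [N, n_heads, Q_len, K_len], scaled by sqrt(head_dim)
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)  # e.g. padding positions
        attention = torch.softmax(energy, dim=-1)

        x = torch.matmul(self.dropout(attention), V)       # [N, n_heads, Q_len, head_dim]
        # Concatenate the heads back into a single hid_dim vector per token.
        x = x.permute(0, 2, 1, 3).contiguous().view(N, -1, self.hid_dim)
        return self.fc_o(x), attention
```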
Below is the EncoderLayer class, which contains the bulk of the operations that the encoder consists of. It begins with a self-attention layer, which essentially finds how the words in the input sequence relate to each other. Then, the result is added back to the input through a residual connection and a layer normalization is applied, which makes each feature have a mean of 0 and a standard deviation of 1. Next, the position-wise feed-forward layer is applied, as previously explained, followed by another residual connection and layer normalization, and the encoder layer is done. The residual connections and layer norms keep training stable in deep networks such as these, while dropout is applied throughout to discourage overfitting.
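A sketch of the encoder layer, wiring together the two sub-layers defined above; the residual-then-LayerNorm ordering follows the original paper and is an assumption about the notebook:

```python
class EncoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout):
        super().__init__()
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout)
        self.feedforward = PositionwiseFeedForwardLayer(hid_dim, pf_dim, dropout)
        self.attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask):
        # src: [N, S, hid_dim]; the sequence attends to itself.
        _src, _ = self.self_attention(src, src, src, src_mask)
        src = self.attn_layer_norm(src + self.dropout(_src))   # residual + layer norm

        _src = self.feedforward(src)
        src = self.ff_layer_norm(src + self.dropout(_src))     # residual + layer norm
        return src
```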
The Encoder class involves both the EncoderLayer class as well as converting a sequence of word tokens into feature vectors with a positional encoding. Each index in the input sequence is transformed into a hid_dim-sized vector with the nn.Embedding() class. nn.Embedding() is a lookup table that represents integer values with vectors. This is akin to converting a one-hot vector into a learned feature vector for every word. After this, since the model has no sense of position, a positional encoding is added to the input, using a second nn.Embedding() of the same kind as the token embedding (a learned positional embedding, rather than the fixed sinusoidal encoding from the original paper). The result is then run through the stack of encoder layers, using a mask built from the padding indices as the attention mask. The output of this is an encoded input sequence that contains information about the input sequence thanks to the multi-head self-attention. In the next article, we will cover how the decoder works and how the encoder feeds into the decoder.
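Finally, a sketch of the Encoder itself; the max_length cap on positional embeddings and the sqrt(hid_dim) scaling of the token embeddings are assumptions borrowed from the original architecture:

```python
class Encoder(nn.Module):
    def __init__(self, input_dim, hid_dim, n_layers, n_heads,
                 pf_dim, dropout, device, max_length=100):
        super().__init__()
        self.device = device
        self.tok_embedding = nn.Embedding(input_dim, hid_dim)   # token index -> vector
        self.pos_embedding = nn.Embedding(max_length, hid_dim)  # position index -> vector
        self.layers = nn.ModuleList(
            [EncoderLayer(hid_dim, n_heads, pf_dim, dropout) for _ in range(n_layers)])
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(hid_dim)

    def forward(self, src, src_mask):
        # src: [N, S] of token indices
        # src_mask is built outside, e.g. (src != pad_idx).unsqueeze(1).unsqueeze(2)
        N, S = src.shape
        pos = torch.arange(0, S, device=self.device).unsqueeze(0).repeat(N, 1)
        src = self.dropout(self.tok_embedding(src) * self.scale + self.pos_embedding(pos))
        for layer in self.layers:
            src = layer(src, src_mask)
        return src  # [N, S, hid_dim]
```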
The transformer, put simply, is a complex network. So complex that it takes multiple long articles just to explain it thoroughly. However, its complexity is not just for show; it allows the transformer to understand and manipulate sequences like no model before it. Since countless tasks in ML can be represented with sequences, it is no surprise that the transformer has begun to dominate the field. Make sure to be on the lookout for Part 2 coming soon!