Learning sequential data is an important problem in machine leaning that arises in many applications such as time-series analysis and natural language processing. Previously, RNN with LSTM was the most popular model in this area, and nowadays, transformer with self-attention is the state of the art model for NLP applications.
In this post, I am going to put myself on behalf of transformer creators and rethink the problem. First, I will define the main challenge: summarizing a sequence in a fixed dimensional feature vector. Then, I will explain the self-attention solution: weighted averaging where the weighting function is a CNN with 1×1 convolutional filter. I will continue with the transformers idea: using the self-attention as a layer of a neural network and keep the number of parameters reasonable. And at the end, I will address how the problem is similar to the Mixture-of-experts. I will try to just focus on the main ideas and leave the details and implementation issues to the many great tutorials you can find on the net.
Suppose we have observed a sequence of n d-dimensional data (that can be raw input or extracted features) and we want to aggregate their information and estimate the output. For example, consider an RNN that takes a sentence of n words (x₁,…,xₙ) as input, transforms them to a hidden vector space that results in n d-dimensional features (h₁,…,hₙ), takes the average of all features (h), and finally estimates the label. In this example, the aggregation (or compression or summarization) is done by averaging: h = avg(h₁,…,hₙ). The question is what else we can do instead of simply taking the average?
The first idea comes to my mind is to concatenate the features to not loose any information, build an nd dimensional vector, and using a fully connected layer on top of that. But in this way, we should learn O(nd) number of parameters which can be too large for long sequences. Actually we wish the number of parameters to be independent of n, so that we can support sequences with any length. The two classic solution to this problem is taking the average (h = avg(h₁,…,hₙ)), or using the latest feature (h = hₙ) (which is a summary of all previous features in RNNs). Both of these methods result in O(d) number of parameters but with different weighting function: the later consider equal weights for all features, and the former gives all weights to the latest feature. However, the importance of features can be different from input to input (e.g. the word disgusting in the beginning of a sentence might be enough to label it as negative). Self-attention is way to learn a weighting function, and let the features decide about their importance themselves.
Now it’s time to build the weighting function which takes a sequence of features as input and set their weight for the averaging . The simplest solution is using a fixed linear transformation — which is simply a dot product — and then normalizing the evaluated scores to one:
where w is the fixed linear transformation that should be learned. Therefore, those features who really want to be important should align themselves with the direction of w and build a larger dot product. w can also be seen as a 1×1 convolutional filter applying on a 1xn image with depth d.
But why just considering one direction (filter) and a simple linear transformation? let’s make it k directions (see Fig. 1.) and add a simple non-linearity to learn a more complex function:
Here, I have just used the matrix notation: H is nxd (each row is one of the hᵢᵀ), W₁ is dxk (each column is a direction like w), W₂ is kx1(the final transform), and so S and A are both nx1. Now we should learn W₁ and W₂ with k(d+1) parameters (Check Fig. 2 for an example.).
In this way, we are learning to pay our attention to a specific segment of the sequence. Why not concentrating on different parts, and learning p>1 weighting function at the same time? Changing W₂ from kx1 to kxp is enough for this purpose and after that, we get nxp score matrix S, nxp weight matrix A, and dxp embedding matrix h. It is equivalent to add another 1×1 convolutional layer with p filters that results in a two layered CNN. In this way, we are actually learn to select the p most important features and concatenate them in a dxp embedding matrix (compare it to the idea of concatenating all features that was impractical).
Therefore, the self-attention weighting function is a two layered CNN that exploit CNNs weight sharing properties, and the number of parameters are k(d+p), where d is the initial dimension, k is the number of filters in the first layer, and p is the number of attention points.
Thus far, we have seen the core idea of self-attention that is learning a parametric weighting function with reasonable number of parameters. But that was just an extension of aggregating function. However, maybe we can use it as a standalone feature extractor, because it takes a sequence of n d-dim vectors and transform it to a sequence of p d-dim vectors. And then what if we can use this feature extractor as a layer of a neural network? and maybe make it deep later?
I think, they were the thoughts of the authors of  when they were designing transformers and creating self-attention layer. They wanted to remove RNNs and replace them with a stacked self-attention layers. And there was an important challenge that should be dealt with: the number of parameters. Intuitively, we should be able to concentrate on p=n (or O(n)) elements in the lower layers and if so, the number of parameters will be k(d+n) which is dependent on n.
To overcome this challenge, a nice idea has been employed: Setting p=n, W₂ is a kxn matrix with many number of parameters, and we should somehow decrease it. So lets decompose it to (HK)ᵀ, where H is the original nxd matrix and K is a dxk matrix. Incorporating H leads to a high decrease in the number of parameters where we only need to learn the matrix K with kd number of elements. Removing the tanh function, we can rewrite the formulas as follows:
Another minor extension: the input dimension is d, and in the previous version we also get some d dimensional output vectors; Lets also reduce the input dimension from d to k (number of filters), and get some k-dimensional output by changing the embedding formula to h=(HV)ᵀA, where V is a dxk matrix:
And that’s transformer! Whatever you may see in  is the details of implementation and changes in notation: name HW₁ query, HK key, and HV value matrices. Now the number of parameters are 3dk where d and k are the dimensions of input and output vectors, respectively, and we get a sequence of n k-dim vectors, after applying the self-attention layer on a sequence of n d-dim vectors. As you see, the number of parameters are independent of n and we can stack self-attention layers and build a deep model, which is transformer. I am not going to dig in the implementation details and suggest the interested readers to see  for that.
The MoE idea is to judge based on different experts opinions: for an arbitrary input value, collect all experts’ opinions, and take their weighted average, where weights are based on the level of experts’ specialty on the region of that input. For example, in a regression problem, the output may be more a linear function of the input in some regions, and more a quadratic function in some other regions. The concern of MoE is how to learn the experts themselves and the weighting function (named gating network there) simultaneously.
Now assume h₁,…,hₙ are the opinions of different experts and the self-attention is the gating network which set weights based on the feature values. The idea is the same and the attention weighting network is choosing the level of importance of the features based on their values. Also it is very similar to Mixture density networks where the mixture coefficients are predicted using a neural network (see see 5.6 and 14.5.3 from ).