Learning from sequential data is an important problem in machine learning that arises in many applications such as time-series analysis and natural language processing. Previously, the RNN with LSTM cells was the most popular model in this area; nowadays, the transformer with self-attention is the state-of-the-art model for NLP applications.

In this post, I am going to put myself in the shoes of the transformer's creators and rethink the problem. First, I will define the main challenge: *summarizing a sequence in a fixed-dimensional feature vector.* Then, I will explain the self-attention solution: *weighted averaging where the weighting function is a CNN with a 1×1 convolutional filter*. I will continue with the transformer's idea: *using self-attention as a layer of a neural network while keeping the number of parameters reasonable*. At the end, I will discuss how the problem is similar to the Mixture-of-Experts. I will try to focus on the main ideas and leave the details and implementation issues to the many great tutorials you can find on the net.

Suppose we have observed a sequence of *n* *d*-dimensional data points (raw input or extracted features) and we want to **aggregate** their information to estimate the output. For example, consider an RNN that takes a sentence of *n* words *(x₁,…,xₙ)* as input, transforms them into a hidden vector space that results in *n* *d*-dimensional features *(h₁,…,hₙ)*, takes the average of all features *(h)*, and finally estimates the label. In this example, the aggregation (or compression, or summarization) is done by averaging: *h = avg(h₁,…,hₙ).* The question is: what else can we do instead of simply taking the average?
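To make the setup concrete, here is a minimal NumPy sketch of the averaging aggregation described above. The shapes *n* and *d* and the matrix `H` (whose rows stand in for *h₁,…,hₙ*) are illustrative placeholders, not part of any specific model:

```python
import numpy as np

# Hypothetical sequence of n features, each d-dimensional,
# e.g. the hidden states (h_1, ..., h_n) produced by an RNN.
n, d = 5, 8
rng = np.random.default_rng(0)
H = rng.standard_normal((n, d))  # row i is the feature h_i

# Aggregation by simple averaging: h = avg(h_1, ..., h_n).
# The n features are compressed into a single d-dimensional vector.
h = H.mean(axis=0)

print(h.shape)  # (d,) -- a fixed-dimensional summary, independent of n
```

Whatever *n* is, the summary `h` always has *d* dimensions, which is exactly the property we want from an aggregation.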

The first idea that comes to mind is to concatenate the features so that no information is lost, build an *nd*-dimensional vector, and use a fully connected layer on top of that. But in this way, we would have to learn *O(nd)* parameters, which can be too many for long sequences. Actually, we want the number of parameters to be independent of *n*, so that we can support sequences of any length. The two classic solutions to this problem are taking the average *(h = avg(h₁,…,hₙ))* or using the latest feature *(h = hₙ)* (which, in RNNs, is a summary of all previous features). Both of these methods need only *O(d)* parameters but use different *weighting functions*: the former assigns equal weight to all features, while the latter gives all the weight to the latest feature. However, the importance of features can differ from input to input (e.g., the word *disgusting* at the beginning of a sentence might be enough to label it as negative). **Self-attention is a way to learn a weighting function**, letting the features decide about their importance themselves.
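The two classic solutions can both be written as weighted averages with fixed weight vectors, which makes the contrast explicit. A small sketch (shapes and names are illustrative, as before):

```python
import numpy as np

n, d = 5, 8
rng = np.random.default_rng(1)
H = rng.standard_normal((n, d))  # features h_1, ..., h_n as rows

# Averaging: every feature gets the same fixed weight 1/n.
w_avg = np.full(n, 1.0 / n)
h_avg = w_avg @ H                # identical to H.mean(axis=0)

# Latest-feature summary: all of the weight goes to h_n.
w_last = np.zeros(n)
w_last[-1] = 1.0
h_last = w_last @ H              # identical to H[-1]
```

In both cases the weights are fixed in advance; self-attention replaces these hard-coded weight vectors with weights computed from the features themselves.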

Now it’s time to build the weighting function, which takes a sequence of features as input and sets their weights for the averaging [1]. The simplest solution is to use a fixed linear transformation (which is simply a dot product) and then normalize the resulting scores so that they sum to one:
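A minimal sketch of this simplest weighting function, assuming NumPy and using a softmax for the normalization step (one common way to make scores sum to one); the scoring vector `w` is an illustrative learned parameter:

```python
import numpy as np

n, d = 5, 8
rng = np.random.default_rng(2)
H = rng.standard_normal((n, d))  # features h_1, ..., h_n as rows
w = rng.standard_normal(d)       # learned scoring vector (one per model, not per position)

# Score each feature with the same dot product: one scalar per position.
scores = H @ w                   # shape (n,)

# Normalize the scores so the weights sum to one (softmax).
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Weighted average: features with higher scores contribute more.
h = weights @ H                  # shape (d,)
```

Note that `w` has only *d* parameters, so the parameter count stays independent of the sequence length *n*, yet the weights now depend on the content of the features.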