
Optimizing self-attention and feed-forward transformer blocks to achieve higher performance with lower compute time
The PAR Transformer needs 35% lower compute time than Transformer-XL achieved by replacing 63% of the self-attention blocks with feed-forward blocks, and retains the perplexity on WikiText-103 language modelling benchmark.
Source: PAR paper on arxiv
The transformer optimization streak continues. Google released the Switch Transformer, Microsoft released ZeRO-Offload, Facebook released Data-efficient Image Transformers (DeiT), and now NVIDIA releases PAR, which optimizes the use of attention in transformers.
These companies are concentrating on optimizing transformers because of the huge success transformers have achieved in NLP and, more recently, in image processing. Moreover, these models keep growing in terms of the number of parameters, so several tricks are being introduced to cope with hardware limitations.
The paper focuses on the trade-offs between self-attention and feed-forward building blocks [1]. It tackles one of the areas in ML that I have always been curious about, neural architecture search, but does so in a novel way, because traditional architecture search is quite expensive. For instance, if you are searching a space [1] with 2 options per layer and only 10 layers (a very small network), there are already 2¹⁰ = 1,024 candidate architectures to evaluate, and that number grows exponentially with depth.
For this reason, we explored the use of differential neural architecture search that has linear complexity to redesign the transformer architecture in this paper.
Source: PAR paper on arxiv
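To make the combinatorics concrete, here is a tiny Python snippet (my own illustration, not from the paper) comparing the size of an exhaustive search with the size of a single supernet:

```python
# Illustrative arithmetic only: exhaustive search is exponential in depth,
# while a supernet holds a number of candidate blocks that grows only
# linearly with depth.
choices_per_layer = 2  # e.g. {self-attention, feed-forward}

for n_layers in (10, 32, 64):
    exhaustive = choices_per_layer ** n_layers   # architectures to train one by one
    supernet = choices_per_layer * n_layers      # candidate blocks inside one supernet
    print(f"{n_layers:>2} layers: {exhaustive:,} architectures vs {supernet} supernet blocks")
```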
Their study suggests that self-attention layers are only needed in the first ~2/3 of the network, and that only about 1/5 of the blocks need to be self-attention for large transformers (such as Transformer-XL). They confirm this hypothesis on benchmark datasets and models (such as BERT -> PAR BERT).
Let’s start exploring how they did that
Let’s start with the most common transformer design pattern, the “interleaved” pattern [1]. This means that every self-attention block is followed by a feed-forward block (so typical transformers have an equal number of self-attention and feed-forward blocks).
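As a concrete picture of that layout, here is a minimal PyTorch sketch (my own illustration, not the authors’ code; all class names and hyperparameters are placeholders):

```python
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Pre-norm self-attention sub-block (sketch)."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out  # residual connection

class FeedForwardBlock(nn.Module):
    """Pre-norm feed-forward sub-block (sketch)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))  # residual connection

def interleaved_stack(n_layers, d_model=512, n_heads=8, d_ff=2048):
    """Standard 'sfsf...' layout: every attention block is followed by a feed-forward block."""
    blocks = []
    for _ in range(n_layers):
        blocks.append(SelfAttentionBlock(d_model, n_heads))
        blocks.append(FeedForwardBlock(d_model, d_ff))
    return nn.Sequential(*blocks)
```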
One of the main building blocks of this paper is the SuperNet, which is quite a novel approach to Neural Architecture Search (NAS). If you aren’t familiar with NAS, it is essentially an automated search for the optimal architecture of a neural network, instead of designing that architecture by hand. SuperNets sound super interesting to me; if that’s the case for you too, let me know in the comments and I will write an article explaining how they work.
Since the search also consists of training only one supernet consisting of all the search blocks, it is orders of magnitude faster than RL based search algorithms that rely on training individual combinations of search blocks. For our choice of supernet, the search cost was < 2× the training cost of the baseline. All of our experiments use the same model architecture parameters as the baseline.
Source: PAR paper on arxiv
Their approach falls under differentiable NAS (the search itself is optimized with gradients), which has been shown to be more efficient and reliable than other NAS approaches. It is similar to the FBNet algorithm [1], which tries to find a distribution that models the optimal architecture.
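To give a feel for how such a differentiable search can work, here is a minimal PyTorch sketch (my own illustration of an FBNet-style Gumbel-softmax relaxation, not the authors’ exact implementation). Each supernet layer holds all candidate blocks and learns one architecture weight per candidate, jointly with the regular model weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperNetLayer(nn.Module):
    """One supernet layer: a soft, differentiable mixture over candidate blocks.

    `candidates` could be, e.g., [SelfAttentionBlock(...), FeedForwardBlock(...),
    nn.Identity()] from the sketch above.
    """
    def __init__(self, candidates):
        super().__init__()
        self.candidates = nn.ModuleList(candidates)
        # One architecture logit per candidate, trained by gradient descent.
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x, tau=1.0):
        # Gumbel-softmax gives a near-one-hot but differentiable gate over candidates.
        gate = F.gumbel_softmax(self.alpha, tau=tau)
        return sum(g * block(x) for g, block in zip(gate, self.candidates))
```

After training, each layer keeps only the candidate with the largest architecture weight, which is how the final block pattern is read off.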
One of the best things about this paper is that it attempts to provide a generalizable rule for NAS on transformers, which is quite a difficult task. How do they do that? By performing several well-thought-out experiments and proposing concrete arguments and conclusions. Although this might seem like “conventional research”, tons of ML papers are quite situational, not very generalizable, and difficult to reproduce.
Although self-attention blocks improve accuracy and provide useful contextual information to the network, they are computationally expensive. The research here shows something very important: the need for those self-attention blocks saturates fairly quickly.
They notice that for most architectures it is more efficient, in terms of performance per compute, to have fewer than 50% of the blocks be self-attention; in fact, for Transformer-XL they suggest around 20%. Furthermore, those blocks should appear only in the first ~2/3 of the network (not interleaved throughout it). This results in 63% fewer self-attention blocks and 35% lower compute time [1].
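For illustration, a PAR-style stack could be assembled like this (reusing the SelfAttentionBlock and FeedForwardBlock sketches from above; the fractions are parameters of my sketch, whereas the paper’s actual block counts come out of the search):

```python
import torch.nn as nn

def par_style_stack(n_layers, attn_fraction=0.2, attn_span=2 / 3,
                    d_model=512, n_heads=8, d_ff=2048):
    """Sketch of a PAR-style layout: only `attn_fraction` of the blocks are
    self-attention, and they are spread over the first `attn_span` of the
    stack; every remaining block is feed-forward."""
    n_attn = max(1, round(n_layers * attn_fraction))
    span = max(n_attn, int(n_layers * attn_span))
    # Spread the attention blocks evenly over the first part of the network.
    attn_positions = {round(i * span / n_attn) for i in range(n_attn)}
    blocks = [
        SelfAttentionBlock(d_model, n_heads) if i in attn_positions
        else FeedForwardBlock(d_model, d_ff)
        for i in range(n_layers)
    ]
    return nn.Sequential(*blocks)

# e.g. a 32-block stack with ~20% attention, all of it in the first two-thirds
model = par_style_stack(32)
```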
Validation and experiments
They validate their model by training it on popular datasets and comparing it to successful baseline models. The three datasets used here are WikiText-103, enwik8, and text8. The biggest and most important one is WikiText-103, which consists of over 100 million tokens from articles on Wikipedia [1]. They compare latency on an A100 GPU, and the results are incredible.
Final thoughts and takeaways
I was a bit skeptical about NAS before; the idea of neural networks “creating” other neural networks just didn’t sound compelling enough to me, but I guess I was wrong. NAS is growing much more than I thought, and I think we can all expect to be reading many more papers and articles about NAS approaches.
It’s also great to see that ML researchers are trying to optimize and further study existing approaches before quickly moving on to the next one; I think this detailed analysis of transformer optimization will be quite beneficial in the long run. I was, however, surprised by the sheer number of transformer optimization papers that have been released in the last few weeks.
References:
[1] Swetha Mandava, Szymon Migacz, and Alex Fit-Florea, “Pay Attention when Required” (2020), arXiv preprint.