Random Features for Efficient Attention Approximation




Likhosherstov, Valerii 


Transformers are perhaps the most widespread architectures in today's deep learning landscape. However, they do not scale well to long sequences: self-attention has O(L^2) computational complexity in the sequence length L. In this thesis, we propose a holistic approach for reducing this complexity to linear O(L) via an unbiased approximation of the softmax kernel appearing in self-attention, the main component of the Transformer backbone. We refer to the resulting efficient Transformer architecture as the Performer. Compared to other developments in the area of efficient Transformers, the Performer is theory-grounded: the problem of long-sequence processing reduces to the theoretical derivation of random features with properties that minimize the variance of the approximation. This thesis describes an evolution of mechanisms for random-feature approximation: from the so-called FAVOR, to FAVOR+ and, finally, to FAVOR++. The FAVOR++ mechanism has the tightest concentration properties and the best performance in practice. On the way to FAVOR++, we also discuss several extensions of the proposed efficient self-attention mechanism, among which are masked self-attention, sub-linear-memory Performers, generalized attention, and more. For each proposed method, the thesis contains empirical evaluations in real-life, large-scale learning setups, together with thorough theoretical analyses and proofs.
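The idea behind the random-feature approximation described above can be sketched in a few lines of NumPy. This is a minimal, illustrative sketch of positive random features for the softmax kernel SM(q, k) = exp(q·k) in the style of FAVOR+ (the function name, feature count m, and dimensions below are illustrative assumptions, not taken from the thesis); factoring attention through such features is what turns the O(L^2) attention matrix into O(L) matrix products.

```python
import numpy as np

def positive_random_features(x, W):
    """Positive random features phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m),
    with rows of W drawn i.i.d. from N(0, I_d), so that
    E[phi(q) . phi(k)] = exp(q . k), the softmax kernel (unbiased)."""
    m = W.shape[0]
    return np.exp(x @ W.T - np.sum(x**2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m, L = 16, 4096, 8                      # head dim, feature count, sequence length
W = rng.standard_normal((m, d))            # shared random projection matrix
Q = rng.standard_normal((L, d)) / np.sqrt(d)
K = rng.standard_normal((L, d)) / np.sqrt(d)
V = rng.standard_normal((L, d))

exact = np.exp(Q @ K.T)                    # softmax kernel, materialized in O(L^2)
phi_q = positive_random_features(Q, W)
phi_k = positive_random_features(K, W)
approx = phi_q @ phi_k.T                   # unbiased estimate of `exact`

# Linear-complexity unnormalized attention: associativity lets us compute
# phi(Q) (phi(K)^T V) in O(L m d) without ever forming the L x L matrix.
out_linear = phi_q @ (phi_k.T @ V)
```

With enough features m, `approx` concentrates around `exact`; the FAVOR+ and FAVOR++ mechanisms in the thesis refine the choice of features (e.g. their distribution and structure) precisely to tighten this concentration.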





Weller, Adrian


deep learning, self-attention, Transformer networks


Doctor of Philosophy (PhD)

Awarding Institution

University of Cambridge
Cambridge University Department of Engineering and Cambridge Trust