Random Features for Efficient Attention Approximation
Abstract
Transformers are perhaps the most widespread architectures in today's deep learning landscape. However, they do not scale well to long sequences: self-attention has O(L^2) computational complexity in the sequence length L. In this thesis, we propose a holistic approach for reducing this complexity to linear O(L) via an unbiased approximation of the softmax kernel appearing in self-attention, the main component of the Transformer backbone. The resulting efficient Transformer architecture is referred to as Performer. Compared to other developments in the area of efficient Transformers, Performer is theory-grounded: the problem of long-sequence processing is reduced to deriving random features whose properties minimize the variance of the approximation. This thesis describes an evolution of mechanisms for random-feature approximation: from the so-called FAVOR, to FAVOR+, and finally to FAVOR++. The FAVOR++ mechanism has the tightest concentration properties and the best performance in practice. On the way to FAVOR++, we also discuss several extensions of the proposed efficient self-attention mechanism, among them masked self-attention, sub-linear memory Performers, generalized attention, and more. For each proposed method, this thesis provides empirical evaluations in real-life large-scale learning setups together with thorough theoretical analyses and proofs.
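To make the central idea concrete, the sketch below illustrates how an unbiased random-feature estimate of the softmax kernel turns quadratic attention into a linear-time computation. It is a minimal NumPy illustration using positive random features in the spirit of FAVOR+, not the thesis's implementation: the function names, the feature count, and the omission of orthogonal projections, masking, and numerical-stability tricks are simplifying assumptions for exposition.

```python
import numpy as np

def positive_random_features(x, projection):
    # phi(x) = exp(w . x - ||x||^2 / 2) / sqrt(m): positive random features
    # giving an unbiased estimate of the softmax kernel exp(q . k),
    # since E_w[phi(q) . phi(k)] = exp(q . k) for w ~ N(0, I).
    m = projection.shape[0]
    return np.exp(x @ projection.T
                  - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def linear_attention(Q, K, V, num_features=256, seed=0):
    # Approximate softmax(Q K^T / sqrt(d)) V in O(L * m * d) time and memory
    # instead of the O(L^2 * d) cost of exact self-attention.
    L, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, d))  # orthogonalization omitted here
    scale = d ** -0.25  # so that (scale*q).(scale*k) = q . k / sqrt(d)
    Qp = positive_random_features(Q * scale, W)   # (L, m)
    Kp = positive_random_features(K * scale, W)   # (L, m)
    KV = Kp.T @ V                                 # (m, d_v), computed once
    normalizer = Qp @ Kp.sum(axis=0)              # approximates softmax denominators
    return (Qp @ KV) / normalizer[:, None]
```

Because both the numerator and the denominator of each attention row are estimated with the same features, shared constants cancel, and the estimate concentrates around exact softmax attention as the number of features m grows.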