Random Features for Efficient Attention Approximation
Abstract
Transformers are perhaps the most widespread architectures in today's deep learning landscape. However, they do not scale well to long sequences: self-attention has O(L^2) computational complexity in the sequence length L. In this thesis, we propose a holistic approach for reducing this complexity to linear O(L) via an unbiased approximation of the softmax kernel appearing in self-attention, the main component of the Transformer backbone. The resulting efficient Transformer architecture is referred to as Performer. Compared to other developments in the area of efficient Transformers, Performer is theory-grounded: the problem of long-sequence processing is reduced to deriving random features whose properties minimize the variance of the approximation. This thesis describes an evolution of mechanisms for random-feature approximation: from the so-called FAVOR, to FAVOR+, and finally to FAVOR++. The FAVOR++ mechanism has the tightest concentration properties and the best performance in practice. On the way to FAVOR++, we also discuss several extensions of the proposed efficient self-attention mechanism, among them masked self-attention, sub-linear memory Performers, generalized attention, and more. For each proposed method, this thesis provides empirical evaluations in real-life large-scale learning setups together with thorough theoretical analyses and proofs.
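To make the central idea concrete, the sketch below illustrates how an unbiased random-feature estimate of the softmax kernel turns quadratic attention into a linear-time computation. It is a minimal NumPy illustration using positive random features in the spirit of FAVOR+, not the thesis's implementation: the function names, the feature count, and the omission of orthogonal projections, masking, and numerical-stability tricks are simplifying assumptions for exposition.

```python
import numpy as np

def positive_random_features(x, projection):
    # phi(x) = exp(w . x - ||x||^2 / 2) / sqrt(m): positive random features
    # giving an unbiased estimate of the softmax kernel exp(q . k),
    # since E_w[phi(q) . phi(k)] = exp(q . k) for w ~ N(0, I).
    m = projection.shape[0]
    return np.exp(x @ projection.T
                  - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)) / np.sqrt(m)

def linear_attention(Q, K, V, num_features=256, seed=0):
    # Approximate softmax(Q K^T / sqrt(d)) V in O(L * m * d) time and memory
    # instead of the O(L^2 * d) cost of exact self-attention.
    L, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, d))  # orthogonalization omitted here
    scale = d ** -0.25  # so that (scale*q).(scale*k) = q . k / sqrt(d)
    Qp = positive_random_features(Q * scale, W)   # (L, m)
    Kp = positive_random_features(K * scale, W)   # (L, m)
    KV = Kp.T @ V                                 # (m, d_v), computed once
    normalizer = Qp @ Kp.sum(axis=0)              # approximates softmax denominators
    return (Qp @ KV) / normalizer[:, None]
```

Because both the numerator and the denominator of each attention row are estimated with the same features, shared constants cancel, and the estimate concentrates around exact softmax attention as the number of features m grows.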