Transformer (deep learning)

{{Short description|Algorithm for modelling sequential data}} thumb|A standard transformer architecture, showing on the left an encoder, and on the right a decoder. Note: it uses the pre-LN convention, which is different from the post-LN convention used in the original 2017 transformer. {{Machine learning|Neural networks}} In deep learning, the '''transformer''' is a family of artificial neural network architectures based on the multi-head attention mechanism, in which text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table.<ref name="2017_Attention_Is_All_You_Need">{{cite journal |last1=Vaswani |first1=Ashish |author1-link=Ashish Vaswani |last2=Shazeer |first2=Noam |last3=Parmar |first3=Niki |last4=Uszkoreit |first4=Jakob |last5=Jones |first5=Llion |last6=Gomez |first6=Aidan N |author6-link=Aidan Gomez |last7=Kaiser |first7=Łukasz |last8=Polosukhin |first8=Illia |date=2017 |title=Attention is All you Need |url=https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=30}}</ref> At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. 
Because self-attention alone is permutation-invariant, transformers inject positional information, typically through positional encodings or learned positional embeddings, so token order can affect the output.<ref>{{cite conference|last1=Vaswani|first1=Ashish|last2=Shazeer|first2=Noam|last3=Parmar|first3=Niki|last4=Uszkoreit|first4=Jakob|last5=Jones|first5=Llion|last6=Gomez|first6=Aidan N.|last7=Kaiser|first7=Lukasz|last8=Polosukhin|first8=Illia|title=Attention Is All You Need|book-title=Advances in Neural Information Processing Systems|year=2017|url=https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf|access-date=2026-05-05}}</ref>

Transformers have the advantage of having no recurrent units, therefore requiring less training time than earlier recurrent neural architectures (RNNs) such as long short-term memory (LSTM).<ref name="lstm1997">{{cite journal |last1=Hochreiter |first1=Sepp |author-link=Sepp Hochreiter |last2=Schmidhuber |first2=Jürgen |author-link2=Jürgen Schmidhuber |title=Long Short-Term Memory |journal=Neural Computation |date=November 1997 |volume=9 |issue=8 |pages=1735–1780 |doi=10.1162/neco.1997.9.8.1735 |pmid=9377276 }}</ref> Later variations have been widely adopted for training large language models (LLMs) on large (language) datasets.<ref name=":7">{{cite web|url=https://openai.com/blog/better-language-models/|title=Better Language Models and Their Implications|date=2019-02-14|website=OpenAI|access-date=2019-08-25|archive-date=2020-12-19|archive-url=https://web.archive.org/web/20201219132206/https://openai.com/blog/better-language-models/|url-status=live}}</ref> Modern transformer designs are commonly grouped into encoder-only, decoder-only, and encoder-decoder variants, depending on whether they are optimized for representation learning, autoregressive generation, or conditional sequence-to-sequence tasks.<ref>{{cite web|title=Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer|url=https://arxiv.org/abs/1910.10683|website=arXiv|date=2019-10-23|access-date=2026-05-05}}</ref>

The original version of the transformer architecture was proposed in the 2017 paper "Attention Is All You Need" by researchers at Google.<ref name="2017_Attention_Is_All_You_Need" /> The predecessors of transformers were developed as an improvement over previous architectures for machine translation,<ref name="inventors">{{cite arXiv |eprint=1409.0473 |class=cs.CL |last1=Bahdanau |first2=Kyunghyun |last2=Cho |title=Neural Machine Translation by Jointly Learning to Align and Translate |date=September 1, 2014 |last3=Bengio |first3=Yoshua}}</ref><ref name="inventconfirm">{{cite arXiv |eprint=1508.04025 |class=cs.CL |first1=Minh-Thang |last1=Luong |first2=Hieu |last2=Pham |title=Effective Approaches to Attention-based Neural Machine Translation |date=August 17, 2015 |last3=Manning |first3=Christopher D.}}</ref> but have found many applications since. They are used in large-scale natural language processing, computer vision (vision transformers), reinforcement learning,<ref name=":10" /><ref>{{Cite journal |last1=Parisotto |first1=Emilio |last2=Song |first2=Francis |last3=Rae |first3=Jack |last4=Pascanu |first4=Razvan |last5=Gulcehre |first5=Caglar |last6=Jayakumar |first6=Siddhant |last7=Jaderberg |first7=Max |last8=Kaufman |first8=Raphaël Lopez |last9=Clark |first9=Aidan |last10=Noury |first10=Seb |last11=Botvinick |first11=Matthew |last12=Heess |first12=Nicolas |last13=Hadsell |first13=Raia |date=2020-11-21 |title=Stabilizing Transformers for Reinforcement Learning |url=https://proceedings.mlr.press/v119/parisotto20a.html |journal=Proceedings of the 37th International Conference on Machine Learning |language=en |publisher=PMLR |pages=7487–7498}}</ref> audio,<ref name="Robust Speech Recognition via Large-Scale Weak Supervision">{{cite arXiv|eprint=2212.04356 |last1=Radford |first1=Alec |author2=Jong Wook Kim |last3=Xu |first3=Tao |last4=Brockman |first4=Greg |last5=McLeavey |first5=Christine |last6=Sutskever |first6=Ilya |title=Robust Speech Recognition via 
Large-Scale Weak Supervision |year=2022 |class=eess.AS }}</ref> multimodal learning, robotics,<ref>{{Cite journal |last1=Monastirsky |first1=Maxim |last2=Azulay |first2=Osher |last3=Sintov |first3=Avishai |date=February 2023 |title=Learning to Throw With a Handful of Samples Using Decision Transformers |journal=IEEE Robotics and Automation Letters |volume=8 |issue=2 |pages=576–583 |doi=10.1109/LRA.2022.3229266 |bibcode=2023IRAL....8..576M }}</ref> playing chess,<ref name="grandmaster">{{cite arXiv |last1=Ruoss |first1=Anian |last2=Delétang |first2=Grégoire |last3=Medapati |first3=Sourabh |last4=Grau-Moya |first4=Jordi |last5=Wenliang |first5=Li |last6=Catt |first6=Elliot |last7=Reid |first7=John |last8=Genewein |first8=Tim |date=2024-02-07 |title=Grandmaster-Level Chess Without Search |class=cs.LG |eprint=2402.04494v1}}</ref> and disaster response.<ref name="disaster">{{cite journal |last1=Maity |first1=Abhishek |title=CrisisSense: Transforming Social Signals into Real-Time Disaster Awareness via Deep Neural Intelligence |journal=2026 IEEE Madhya Pradesh Section Conference (MPCON) |date=March 2026 |pages=1501–1506 |doi=10.1109/MPCON69668.2026.11508516 |url=https://ieeexplore.ieee.org/document/11508516}}</ref>
It has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs)<ref name="wolf2020">{{cite book|last1=Wolf |first1=Thomas| last2=Debut |first2=Lysandre| last3=Sanh |first3=Victor |last4=Chaumond |first4=Julien| last5=Delangue |first5=Clement| last6=Moi |first6=Anthony |last7=Cistac |first7=Pierric |last8=Rault |first8=Tim |last9=Louf |first9=Remi |last10=Funtowicz |first10=Morgan |last11=Davison |first11=Joe |last12=Shleifer |first12=Sam |last13=von Platen |first13=Patrick |last14=Ma |first14=Clara |last15=Jernite |first15=Yacine |last16=Plu |first16=Julien |last17=Xu |first17=Canwen |last18=Le Scao |first18=Teven |last19=Gugger |first19=Sylvain |last20=Drame |first20=Mariama |last21=Lhoest |first21=Quentin |last22=Rush |first22=Alexander |title=Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations |chapter=Transformers: State-of-the-Art Natural Language Processing |year=2020 |pages=38–45 |doi=10.18653/v1/2020.emnlp-demos.6 }}</ref> and BERT<ref name=":6">{{cite web|url=http://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html|title=Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing|website=Google AI Blog|date=2 November 2018 |access-date=2019-08-25|archive-date=2021-01-13|archive-url=https://web.archive.org/web/20210113211449/https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html|url-status=live}}</ref> (bidirectional encoder representations from transformers).{{TOC limit|3}}

== History ==
{{tone|section|date=February 2026}}
{{See also|Timeline of machine learning}}

=== Predecessors ===
For many years, sequence modelling and generation were done using plain recurrent neural networks (RNNs). A well-cited early example was the Elman network (1990). In theory, the information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens.

A key breakthrough was LSTM (originally described in a 1995 technical report and formally published in 1997),<ref name="lstm1997" />{{NoteTag|Gated recurrent units (2014) further reduced its complexity.}} an RNN that introduced gating mechanisms to mitigate the vanishing gradient problem, allowing efficient learning of long-sequence modelling. One key architectural element was the use of ''multiplicative gating units'', in which the outputs of some neurons modulate the outputs of others. These multiplicative units are conceptually distinct from the additive attention mechanism later introduced for sequence-to-sequence models. <ref>{{cite journal |last1=Feldman |first1=J |last2=Ballard |first2=D |title=Connectionist models and their properties |journal=Cognitive Science |date=September 1982 |volume=6 |issue=3 |pages=205–254 |doi=10.1016/S0364-0213(82)80001-3 }}</ref> Neural networks using multiplicative units were later called ''sigma-pi networks''<ref name="PDP">{{Cite book |last1=Rumelhart |first1=David E. |url=https://stanford.edu/~jlmcc/papers/PDP/Chapter2.pdf |title=Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2 |last2=McClelland |first2=James L. |last3=Hinton |first3=Geoffrey E. |date=1987-07-29 |publisher=Bradford Books |isbn=978-0-262-68053-0 |location=Cambridge, Mass |language=en}}</ref> or ''higher-order networks''.<ref>{{cite journal |last1=Giles |first1=C. Lee |last2=Maxwell |first2=Tom |title=Learning, invariance, and generalization in high-order neural networks |journal=Applied Optics |date=December 1987 |volume=26 |issue=23 |pages=4972–4978 |doi=10.1364/AO.26.004972 |pmid=20523475 }}</ref> LSTM became the standard architecture for long sequence modelling until the 2017 publication of transformers. 
However, LSTM still used sequential processing, like most other RNNs.{{NoteTag|Some architectures, such as RWKV (Receptance Weighted Key Value) or state space models, avoid the issue.}} Specifically, RNNs operate one token at a time from first to last; they cannot operate in parallel over all tokens in a sequence.

Modern transformers overcome this problem, but unlike RNNs, they require computation time that is quadratic in the size of the context window. The linearly scaling fast weight controller (1992) learns to compute a weight matrix for further processing depending on the input.<ref name="transform19922">{{cite journal |last1=Schmidhuber |first1=Jürgen |title=Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks |journal=Neural Computation |date=January 1992 |volume=4 |issue=1 |pages=131–139 |doi=10.1162/neco.1992.4.1.131 }}</ref> One of its two networks has "fast weights" or "dynamic links" (1981).<ref name="malsburg1981">Christoph von der Malsburg: The correlation theory of brain function. Internal Report 81-2, MPI Biophysical Chemistry, 1981. http://cogprints.org/1380/1/vdM_correlation.pdf See Reprint in Models of Neural Networks II, chapter 2, pages 95–119. Springer, Berlin, 1994.</ref><ref name="feldman1982">{{cite journal |last1=Feldman |first1=Jerome A. |title=Dynamic connections in neural networks |journal=Biological Cybernetics |date=December 1982 |volume=46 |issue=1 |pages=27–39 |doi=10.1007/BF00335349 |pmid=6307398 }}</ref><ref>{{Cite journal |last1=Hinton |first1=Geoffrey E. |last2=Plaut |first2=David C. 
|date=1987 |title=Using Fast Weights to Deblur Old Memories |url=https://escholarship.org/uc/item/0570j1dp |journal=Proceedings of the Annual Meeting of the Cognitive Science Society |language=en |volume=9}}</ref> A slow neural network learns by gradient descent to generate keys and values for computing the weight changes of the fast neural network which computes answers to queries.<ref name="transform19922"/> This was later shown to be equivalent to the unnormalized linear transformer.<ref name="fastlinear20202">{{cite conference |last1=Katharopoulos |first1=Angelos |last2=Vyas |first2=Apoorv |last3=Pappas |first3=Nikolaos |last4=Fleuret |first4=François |date=2020 |title=Transformers are RNNs: Fast autoregressive Transformers with linear attention |url=https://proceedings.mlr.press/v119/katharopoulos20a.html |publisher=PMLR |pages=5156–5165 |book-title=ICML 2020}}</ref><ref name="schlag20212">{{cite conference |last1=Schlag |first1=Imanol |last2=Irie |first2=Kazuki |last3=Schmidhuber |first3=Jürgen |author-link3=Juergen Schmidhuber |date=2021 |title=Linear Transformers Are Secretly Fast Weight Programmers |url= https://icml.cc/virtual/2021/spotlight/10588 |publisher=Springer |pages=9355–9366 |book-title=ICML 2021}}</ref>
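
The equivalence between additive fast-weight updates and unnormalized linear attention can be illustrated with a minimal Python sketch. It assumes that the keys, values and query are already given as plain lists (in the 1992 system they would be generated by the slow network), and the function names are hypothetical rather than taken from the cited papers.

```python
def fast_weight_readout(keys, values, query):
    """Accumulate outer products v k^T into a fast weight matrix W, then return W q."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    for k, v in zip(keys, values):
        # Additive fast-weight update: W <- W + v k^T
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += v[i] * k[j]
    # The fast network answers the query with a matrix-vector product.
    return [sum(w * q for w, q in zip(row, query)) for row in W]


def linear_attention_readout(keys, values, query):
    """Unnormalized linear attention: sum over i of (k_i . q) * v_i, with no softmax."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = sum(a * b for a, b in zip(k, query))
        out = [o + score * vi for o, vi in zip(out, v)]
    return out


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
query = [2.0, 3.0]
print(fast_weight_readout(keys, values, query))
print(linear_attention_readout(keys, values, query))
```

Both functions compute the same vector, because multiplying the accumulated matrix by the query distributes over the sum of outer products. The fast-weight form stores a fixed-size matrix regardless of sequence length, which is why its cost grows linearly with the number of tokens.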

=== Attention with seq2seq === {{Main|Seq2seq#History}} The idea of encoder–decoder sequence transduction had been developed in the early 2010s; commonly cited as the originators that produced seq2seq are two concurrently published papers from 2014.<ref name=":22">{{Cite book |last1=Cho |first1=Kyunghyun |last2=van Merriënboer |first2=Bart |last3=Gulcehre |first3=Caglar |last4=Bahdanau |first4=Dzmitry |last5=Bougares |first5=Fethi |last6=Schwenk |first6=Holger |last7=Bengio |first7=Yoshua |chapter=Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation |date=October 2014 |editor-last=Moschitti |editor-first=Alessandro |editor2-last=Pang |editor2-first=Bo |editor3-last=Daelemans |editor3-first=Walter |title=Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) |chapter-url=https://aclanthology.org/D14-1179 |location=Doha, Qatar |publisher=Association for Computational Linguistics |pages=1724–1734 |doi=10.3115/v1/D14-1179|arxiv=1406.1078 }}</ref><ref name="sequence">{{cite arXiv |eprint=1409.3215 |class=cs.CL |first1=Ilya |last1=Sutskever |first2=Oriol |last2=Vinyals |title=Sequence to sequence learning with neural networks |date=14 Dec 2014 |last3=Le |first3=Quoc Viet}} [first version posted to arXiv on 10 Sep 2014]</ref>{{or-inline|date=February 2026|reason=who says this is 'commonly cited as the originators'? Please cite reliable, independent sources for this information}}

One 380M-parameter model for machine translation used two long short-term memory (LSTM) networks.<ref name="sequence" /> Its architecture consists of two parts. The ''encoder'' is an LSTM that takes in a sequence of tokens and turns it into a vector. The ''decoder'' is another LSTM that converts the vector into a sequence of tokens. Similarly, another 130M-parameter model used gated recurrent units (GRU) instead of LSTM.<ref name=":22" /> Later research showed that GRUs are neither better nor worse than LSTMs for seq2seq.<ref name="MyUser_Arxiv.org_May_18_2016c">{{cite arXiv |eprint=1412.3555 |class=cs.NE |first1=Junyoung |last1=Chung |first2=Caglar |last2=Gulcehre |title=Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling |last3=Cho |first3=KyungHyun |last4=Bengio |first4=Yoshua |year=2014}}</ref><ref name="gruber_jockisch">{{cite journal |last1=Gruber |first1=Nicole |last2=Jockisch |first2=Alfred |title=Are GRU Cells More Specific and LSTM Cells More Sensitive in Motive Classification of Text? |journal=Frontiers in Artificial Intelligence |date=30 June 2020 |volume=3 |article-number=40 |doi=10.3389/frai.2020.00040 |pmid=33733157 |pmc=7861254 |doi-access=free }}</ref>

These early seq2seq models had no attention mechanism, and the state vector was accessible only after the ''last'' word of the source text had been processed. Although in theory such a vector retains the information about the whole original sentence, in practice the information is poorly preserved. This is because the input is processed sequentially by one recurrent network into a ''fixed''-size output vector, which is then processed by another recurrent network into an output. If the input is long, the output vector cannot hold all the relevant information, which degrades the output. As evidence, reversing the input sentence improved seq2seq translation.<ref>{{Cite journal |last1=Sutskever |first1=Ilya |last2=Vinyals |first2=Oriol |last3=Le |first3=Quoc V |date=2014 |title=Sequence to Sequence Learning with Neural Networks |url=https://proceedings.neurips.cc/paper/2014/hash/a14ac55a4f27472c5d894ec1c3c743d2-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=27|arxiv=1409.3215 }}</ref>

The ''RNN search'' model introduced an attention mechanism to seq2seq for machine translation to solve the bottleneck problem (of the ''fixed-size'' output vector), allowing the model to process long-distance dependencies more easily. The name reflects that the model "emulates searching through a source sentence during decoding a translation".<ref name="inventors" />

A 2015 study compared global attention (that of ''RNN search'') with local (sliding-window) attention for machine translation, finding that mixed attention had higher quality than global attention, while local attention reduced translation time.<ref>{{Cite arXiv |eprint=1508.04025 |class=cs.CL |first1=Minh-Thang |last1=Luong |first2=Hieu |last2=Pham |title=Effective Approaches to Attention-based Neural Machine Translation |date=2015 |last3=Manning |first3=Christopher D.}}</ref>

In 2016, Google Translate was revamped to Google Neural Machine Translation, which replaced the previous model based on statistical machine translation. The new model was a seq2seq model where the encoder and the decoder were both 8 layers of bidirectional LSTM.<ref name="Y4moj">{{cite arXiv |eprint=1609.08144 |class=cs.CL |first1=Yonghui |last1=Wu |first2=Mike |last2=Schuster |title=Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation |date=2016-09-01 |display-authors=1 |last3=Chen |first3=Zhifeng |last4=Le |first4=Quoc V. |last5=Norouzi |first5=Mohammad |last6=Macherey |first6=Wolfgang |last7=Krikun |first7=Maxim |last8=Cao |first8=Yuan |last9=Gao |first9=Qin |last10=Macherey |first10=Klaus |last11=Klingner |first11=Jeff |last12=Shah |first12=Apurva |last13=Johnson |first13=Melvin |last14=Liu |first14=Xiaobing |last15=Kaiser |first15=Łukasz}}</ref> It took nine months to develop, and it outperformed the statistical approach, which took ten years to develop.<ref name="UJDu8">{{cite news |last=Lewis-Kraus |first=Gideon |date=2016-12-14 |title=The Great A.I. Awakening |url=https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html |archive-url=https://web.archive.org/web/20230524052626/https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html |archive-date=24 May 2023 |access-date=2023-06-22 |work=The New York Times }}</ref>

=== Parallelizing attention ===
{{main|Attention (machine learning)#History}}
Seq2seq models with attention (including self-attention) still suffered from the same problem as other recurrent networks: they are hard to parallelize, which prevented them from being accelerated on GPUs. In 2016, ''decomposable attention'' applied a self-attention mechanism to feedforward networks, which are easy to parallelize, and achieved a state-of-the-art result in textual entailment with an order of magnitude fewer parameters than LSTMs.<ref>{{cite arXiv|last1=Parikh |first1=Ankur P. |title=A Decomposable Attention Model for Natural Language Inference |date=2016-09-25 |eprint=1606.01933 |last2=Täckström |first2=Oscar |last3=Das |first3=Dipanjan |last4=Uszkoreit |first4=Jakob|class=cs.CL }}</ref> One of its authors, Jakob Uszkoreit, suspected that attention ''without'' recurrence would be sufficient for language translation, thus the title "attention is ''all'' you need".<ref name=":11">{{Cite magazine |last=Levy |first=Steven |title=8 Google Employees Invented Modern AI. Here's the Inside Story |url=https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/ |url-status=live |archive-url=https://web.archive.org/web/20240320101528/https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/ |archive-date=20 Mar 2024 |access-date=2024-08-06 |magazine=Wired }}</ref> That hypothesis was against conventional wisdom at the time, and even his father Hans Uszkoreit, a well-known computational linguist, was skeptical.<ref name=":11" /> In the same year, self-attention (called ''intra-attention'' or ''intra-sentence attention'') was proposed for LSTMs.<ref>{{Cite book |last1=Cheng |first1=Jianpeng |title=Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing |last2=Dong |first2=Li |last3=Lapata |first3=Mirella |date=November 2016 |publisher=Association for Computational Linguistics |editor-last=Su |editor-first=Jian |location=Austin, Texas |pages=551–561 |chapter=Long Short-Term Memory-Networks for Machine Reading |doi=10.18653/v1/D16-1053 |editor2-last=Duh |editor2-first=Kevin |editor3-last=Carreras |editor3-first=Xavier |chapter-url=https://aclanthology.org/D16-1053/}}</ref>

On 2017-06-12, the original (100M-parameter) encoder–decoder transformer model was published in the "Attention is all you need" paper. At the time, the focus of the research was on improving seq2seq for machine translation, by removing its recurrence to process all tokens in parallel, but preserving its dot-product attention mechanism to keep its text processing performance.<ref name="2017_Attention_Is_All_You_Need" /> This led to the introduction of a multi-head attention model that was easier to parallelize due to the use of independent heads and the lack of recurrence. Its parallelizability was an important factor to its widespread use in large neural networks.<ref>{{Citation |last1=Peng |first1=Bo |title=RWKV: Reinventing RNNs for the transformer Era |date=2023-12-10 |arxiv=2305.13048 |last2=Alcaide |first2=Eric |last3=Anthony |first3=Quentin |last4=Albalak |first4=Alon |last5=Arcadinho |first5=Samuel |last6=Biderman |first6=Stella |last7=Cao |first7=Huanqi |last8=Cheng |first8=Xin |last9=Chung |first9=Michael}}</ref>

=== AI boom era ===
{{anchor|Transformer boom}}As early as spring 2017, even before the "Attention is all you need" preprint was published, one of the co-authors applied the "decoder-only" variation of the architecture to generate fictitious Wikipedia articles.<ref>{{Cite magazine |last=Marche |first=Stephen |date=2024-08-23 |title=Was Linguistic A.I. Created by Accident? |url=https://www.newyorker.com/science/annals-of-artificial-intelligence/was-linguistic-ai-created-by-accident |access-date=2024-08-27 |magazine=The New Yorker }}</ref> Transformer architecture is now used alongside many generative models that contribute to the ongoing AI boom.

The "reference implementation" of the original Transformer was written in a TensorFlow library.<ref>{{Cite journal |last1=Vaswani |first1=Ashish |last2=Bengio |first2=Samy |last3=Brevdo |first3=Eugene |last4=Chollet |first4=Francois |last5=Gomez |first5=Aidan |last6=Gouws |first6=Stephan |last7=Jones |first7=Llion |last8=Kaiser |first8=Łukasz |last9=Kalchbrenner |first9=Nal |last10=Parmar |first10=Niki |last11=Sepassi |first11=Ryan |last12=Shazeer |first12=Noam |last13=Uszkoreit |first13=Jakob |date=March 2018 |editor-last=Cherry |editor-first=Colin |editor2-last=Neubig |editor2-first=Graham |title=Tensor2Tensor for Neural Machine Translation |url=https://aclanthology.org/W18-1819/ |journal=Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track) |location=Boston, MA |publisher=Association for Machine Translation in the Americas |pages=193–199}}</ref><ref>{{cite web |last=Kaiser |first=Łukasz |date=2017-06-19 |title=Accelerating Deep Learning Research with the Tensor2Tensor Library |url=https://research.google/blog/accelerating-deep-learning-research-with-the-tensor2tensor-library/ |website=Google Research Blog }}</ref> In language modelling, ELMo (2018) was a bi-directional LSTM that produces contextualized word embeddings, improving upon the line of research from bag of words and word2vec. 
It was followed by BERT (2018), an encoder-only transformer model.<ref name=":03">{{cite arXiv |eprint=1810.04805v2 |class=cs.CL |first1=Jacob |last1=Devlin |first2=Ming-Wei |last2=Chang |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=11 October 2018 |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina}}</ref> In October 2019, Google started using BERT to process search queries.<ref>{{Cite web |date=2020-10-15 |title=Google: BERT now used on almost every English query |url=https://searchengineland.com/google-bert-used-on-almost-every-english-query-342193 |access-date=2020-11-24 |website=Search Engine Land}}</ref> In 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model by a transformer-encoder–RNN-decoder model.<ref name="gtrans">{{Cite web |date=June 8, 2020 |title=Recent Advances in Google Translate |url=https://research.google/blog/recent-advances-in-google-translate/ |first1=Isaac |last1=Caswell |first2=Bowen |last2=Liang |url-status=live |archive-url=https://web.archive.org/web/20240704042433/https://research.google/blog/recent-advances-in-google-translate/ |archive-date=4 Jul 2024 |access-date=2024-08-07 |website=Google Research |language=en}}</ref>

Starting in 2018, the OpenAI GPT series of decoder-only transformers became state of the art in natural language generation. At the end of 2022, ChatGPT, a chatbot based on a fine-tuned variant of GPT-3.5, became unexpectedly<ref>{{Cite web |title=The inside story of how ChatGPT was built from the people who made it |url=https://www.technologyreview.com/2023/03/03/1069311/inside-story-oral-history-how-chatgpt-built-openai/ |access-date=2024-08-06 |website=MIT Technology Review |language=en}}</ref><ref>{{Cite web |title=Introducing ChatGPT |url=https://openai.com/index/chatgpt/ |website=OpenAI |date=2022-11-30 |access-date=2026-05-16}}</ref> popular, triggering a boom around large language models.<ref name="gpt12">{{cite web |date=June 11, 2018 |title=Improving language understanding with unsupervised learning |url=https://openai.com/research/language-unsupervised |url-status=live |archive-url=https://web.archive.org/web/20230318210736/https://openai.com/research/language-unsupervised |archive-date=2023-03-18 |access-date=2023-03-18 |website=openai.com}}</ref><ref name="ngEG3">{{Citation |title=finetune-transformer-lm |date=June 11, 2018 |url=https://github.com/openai/finetune-transformer-lm |access-date=2023-05-01 |publisher=OpenAI}}</ref>

Transformers have been applied in modalities beyond text. Four days after the publication of "Attention is All You Need", a multimodal transformer architecture, MultiModel, was published by most authors of that paper.<ref>{{Cite arXiv |last1=Kaiser|first1=Lukasz|last2=Gomez|first2=Aidan N.|last3=Shazeer|first3=Noam|last4=Vaswani|first4=Ashish|last5=Parmar|first5=Niki|last6=Jones|first6=Llion|last7=Uszkoreit|first7=Jakob|date=2017-06-16|title=One Model To Learn Them All|class=cs.LG |eprint=1706.05137v1 |language=en}}</ref> Other examples include the vision transformer,<ref name="auto2">{{cite arXiv |eprint=2010.11929 |class=cs.CV |first1=Alexey |last1=Dosovitskiy |first2=Lucas |last2=Beyer |title=An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale |date=2021-06-03 |last3=Kolesnikov |first3=Alexander |last4=Weissenborn |first4=Dirk |last5=Zhai |first5=Xiaohua |last6=Unterthiner |first6=Thomas |last7=Dehghani |first7=Mostafa |last8=Minderer |first8=Matthias |last9=Heigold |first9=Georg |last10=Gelly |first10=Sylvain |last11=Uszkoreit |first11=Jakob}}</ref> speech recognition,<ref name="Gulati2020" /> robotics,<ref name=":10">{{Citation |last1=Chen |first1=Lili |title=Decision Transformer: Reinforcement Learning via Sequence Modeling |date=2021-06-24 |arxiv=2106.01345 |last2=Lu |first2=Kevin |last3=Rajeswaran |first3=Aravind |last4=Lee |first4=Kimin |last5=Grover |first5=Aditya |last6=Laskin |first6=Michael |last7=Abbeel |first7=Pieter |last8=Srinivas |first8=Aravind |last9=Mordatch |first9=Igor}}</ref> and multimodal.<ref name="choromanski2020">{{Citation |last1=Choromanski |first1=Krzysztof |title=Rethinking Attention with Performers |date=2022-11-19 |arxiv=2009.14794 |last2=Likhosherstov |first2=Valerii |last3=Dohan |first3=David |last4=Song |first4=Xingyou |last5=Gane |first5=Andreea |last6=Sarlos |first6=Tamas |last7=Hawkins |first7=Peter |last8=Davis |first8=Jared |last9=Mohiuddin |first9=Afroz}}</ref> The vision transformer, in turn, 
stimulated new developments in convolutional neural networks.<ref>{{Cite conference |last1=Liu |first1=Zhuang |last2=Mao |first2=Hanzi |last3=Wu |first3=Chao-Yuan |last4=Feichtenhofer |first4=Christoph |last5=Darrell |first5=Trevor |last6=Xie |first6=Saining |date=2022 |conference=Conference on Computer Vision and Pattern Recognition (CVPR) |title=A ConvNet for the 2020s |url=https://openaccess.thecvf.com/content/CVPR2022/html/Liu_A_ConvNet_for_the_2020s_CVPR_2022_paper.html |language=en |pages=11976–11986}}</ref> Image and video generators like DALL-E (2021), Stable Diffusion 3 (2024),<ref name=":62">{{Citation |last1=Esser |first1=Patrick |title=Scaling Rectified Flow Transformers for High-Resolution Image Synthesis |date=2024-03-05 |arxiv=2403.03206 |last2=Kulal |first2=Sumith |last3=Blattmann |first3=Andreas |last4=Entezari |first4=Rahim |last5=Müller |first5=Jonas |last6=Saini |first6=Harry |last7=Levi |first7=Yam |last8=Lorenz |first8=Dominik |last9=Sauer |first9=Axel}}</ref> and Sora (2024), use transformers to analyse input data (like text prompts) by breaking it down into "tokens" and then calculating the relevance between each token using self-attention, which helps the model understand the context and relationships within the data.

== Training ==
=== Methods for stabilizing training ===
The plain transformer architecture had difficulty converging. In the original paper,<ref name="2017_Attention_Is_All_You_Need" /> the authors recommended using learning rate warmup. That is, the learning rate should scale up linearly from 0 to its maximal value over the first part of training (usually recommended to be 2% of the total number of training steps), before decaying again.
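
A warmup schedule of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name, the peak learning rate and the inverse-square-root decay after warmup are choices made for the example (the decay has the same general shape as the schedule in the original paper), and other decay rules are also used in practice.

```python
def warmup_then_decay(step, total_steps, peak_lr, warmup_fraction=0.02):
    """Learning rate at a given step: linear warmup from 0, then inverse-sqrt decay.

    warmup_fraction=0.02 mirrors the "2% of training steps" rule of thumb;
    peak_lr and the decay rule are illustrative assumptions.
    """
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup period.
        return peak_lr * step / warmup_steps
    # After warmup, decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_steps / step) ** 0.5


# With 1000 total steps the warmup lasts 20 steps.
for s in (0, 10, 20, 80, 1000):
    print(s, warmup_then_decay(s, total_steps=1000, peak_lr=1e-3))
```

The rate reaches its peak exactly at the end of warmup and is continuous there, since both branches evaluate to `peak_lr` at that step.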

A 2020 paper found that using layer normalization ''before'' (instead of after) the multihead attention and feedforward layers stabilizes training and removes the need for learning rate warmup.<ref name="auto1">{{cite arXiv |eprint=2002.04745 |class=cs.LG |first1=Ruibin |last1=Xiong |first2=Yunchang |last2=Yang |title=On Layer Normalization in the Transformer Architecture |date=2020-06-29 |last3=He |first3=Di |last4=Zheng |first4=Kai |last5=Zheng |first5=Shuxin |last6=Xing |first6=Chen |last7=Zhang |first7=Huishuai |last8=Lan |first8=Yanyan |last9=Wang |first9=Liwei |last10=Liu |first10=Tie-Yan}}</ref> This "pre-LN Transformer" convention is now more commonly used than the original "post-LN Transformer".
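
The difference between the two conventions lies only in where the normalization sits relative to the residual connection. The following minimal Python sketch illustrates it on a single vector; it assumes a layer normalization without learned gain and bias, and `sublayer` is a hypothetical stand-in for an attention or feedforward layer.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance.

    Simplification: the learned gain and bias parameters are omitted.
    """
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def post_ln_block(x, sublayer):
    """Post-LN (original 2017 convention): LayerNorm(x + sublayer(x))."""
    return layer_norm([xi + si for xi, si in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre-LN: x + sublayer(LayerNorm(x)); the residual path is left untouched."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(h):
    # Hypothetical sublayer used only for this illustration.
    return [2.0 * hi for hi in h]


x = [1.0, 2.0, 3.0, 4.0]
print(post_ln_block(x, toy_sublayer))
print(pre_ln_block(x, toy_sublayer))
```

In the pre-LN form the input `x` passes through the block unchanged along the residual path, whereas in the post-LN form every block renormalizes the sum.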

=== Pretrain-finetune ===
Transformers typically are first pretrained by self-supervised learning on a large generic dataset, followed by supervised fine-tuning on a small task-specific dataset. The pretrain dataset is typically an unlabeled large corpus, such as The Pile. Tasks for pretraining and fine-tuning commonly include:
* language modeling<ref name=":6" />
* next-sentence prediction<ref name=":6" />
* question answering<ref name=":7" />
* reading comprehension
* sentiment analysis<ref name="2017_Attention_Is_All_You_Need" />
* paraphrasing<ref name="2017_Attention_Is_All_You_Need" />

The T5 transformer report<ref name="Raffel Shazeer 2020 Exploring the Limits">{{cite journal |last1=Raffel |first1=Colin |last2=Shazeer |first2=Noam |last3=Roberts |first3=Adam |last4=Lee |first4=Katherine |last5=Narang |first5=Sharan |last6=Matena |first6=Michael |last7=Zhou |first7=Yanqi |last8=Li |first8=Wei |last9=Liu |first9=Peter J. |title=Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer |journal=Journal of Machine Learning Research |date=2020 |volume=21 |issue=140 |pages=1–67 |url=https://www.jmlr.org/papers/v21/20-074.html }}</ref> documents a large number of natural language pretraining tasks. Some examples are:

* restoring or repairing incomplete or corrupted text. For example, the input, ''"Thank you{{nnbsp|~~}}me to your party{{nnbsp|~~}}week",'' might generate the output, ''"Thank you '''for inviting''' me to your party '''last''' week".''
* translation between natural languages (machine translation)
* judging the pragmatic acceptability of natural language. For example, the following sentence might be judged "not acceptable",<ref>{{cite arXiv | eprint=1910.10683 | last1=Raffel | first1=Colin | last2=Shazeer | first2=Noam | last3=Roberts | first3=Adam | last4=Lee | first4=Katherine | last5=Narang | first5=Sharan | last6=Matena | first6=Michael | last7=Zhou | first7=Yanqi | last8=Li | first8=Wei | last9=Liu | first9=Peter J. | title=Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer | date=2019 | class=cs.LG }}</ref> because even though it is syntactically well-formed, it is improbable in ordinary human usage: ''The course is jumping well.''

Note that while each of these tasks is trivial or obvious for human native speakers of the language (or languages), such tasks typically proved challenging for previous generations of machine learning architectures.

=== Tasks ===
{{See also|Large language model#Evaluation}}
In general, there are three classes of language modelling tasks: "masked",<ref name=":5">{{Cite web |title=Masked language modeling |url=https://huggingface.co/docs/transformers/tasks/masked_language_modeling |access-date=2023-10-05 |website=huggingface.co}}</ref> "autoregressive",<ref name=":8">{{Cite web |title=Causal language modeling |url=https://huggingface.co/docs/transformers/tasks/language_modeling |access-date=2023-10-05 |website=huggingface.co}}</ref> and "prefixLM".<ref name=":4" /> These classes are independent of any specific modeling architecture such as the transformer, but they are often discussed in the context of transformers.

In a masked task,<ref name=":5" /> one or more of the tokens is masked out, and the model produces a probability distribution predicting what the masked-out tokens are based on the context. The loss function for the task is typically the sum of log-perplexities for the masked-out tokens: <math display="block">\text{Loss} = -\sum_{t\in\text{masked tokens}}\ln(\text{probability of }t\text{ conditional on its context}) </math> and the model is trained to minimize this loss function. The BERT series of models are trained for masked token prediction and another task, next-sentence prediction.
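The loss above can be illustrated with a small Python sketch. The helper name <code>masked_lm_loss</code> and the toy probability table are hypothetical; in practice the probabilities would come from the model's output layer.

```python
import math


def masked_lm_loss(predicted_probs, target_ids, masked_positions):
    """Sum of negative log-probabilities of the true tokens at the masked positions."""
    return -sum(
        math.log(predicted_probs[t][target_ids[t]]) for t in masked_positions
    )


# Toy example: 3 positions, a vocabulary of 4 tokens, and only position 1 masked.
# Each row is the model's predicted distribution for that position.
probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.25, 0.5, 0.125, 0.125],
    [0.1, 0.1, 0.1, 0.7],
]
targets = [0, 1, 3]  # true token identifiers
loss = masked_lm_loss(probs, targets, masked_positions=[1])
```

Only the masked position contributes: the model assigned probability 0.5 to the true token there, so the loss is ln 2.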

In an autoregressive task,<ref name=":8" /> the entire sequence is masked at first, and the model produces a probability distribution for the first token. Then the first token is revealed and the model predicts the second token, and so on. The loss function for the task is still typically the same. The GPT series of models are trained by autoregressive tasks.

In a prefixLM task,<ref name=":4" /> the sequence is divided into two parts. The first part is presented as context, and the model predicts the first token of the second part. Then that would be revealed, and the model predicts the second token, and so on. The loss function for the task is still typically the same. The T5 series of models are trained by prefixLM tasks.
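The difference between the three classes lies mainly in which positions contribute to the loss. The following Python sketch makes this explicit; the function <code>scored_positions</code> and its parameters are hypothetical names introduced for illustration.

```python
def scored_positions(task, seq_len, masked=(), prefix_len=0):
    """Return the (0-indexed) positions whose predictions enter the loss."""
    if task == "masked":
        # Only the masked-out tokens are predicted.
        return sorted(masked)
    if task == "autoregressive":
        # Every token is predicted from the tokens before it.
        return list(range(seq_len))
    if task == "prefixLM":
        # The first part is given as context; only the second part is predicted.
        return list(range(prefix_len, seq_len))
    raise ValueError(f"unknown task: {task}")


masked_case = scored_positions("masked", 6, masked={4, 1})
autoregressive_case = scored_positions("autoregressive", 6)
prefix_case = scored_positions("prefixLM", 6, prefix_len=2)
```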

Note that "masked" as in "masked language modelling" is not "masked" as in "masked attention", and "prefixLM" as in "prefix language modeling" is not "prefixLM" as in "prefix language model".

== Architecture ==
All transformers have the same primary components:
* Tokenizers, which convert text into tokens.
* Embedding layer, which converts tokens and positions of the tokens into vector representations.
* Transformer layers, which carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.
* Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.
The following description follows exactly the transformer as described in the original paper. There are variants, described in the following section.

By convention, we write all vectors as row vectors. For example, pushing a vector through a linear layer means multiplying it by a weight matrix on the right, as <math>xW</math>.

=== Tokenization ===
As the transformer architecture natively consists of operations over numbers (matrix multiplications, dot products, activation functions) rather than over text, there must first be a mapping from any input text to some numerical representation. This happens in three steps.

First, the input text is treated by a ''preprocessor'', which performs both textual transformations and splits the text into coarse-grained segments called ''pretokens''. The latter is referred to as ''pretokenization''. Second, each pretoken is segmented further into ''tokens'' by a ''tokenizer'' that expects to only see pretokens output by its preprocessor. Each token it produces is a string of one or more characters belonging to a finite set of strings called the ''vocabulary'' <math>V</math>. Third, because the vocabulary is finite and known beforehand, each token can be assigned an integer identifier, and this mapping is applied to the sequence of tokens to represent any input text as a numerical sequence. Since this mapping is bijective, the output side can produce a sequence of integer identifiers which can then be turned back into tokens. After undoing some of the preprocessing, the result is again legible text.

Training a tokenizer (sometimes referred to as ''vocabularization'') means finding a suitable vocabulary <math>V</math>, but also learning how to use it, since any given string <math>s</math> of length <math>|s|</math> has <math>2^{|s|-1}</math> hypothetical segmentations, some of which contain segments that are not in the vocabulary. The most important hyperparameter during vocabularization is the ''vocabulary size'' <math>|V|</math>: when it is small, the learned vocabulary generally consists of characters and smaller strings, and words will be segmented into many tokens. At larger sizes, it becomes affordable to dedicate tokens to full words, although depending on the preprocessor and tokenizer, it is not necessarily the case that large vocabularies will always use the largest token(s) available to segment a word.
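The count of <math>2^{|s|-1}</math> segmentations follows from the fact that each of the <math>|s|-1</math> gaps between adjacent characters is either a cut or not. A short Python sketch enumerates them for a four-character string; the toy vocabulary is a hypothetical example.

```python
from itertools import product


def segmentations(s):
    """Yield every split of the non-empty string s into contiguous pieces."""
    if not s:
        return
    # Each of the len(s) - 1 gaps between characters is either a cut or not.
    for cuts in product([False, True], repeat=len(s) - 1):
        pieces, start = [], 0
        for gap, is_cut in enumerate(cuts, start=1):
            if is_cut:
                pieces.append(s[start:gap])
                start = gap
        pieces.append(s[start:])
        yield pieces


vocab = {"c", "a", "t", "s", "cat"}  # hypothetical toy vocabulary
all_segs = list(segmentations("cats"))
# Keep only segmentations whose pieces all belong to the vocabulary.
in_vocab = [seg for seg in all_segs if all(piece in vocab for piece in seg)]
```

Of the eight segmentations of "cats", only two use vocabulary entries exclusively ("cat" + "s", and the four single characters), which is why a tokenizer also needs a rule for choosing among the candidates.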

Because tokens are not always full words, they may also be referred to as ''subwords'' and tokenization algorithms may be referred to as ''subword tokenizers''. This is also to differentiate these systems from traditional terminology used in older information retrieval and natural language processing systems, where "tokenization" was used to denote what is today called "pretokenization" (very crudely: splitting into words). In tokenizers that produce tokens that are ''not'' part of the vocabulary, a special token that does belong to the vocabulary is used as a generic stand-in, written as "[UNK]" for "unknown". In principle, any string could be hidden by such an [UNK]. Indeed, in information retrieval, pretokenizers were themselves used as tokenizers (and also called "tokenizers") with a word-level vocabulary that contained an [UNK].

Commonly used subword tokenization algorithms are byte pair encoding (BPE) and the unigram language model (ULM), which each include a vocabularization algorithm and a dedicated segmentation algorithm. There also exist several segmentation algorithms that require no learning and can be applied given a vocabulary (produced by BPE or ULM, for example), like greedily recognising tokens in a pretoken by moving through it left-to-right. Well-known software implementations of subword tokenizers are Hugging Face's <code>tokenizers</code> Python package implemented in Rust, and the <code>sentencepiece</code> Python package implemented in C++. The latter package is named as such because one of its configuration options allows disabling the built-in pretokenizer, hence effectively making entire sentences a pretoken and thus having the tokenizer see entire sentences, rather than individual words.
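The greedy left-to-right segmentation mentioned above, together with the mapping from tokens to integer identifiers, can be sketched in Python as follows. This is a simplified illustration and not the behaviour of any particular library: the vocabulary is invented, whitespace splitting stands in for the pretokenizer, and emitting one "[UNK]" per unmatched character is an assumption made for the example.

```python
def greedy_tokenize(pretoken, vocab, unk="[UNK]"):
    """Segment a pretoken by repeatedly taking the longest vocabulary match."""
    tokens, i = [], 0
    while i < len(pretoken):
        # Try the longest candidate first, then shorter ones.
        for j in range(len(pretoken), i, -1):
            if pretoken[i:j] in vocab:
                tokens.append(pretoken[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: emit the stand-in token.
            tokens.append(unk)
            i += 1
    return tokens


# Hypothetical toy vocabulary; the list index serves as the integer identifier.
vocab = ["[UNK]", "un", "break", "able", "a", "b", "e", "l"]
token_to_id = {token: idx for idx, token in enumerate(vocab)}

pretokens = "unbreakable table".split()  # crude whitespace pretokenization
tokens = [tok for p in pretokens for tok in greedy_tokenize(p, token_to_id)]
ids = [token_to_id[tok] for tok in tokens]
```

Here "unbreakable" is segmented into "un", "break", "able", while the "t" of "table" has no vocabulary entry and is replaced by "[UNK]".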

=== Embedding ===
{{Further|Word embedding}}
Each integer token identifier is converted into an embedding vector via a lookup table. Equivalently stated, it multiplies a one-hot representation of the token identifier by an embedding matrix <math>M</math>. For example, if the input token's identifier is <math>3</math>, then the one-hot representation is <math>[0, 0, 0, 1, 0, 0, \dots]</math>, and its embedding vector is<math display="block">\mathrm{Embed}(3) = [0, 0, 0, 1, 0, 0, \dots]M</math>The token embedding vectors are added to their respective positional encoding vectors (see below), producing the sequence of input vectors.
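The equivalence between the table lookup and the one-hot multiplication can be checked with a small Python sketch. The embedding matrix below is an arbitrary example with <math>|V| = 5</math> and <math>d_{\text{emb}} = 2</math>; the function names are hypothetical.

```python
# Hypothetical embedding matrix M with |V| = 5 rows and d_emb = 2 columns.
M = [
    [0.0, 0.1],
    [1.0, 1.1],
    [2.0, 2.1],
    [3.0, 3.1],
    [4.0, 4.1],
]


def embed_lookup(token_id):
    """Embedding as a table lookup: select row token_id of M."""
    return M[token_id]


def embed_one_hot(token_id):
    """Embedding as a row vector times matrix: one_hot(token_id) M."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(M))]
    return [
        sum(one_hot[i] * M[i][j] for i in range(len(M)))
        for j in range(len(M[0]))
    ]


by_lookup = embed_lookup(3)
by_matmul = embed_one_hot(3)
```

Both routes return row 3 of the matrix; implementations use the lookup because it avoids multiplying by the many zeros of the one-hot vector.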

The dimension of an embedding vector is called ''hidden size'' or ''embedding size'' and written as <math>d_{\text{emb}}</math>.<ref name=":03"/> This size is written as <math>d_{\text{model}}</math> in the original transformer paper.<ref name="2017_Attention_Is_All_You_Need" />

=== Un-embedding ===
An un-embedding layer is almost the reverse of an embedding layer. Whereas an embedding layer converts a token identifier into a vector, an un-embedding layer converts a vector into a probability distribution over tokens.
thumb|An illustration of the top 16 token probabilities at temperature 1, for each output token in the chain-of-thought response, with colour representing how that output differs from the same prompt but at temperature 0.
The un-embedding layer is a linear-softmax layer:<math display="block">\mathrm{UnEmbed}(x) = \mathrm{softmax}(xW + b)</math>The matrix <math>W</math> has shape <math>(d_{\text{emb}}, |V|)</math>. Some architectures use the transpose of the embedding matrix <math>M</math> as the un-embedding matrix <math>W</math> in order to avoid needing double the amount of embedding-related parameters and to avoid divergence during training. This practice is called ''weight tying''.<ref>{{Citation |last1=Press |first1=Ofir |title=Using the Output Embedding to Improve Language Models |date=2017-02-21 |arxiv=1608.05859 |last2=Wolf |first2=Lior}}</ref>
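A minimal Python sketch of the linear-softmax layer with weight tying is given below. The tiny embedding matrix, the zero bias, and the function names are assumptions made for illustration.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def unembed(x, W, b):
    """UnEmbed(x) = softmax(xW + b), with x a row vector of length d_emb."""
    logits = [
        sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))
    ]
    return softmax(logits)


# Hypothetical embedding matrix M of shape (|V| = 3, d_emb = 2).
M = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
# Weight tying: the un-embedding matrix is the transpose of M, shape (d_emb, |V|).
W_tied = [list(column) for column in zip(*M)]
b = [0.0, 0.0, 0.0]

probs = unembed([2.0, 0.0], W_tied, b)
```

With tied weights, a vector that points in the same direction as a token's embedding receives the largest logit for that token, so token 0 is the most probable output here.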

=== Positional encoding ===
thumb|Illustration of (absolute) positional encoding with parameters <math>N=10000, d=100</math>

A positional encoding is a fixed-size vector representation of the relative positions of tokens within a sequence: it provides the transformer model with information about ''where'' the words are in the input sequence. This induces a bias towards the order of the input sequence, so that, for example, the input sequence "man bites dog" is processed differently from "dog bites man".

The positional encoding is defined as a function of type <math>f: \R \to \R^d</math>, where <math>d</math> is a positive even integer. The full positional encoding defined in the original paper<ref name="2017_Attention_Is_All_You_Need" /> is:<math display="block">(f(t)_{2k}, f(t)_{2k+1}) = (\sin(\theta), \cos(\theta)) \quad \forall k \in \{0, 1, \ldots, d/2 - 1\}</math>where <math>\theta = \frac{t}{r^k}, r = N^{2/d}</math>.

Here, <math>N</math> is a free parameter that should be significantly larger than the largest position <math>t</math> that would be input into the positional encoding function. The original paper uses <math>N=10000</math>.
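The definition translates directly into code. The following Python sketch computes <math>f(t)</math> for a single position; the function name is hypothetical, and a practical implementation would usually compute all positions at once as a matrix.

```python
import math


def positional_encoding(t, d, N=10000):
    """Sinusoidal encoding f(t): entries (2k, 2k+1) are (sin(theta), cos(theta))."""
    if d <= 0 or d % 2 != 0:
        raise ValueError("d must be a positive even integer")
    r = N ** (2 / d)
    encoding = []
    for k in range(d // 2):
        theta = t / r ** k
        encoding.extend([math.sin(theta), math.cos(theta)])
    return encoding


origin = positional_encoding(0, d=4)
fifth = positional_encoding(5, d=100)
```

At position 0 every sine entry is 0 and every cosine entry is 1; the first pair of entries oscillates with the shortest period, and later pairs oscillate progressively more slowly.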

The function is in a simpler form when written as a complex function of type <math>f: \R \to \mathbb C^{d/2}</math><math display="block">f(t) = \left(e^{it/r^k}\right)_{k=0, 1, \ldots, \frac d 2 - 1}</math>where <math>r = N^{2/d}</math>.

The main reason for using this positional encoding function is that using it, shifts are linear transformations:<math display="block">f(t + \Delta t) = \mathrm{diag}(f(\Delta t)) f(t)</math>where <math>\Delta t \in \R</math> is the distance one wishes to shift. This allows the transformer to take any encoded position, and find the encoding of the position n-steps-ahead or n-steps-behind, by a matrix multiplication.

By taking a linear sum, any convolution can also be implemented as a linear transformation:<math display="block">\sum_j c_j f(t + \Delta t_j) = \left(\sum_j c_j \,\mathrm{diag}(f(\Delta t_j))\right) f(t)</math>for any constants <math>c_j</math>. This allows the transformer to take any encoded position and find a linear sum of the encoded locations of its neighbors. This sum of encoded positions, when fed into the attention mechanism, would create attention weights on its neighbors, much like what happens in a convolutional neural network language model. In the authors' words, "we hypothesized it would allow the model to easily learn to attend by relative position."

In typical implementations, all operations are done over the real numbers, not the complex numbers, but since complex multiplication can be implemented as real 2-by-2 matrix multiplication, this is a mere notational difference.
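The shift property can be verified numerically in the real-valued form. In the Python sketch below, each (sine, cosine) pair is rotated by the angle <math>\Delta t / r^k</math>, which is the real 2-by-2 counterpart of multiplying by <math>e^{i \Delta t / r^k}</math>. The function names are hypothetical.

```python
import math


def positional_encoding(t, d, N=10000):
    """Sinusoidal encoding: entries (2k, 2k+1) are (sin(theta), cos(theta))."""
    r = N ** (2 / d)
    encoding = []
    for k in range(d // 2):
        theta = t / r ** k
        encoding.extend([math.sin(theta), math.cos(theta)])
    return encoding


def shift(encoding, delta, N=10000):
    """Map f(t) to f(t + delta) using only delta: a block-diagonal rotation."""
    d = len(encoding)
    r = N ** (2 / d)
    shifted = []
    for k in range(d // 2):
        phi = delta / r ** k
        s, c = encoding[2 * k], encoding[2 * k + 1]
        # Angle-addition formulas, i.e. a 2-by-2 rotation acting on (s, c).
        shifted.append(s * math.cos(phi) + c * math.sin(phi))
        shifted.append(c * math.cos(phi) - s * math.sin(phi))
    return shifted


via_shift = shift(positional_encoding(7, d=8), delta=3)
direct = positional_encoding(10, d=8)
```

The rotation depends only on the offset and not on the starting position, which is what makes the shift a fixed linear transformation.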

=== Encoder–decoder (overview) ===
thumb|One encoder–decoder block|220x220px
thumb|A transformer is composed of stacked encoder layers and decoder layers.
Like earlier seq2seq models, the original transformer model used an '''encoder–decoder''' architecture. The encoder consists of encoding layers that process all the input tokens together one layer after another, while the decoder consists of decoding layers that iteratively process the encoder's output and the decoder's output tokens so far.

The purpose of each encoder layer is to create contextualized representations of the tokens, where each representation corresponds to a token that "mixes" information from other input tokens via self-attention mechanism. Each decoder layer contains two attention sublayers: (1) cross-attention for incorporating the output of encoder (contextualized input token representations), and (2) self-attention for "mixing" information among the input tokens to the decoder (i.e. the tokens generated so far during inference time).<ref>{{cite web|url=https://indico.io/blog/sequence-modeling-neural-networks-part2-attention-models/|title=Sequence Modeling with Neural Networks (Part 2): Attention Models|date=2016-04-18|website=Indico|access-date=2019-10-15|archive-date=2020-10-21|archive-url=https://web.archive.org/web/20201021203352/https://indico.io/blog/sequence-modeling-neural-networks-part2-attention-models/|url-status=live |last1=Lintz |first1=Nathan }}</ref><ref name=":1">{{cite web |last=Alammar |first=Jay |title=The Illustrated transformer |url=http://jalammar.github.io/illustrated-transformer/ |url-status=live |archive-url=https://web.archive.org/web/20201018061610/https://jalammar.github.io/illustrated-transformer/ |archive-date=2020-10-18 |access-date=2019-10-15 |website=jalammar.github.io}}</ref>

Both the encoder and decoder layers have a feed-forward neural network for additional processing of their outputs and contain residual connections and layer normalization steps.<ref name=":1" /> These feed-forward layers contain most of the parameters in a transformer model.

=== Feedforward network ===
{{Anchor|FFN|Feedforward network|Feedforward module}}
thumb|The feedforward network module. It is a two-layered network that maps <math>d_{\text{emb}}</math>-dimensional vectors into <math>d_{\text{emb}}</math>-dimensional vectors.
The feedforward network (FFN) modules in a transformer are 2-layered multilayer perceptrons:<math display="block">\mathrm{FFN}(x) = \phi(xW^{(1)} + b^{(1)})W^{(2)} + b^{(2)}</math>where <math>W^{(1)}</math> and <math>W^{(2)}</math> are weight matrices and <math>b^{(1)}</math> and <math>b^{(2)}</math> are bias vectors, and <math>\phi</math> is its activation function. The original transformer used ReLU activation.

The number of neurons in the middle layer is called ''intermediate size'' (GPT),<ref>{{Cite web |last=Team |first=Keras |title=Keras documentation: GPT2Backbone model |url=https://keras.io/api/keras_nlp/models/gpt2/gpt2_backbone/ |access-date=2024-08-08 |website=keras.io |language=en}}</ref> ''filter size'' (BERT),<ref name=":03" /> or ''feedforward size'' (BERT).<ref name=":03" /> It is typically larger than the embedding size. For example, in both GPT-2 series and BERT series, the intermediate size of a model is 4 times its embedding size: <math>d_{\text{ffn}} = 4 d_{\text{emb}}</math>.
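The feedforward module can be sketched in plain Python as below, with <math>d_{\text{emb}} = 2</math> and <math>d_{\text{ffn}} = 4 d_{\text{emb}} = 8</math>. The weights are hand-picked for illustration (they happen to make the network compute the identity) and are not taken from any trained model.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(x, W, b):
    """Row vector x times matrix W, plus bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = relu(x W1 + b1) W2 + b2."""
    return linear(relu(linear(x, W1, b1)), W2, b2)


d_emb, d_ffn = 2, 8  # intermediate size is 4 times the embedding size
# Hand-picked weights: hidden units 0..3 hold x0, -x0, x1, -x1; the rest are unused.
W1 = [
    [1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0],
]
b1 = [0.0] * d_ffn
# The output recombines relu(x) - relu(-x) = x for each coordinate.
W2 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]] + [[0.0, 0.0]] * 4
b2 = [0.0] * d_emb

y = ffn([3.0, -2.0], W1, b1, W2, b2)
```

The example shows the shapes involved: the first matrix widens the vector from <math>d_{\text{emb}}</math> to <math>d_{\text{ffn}}</math>, the nonlinearity acts elementwise, and the second matrix projects back to <math>d_{\text{emb}}</math>.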

=== Scaled dot-product attention ===
{{Main|Dot-product attention}}

==== Attention head ====
thumb|Scaled dot-product attention, block diagram
thumb|Exact dimension counts within an attention head module
The attention mechanisms used in the transformer architecture are scaled dot-product attention units. For each unit, the transformer model learns three weight matrices: the query weights <math>W^Q</math>, the key weights <math>W^K</math>, and the value weights <math>W^V</math>.

The module takes three sequences, a query sequence, a key sequence, and a value sequence. The query sequence is a sequence of length <math>\ell_{\text{seq, query}}</math>, and each entry is a vector of dimension <math>d_{\text{emb, query}}</math>. Similarly for the key and value sequences.

For each vector <math>x_{i, \text{query}}</math> in the query sequence, it is multiplied by a matrix <math>W^Q</math> to produce a query vector <math>q_i = x_{i, \text{query}} W^Q</math>. The matrix of all query vectors is the query matrix:<math display="block">Q = X_{\text{query}} W^Q</math>Similarly, we construct the key matrix <math>K = X_{\text{key}} W^K</math> and the value matrix <math>V = X_{\text{value}} W^V</math>.

It is usually the case that all <math>W^Q, W^K, W^V </math> are square matrices, meaning <math>d_{\text{emb, query}}= d_{\text{query}}</math>, etc.

Attention weights are calculated using the query and key vectors: the attention weight <math>a_{ij}</math> from token <math>i</math> to token <math>j</math> is the dot product between <math>q_i</math> and <math>k_j</math>. The attention weights are divided by the square root of the dimension of the key vectors, <math>\sqrt{d_k}</math>, which stabilizes gradients during training, and passed through a softmax which normalizes the weights. The fact that <math>W^Q</math> and <math>W^K</math> are different matrices allows attention to be non-symmetric: if token <math>i</math> attends to token <math>j</math> (i.e. <math>q_i\cdot k_j</math> is large), this does not necessarily mean that token <math>j</math> will attend to token <math>i</math> (i.e. <math>q_j\cdot k_i</math> could be small). The output of the attention unit for token <math>i</math> is the weighted sum of the value vectors of all tokens, weighted by <math>a_{ij}</math>, the attention from token <math>i</math> to each token.

The attention calculation for all tokens can be expressed as one large matrix calculation using the softmax function, which is useful for training because optimized matrix-multiplication routines can compute it quickly. The matrices <math>Q</math>, <math>K</math> and <math>V</math> are defined as the matrices where the <math>i</math>th rows are vectors <math>q_i</math>, <math>k_i</math>, and <math>v_i</math> respectively. Then we can represent the attention as<math display="block">\begin{align} \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}}\right)V \end{align}</math>

where the softmax is applied over each of the rows of the matrix.
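The matrix formula can be written out in plain Python as follows. This sketch takes already-projected <math>Q</math>, <math>K</math> and <math>V</math> as lists of row vectors; the helper names and the toy matrices are assumptions for illustration, and real implementations use optimized array libraries instead of nested lists.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(z):
    top = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    # The softmax is applied to each row (one row per query) separately.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


# Two queries attending over three key/value pairs.
Q = [[1.0, 0.0], [0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
output, weights = attention(Q, K, V)
```

The second query is the zero vector, so all its scores are equal and its output is simply the average of the value vectors; the first query puts more weight on the keys it aligns with.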

The number of dimensions in a query vector is ''query size'' <math>d_{\text{query}}</math> and similarly for the ''key size'' <math>d_{\text{key}}</math> and ''value size'' <math>d_{\text{value}}</math>. The output dimension of an attention head is its ''head dimension'' <math>d_{\text{head}}</math>. The attention mechanism requires the following three equalities to hold:<math display="block">\ell_{\text {seq, key}}=\ell_{\text {seq, value}}, \;d_{\text {query}}=d_{\text {key}}, \; d_{\text {value}}=d_{\text {head}} </math>but is otherwise unconstrained.

If the attention head is used in a self-attention fashion, then <math>X_{\text{query}} = X_{\text{key}} = X_{\text{value}} </math>. If the attention head is used in a cross-attention fashion, then usually <math>X_{\text{query}} \neq X_{\text{key}} = X_{\text{value}} </math>. It is theoretically possible for all three to be different, but that is rarely the case in practice.

==== Multihead attention ====
thumb|Multihead attention, block diagram
thumb|Exact dimension counts within a multihead attention module
One set of <math>\left( W^Q, W^K, W^V \right)</math> matrices is called an ''attention head'', and each layer in a transformer model has multiple attention heads. While each attention head attends to the tokens that are relevant to each token, multiple attention heads allow the model to do this for different definitions of "relevance". Specifically, the query and key projection matrices, <math>W^Q</math> and <math>W^K</math>, which are involved in the attention score computation, define the "relevance". Meanwhile, the value projection matrix <math>W^V</math>, in combination with the part of the output projection matrix <math>W^O</math>, determines how the attended tokens influence what information is passed to subsequent layers and ultimately the output logits. In addition, the scope of attention, or the range of token relationships captured by each attention head, can expand as tokens pass through successive layers. This allows the model to capture more complex and long-range dependencies in deeper layers. Many transformer attention heads encode relevance relations that are meaningful to humans. For example, some attention heads can attend mostly to the next word, while others mainly attend from verbs to their direct objects.<ref>{{cite journal|last1=Clark|first1=Kevin |last2=Khandelwal|first2=Urvashi|last3=Levy|first3=Omer|last4=Manning|first4=Christopher D.|date=August 2019|title=What Does BERT Look at? An Analysis of BERT's Attention|url=https://www.aclweb.org/anthology/W19-4828|journal=Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP|location=Florence, Italy|publisher=Association for Computational Linguistics|pages=276–286|doi=10.18653/v1/W19-4828|doi-access=free|access-date=2020-05-20|archive-date=2020-10-21|archive-url=https://web.archive.org/web/20201021211357/https://www.aclweb.org/anthology/W19-4828/|url-status=live|arxiv=1906.04341}}</ref> The computations for each attention head can be performed in parallel, which allows for fast processing. The outputs for the attention layer are concatenated to pass into the feedforward neural network layers.

Concretely, let the multiple attention heads be indexed by <math>i</math>, then we have<math display="block">\text{MultiheadAttention}(Q, K, V) = \text{Concat}_{i \in [n_{\text{heads}}]}(\text{Attention}(XW^Q_i, XW^K_i, XW^V_i)) W^O</math> where the matrix <math>X</math> is the concatenation of word embeddings, and the matrices <math>W^Q_i, W^K_i, W^V_i</math> are "projection matrices" owned by individual attention head <math>i</math>, and <math>W^O</math> is a final projection matrix owned by the whole multihead attention module.

It is theoretically possible for each attention head to have a different head dimension <math>d_{\text{head}}</math>, but that is rarely the case in practice.

As an example, in the smallest GPT-2 model, there are only self-attention mechanisms. It has the following dimensions:<math display="block">d_{\text{emb}} = 768, n_{\text{head}} = 12, d_{\text{head}} = 64</math>Since <math>12 \times 64 = 768</math>, its output projection matrix <math>W^O \in \R^{(12 \times 64) \times 768}</math> is a square matrix.
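The same bookkeeping can be reproduced at a smaller scale. The Python sketch below implements multihead self-attention with <math>d_{\text{emb}} = 8</math>, two heads, and <math>d_{\text{head}} = 4</math>, so that the concatenated head outputs again have width <math>d_{\text{emb}}</math>. The random weights and all helper names are assumptions for illustration; biases are omitted.

```python
import math
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def multihead_attention(X, heads, W_O):
    """Concatenate the per-head outputs along the feature axis, then apply W_O."""
    per_head = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    concat = [[v for out in per_head for v in out[t]] for t in range(len(X))]
    return matmul(concat, W_O)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_emb, n_heads, d_head, seq_len = 8, 2, 4, 3
# Each head owns its own (W_Q, W_K, W_V), each of shape (d_emb, d_head).
heads = [
    tuple(random_matrix(d_emb, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
W_O = random_matrix(n_heads * d_head, d_emb, rng)  # square, since 2 * 4 = 8
X = random_matrix(seq_len, d_emb, rng)
Y = multihead_attention(X, heads, W_O)
```

The output has the same shape as the input, one <math>d_{\text{emb}}</math>-dimensional vector per token, which is what allows attention modules to be stacked.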

==== Masked attention ====
The transformer architecture is constructed to calculate output tokens iteratively. Assuming <math>t = 0</math> refers to the calculation of the first output token <math>i = 0</math>, for step <math>t > 0</math>, the output token <math>i = 0</math> shall remain constant. This ensures properties of the model similar to autoregressive models.<ref name="2017_Attention_Is_All_You_Need" /> Therefore, at every time step <math>t</math>, the calculation for all outputs <math>i</math> should not have access to tokens at position <math>j</math> for <math>j \geq i</math> (as it naturally is the case for time step <math>t=i</math>, when tokens <math>j>t</math> are not yet calculated). This behavior may be accomplished before the softmax stage by adding a mask matrix <math>M</math> that is <math>-\infty</math> at entries where the attention link must be cut, and <math>0</math> at other places:<math display="block">\begin{align} \text{MaskedAttention}(Q, K, V) = \text{softmax}\left(M + \frac{QK^\mathrm{T}}{\sqrt{d_k}}\right)V \end{align}</math> The following matrix is commonly used in decoder self-attention modules, called "causal masking":<math display="block">M_{\text{causal}} = \begin{bmatrix} 0 & -\infty & -\infty & \dots & -\infty \\ 0 & 0 & -\infty & \dots & -\infty \\ 0 & 0 & 0 & \dots & -\infty \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 0 \end{bmatrix} </math>

In words, it means that each token can pay attention to itself, and every token before it, but not any after it. A non-masked attention module can be thought of as a masked attention module where the mask has all entries zero. As an example of an uncommon use of mask matrix, the XLNet considers all masks of the form <math>P M_{\text{causal}} P^{-1} </math>, where <math>P </math> is a random permutation matrix.<ref>{{Cite journal |last1=Yang |first1=Zhilin |last2=Dai |first2=Zihang |last3=Yang |first3=Yiming |last4=Carbonell |first4=Jaime |last5=Salakhutdinov |first5=Russ R |last6=Le |first6=Quoc V |date=2019 |title=XLNet: Generalized Autoregressive Pretraining for Language Understanding |url=https://proceedings.neurips.cc/paper/2019/hash/dc6a7e655d7e5840e66733e9ee67cc69-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32|arxiv=1906.08237 }}</ref>
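The effect of the causal mask on the attention weights can be seen in a short Python sketch. The function names are hypothetical, and the input is assumed to be the already scaled score matrix <math>QK^\mathrm{T}/\sqrt{d_k}</math>.

```python
import math

NEG_INF = float("-inf")


def causal_mask(n):
    """n-by-n mask: 0 on and below the diagonal, -inf above it."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]


def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention_weights(scaled_scores):
    """Row-wise softmax of (M_causal + scaled_scores)."""
    n = len(scaled_scores)
    M = causal_mask(n)
    return [
        softmax([M[i][j] + scaled_scores[i][j] for j in range(n)])
        for i in range(n)
    ]


# With all scores equal, each token spreads its attention evenly
# over itself and the tokens before it.
weights = masked_attention_weights([[0.0] * 3 for _ in range(3)])
```

Entries above the diagonal receive weight exactly zero, so no information flows from later positions to earlier ones.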

=== Encoder ===
thumb|One encoder layer
An encoder consists of an embedding layer, followed by multiple encoder layers.

Each encoder layer consists of two major components: a self-attention mechanism and a feed-forward layer. It takes an input as a sequence of input vectors, applies the self-attention mechanism, to produce an intermediate sequence of vectors, then applies the feed-forward layer for each vector individually. Schematically, we have:<math display="block">\begin{aligned} \text{given input vectors } & h_0, h_1, \dots\\ \text{combine them into a matrix } H &= \begin{bmatrix} h_0 \\ h_1 \\ \vdots \end{bmatrix} \\ \text{EncoderLayer}(H) &= \begin{bmatrix} \text{FFN}(\text{MultiheadAttention}(H, H, H)_0) \\ \text{FFN}(\text{MultiheadAttention}(H, H, H)_1) \\ \vdots \end{bmatrix} \end{aligned}</math>

where <math>\text{FFN}</math> stands for "feed-forward network". We can more succinctly write it as<math display="block">\text{EncoderLayer}(H) = \text{FFN}(\text{MultiheadAttention}(H, H, H)) </math>with the implicit convention that the <math>\text{FFN}</math> is applied to each row of the matrix individually.
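The composition can be expressed as a short Python sketch in which the attention module and the FFN are passed in as functions. The two stand-ins used here (uniform averaging in place of learned attention, and doubling in place of a learned FFN) are deliberately trivial assumptions, chosen only so that the data flow is easy to follow.

```python
def encoder_layer(H, multihead_attention, ffn):
    """EncoderLayer(H) = FFN(MultiheadAttention(H, H, H)), FFN applied row by row."""
    attended = multihead_attention(H, H, H)
    return [ffn(row) for row in attended]


def encoder(H, layers):
    """Apply a stack of encoder layers, each given as (attention, ffn)."""
    for multihead_attention, ffn in layers:
        H = encoder_layer(H, multihead_attention, ffn)
    return H


# Toy stand-ins, for illustration only.
def mean_attention(Q, K, V):
    """Every query attends uniformly to all value rows."""
    mean = [sum(col) / len(V) for col in zip(*V)]
    return [list(mean) for _ in Q]


def double(row):
    return [2.0 * v for v in row]


H = [[1.0, 0.0], [3.0, 4.0]]
out = encoder(H, [(mean_attention, double), (mean_attention, double)])
```

The attention step mixes information across rows (tokens), whereas the FFN step touches each row independently.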

The encoder layers are stacked. The first encoder layer takes the sequence of input vectors from the embedding layer, producing a sequence of vectors. This sequence of vectors is processed by the second encoder, and so on. The output from the final encoder layer is then used by the decoder.

As the encoder processes the entire input all at once, every token can attend to every other token (all-to-all attention), so there is no need for causal masking.

=== Decoder ===
thumb|One decoder layer
A decoder consists of an embedding layer, followed by multiple decoder layers, followed by an un-embedding layer.

Each decoder layer consists of three major components: a causally masked self-attention mechanism, a cross-attention mechanism, and a feed-forward neural network. The decoder functions in a similar fashion to the encoder, but an additional attention mechanism is inserted which instead draws relevant information from the encodings generated by the encoders. This mechanism can also be called the ''encoder–decoder attention''.<ref name="2017_Attention_Is_All_You_Need" /><ref name=":1" />

Like the first encoder, the first decoder takes positional information and embeddings of the output sequence as its input, rather than encodings. The transformer must not use the current or future output to predict an output, so the output sequence must be partially masked to prevent this reverse information flow.<ref name="2017_Attention_Is_All_You_Need" /> This allows for autoregressive text generation. For decoding, all-to-all attention is inappropriate, because a token cannot attend to tokens not yet generated. Thus, the self-attention module in the decoder is causally masked.

In contrast, the cross-attention mechanism attends to the output vectors of the encoder, which are computed before the decoder starts decoding. Consequently, there is no need for masking in the cross-attention mechanism.

Schematically, we have:<math display="block">\begin{aligned} H' &= \text{MaskedMultiheadAttention}(H, H, H) \\ \text{DecoderLayer}(H) &=\text{FFN}(\text{MultiheadAttention}(H', H^E, H^E)) \end{aligned} </math>where <math>H^E </math> is the matrix with rows being the output vectors from the encoder.
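The schematic can be sketched in Python as follows. To keep the example short, it uses a single attention head without learned projections, and it omits residual connections and layer normalization; these simplifications, along with the function names and toy inputs, are assumptions for illustration.

```python
import math


def attention(Q, K, V, causal=False):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Cut the links from position i to any later position j > i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, V)) / total for c in range(len(V[0]))]
        )
    return out


def decoder_layer(H, H_E, ffn):
    H_prime = attention(H, H, H, causal=True)  # masked self-attention
    attended = attention(H_prime, H_E, H_E)    # cross-attention, unmasked
    return [ffn(row) for row in attended]


def identity(row):
    return list(row)


H_E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # encoder output, 3 tokens
H = [[1.0, 0.0], [0.0, 2.0]]                # decoder input, 2 tokens
H_changed = [[1.0, 0.0], [5.0, -3.0]]       # same first token, different second

out = decoder_layer(H, H_E, identity)
out_changed = decoder_layer(H_changed, H_E, identity)
```

Because of the causal mask, the output at position 0 is the same for both decoder inputs: changing a later token cannot affect an earlier position, while the cross-attention step is free to look at every encoder vector.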

The last decoder is followed by a final un-embedding layer to produce the output probabilities over the vocabulary. Then, one of the tokens is sampled according to the probability, and the decoder can be run again to produce the next token, etc., autoregressively generating output text.
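The sampling loop can be sketched in Python as below. The function <code>generate</code>, the end-of-sequence identifier <code>eos_id</code>, and the toy model are hypothetical; the toy model stands in for the full decoder and always assigns probability 1 to a single next token so that the result is deterministic.

```python
import random


def generate(next_token_probs, prompt_ids, max_new_tokens, eos_id, rng):
    """Autoregressive sampling: sample a token, append it, and run the model again."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = next_token_probs(ids)  # distribution over the vocabulary
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def toy_model(ids):
    """Hypothetical stand-in for the decoder: always predicts (last id + 1) mod 4."""
    probs = [0.0, 0.0, 0.0, 0.0]
    probs[(ids[-1] + 1) % 4] = 1.0
    return probs


generated = generate(toy_model, [0], max_new_tokens=10, eos_id=3, rng=random.Random(0))
```

Each iteration feeds the whole sequence generated so far back into the model, which is the sense in which generation is autoregressive.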

== Full transformer architecture ==

=== Sublayers ===
thumb|(a) One encoder layer and one decoder layer. (b) Two encoder layers and two decoder layers. The sublayers are labelled as well.
Each encoder layer contains 2 sublayers: the self-attention and the feedforward network. Each decoder layer contains 3 sublayers: the causally masked self-attention, the cross-attention, and the feedforward network.
thumb|Transformer encoder with norm-first and norm-last
thumb|Transformer decoder with norm-first and norm-last
thumb|Block diagram for the full transformer architecture
[[File:Transformer,_schematic_object_hierarchy,_for_implementation_in_object-oriented_programming.png|thumb|Schematic object hierarchy for the full transformer architecture, in object-oriented programming style]]
The final points of detail are the residual connections and layer normalization (denoted as "LayerNorm", or "LN" in the following), which, while conceptually unnecessary, are necessary for numerical stability and convergence.

The residual connection, which is introduced to avoid vanishing gradient issues and stabilize the training process, can be expressed as follows: <math>y = F(x) + x</math>. The expression indicates that an output <math>y</math> is the sum of the transformation of input <math>x</math> (<math>F(x)</math>) and the input itself (<math>x</math>). Adding the input <math>x</math> can preserve the input information and avoid issues when the gradient of <math>F(x)</math> is close to zero.

Similarly to how the feedforward network modules are applied individually to each vector, the LayerNorm is also applied individually to each vector.

{{Anchor|pre-LN}}There are two common conventions in use: the ''post-LN'' and the ''pre-LN'' convention. In the post-LN convention, the output of each sublayer is <math display="block">\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))</math>where <math>\mathrm{Sublayer}(x)</math> is the function implemented by the sublayer itself.

In the pre-LN convention, the output of each sublayer is<math display="block">x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))</math>The original 2017 transformer used the post-LN convention. It was difficult to train and required careful hyperparameter tuning and a "warm-up" in learning rate, where it starts small and gradually increases. The pre-LN convention, proposed several times in 2018,<ref>{{Citation |last1=Wang |first1=Qiang |title=Learning Deep Transformer Models for Machine Translation |date=2019-06-04 |arxiv=1906.01787 |last2=Li |first2=Bei |last3=Xiao |first3=Tong |last4=Zhu |first4=Jingbo |last5=Li |first5=Changliang |last6=Wong |first6=Derek F. |last7=Chao |first7=Lidia S.}}</ref> was found to be easier to train, requiring no warm-up, leading to faster convergence.<ref name="auto1" />
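The two conventions differ only in where the normalization sits relative to the residual addition, as the following Python sketch shows. The layer normalization here has no learned gain or bias, and the zero sublayer is a deliberately trivial stand-in; both are simplifying assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln(x, sublayer):
    """Post-LN: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln(x, sublayer):
    """Pre-LN: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def zero_sublayer(v):
    """Stand-in sublayer that outputs zeros, to isolate the residual path."""
    return [0.0] * len(v)


x = [1.0, 2.0, 3.0, 6.0]
post = post_ln(x, zero_sublayer)
pre = pre_ln(x, zero_sublayer)
```

In the pre-LN form the input passes through the residual path unchanged, whereas in the post-LN form even the residual path is renormalized at every sublayer.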

=== Pseudocode ===
The following is the pseudocode for a standard pre-LN encoder–decoder transformer, adapted from ''Formal Algorithms for Transformers'':<ref>{{Citation |last1=Phuong |first1=Mary |title=Formal Algorithms for Transformers |date=2022-07-19 |arxiv=2207.09238 |last2=Hutter |first2=Marcus}}</ref>

 '''input:''' Encoder input t_e
        Decoder input t_d
 '''output:''' Array of probability distributions, with shape (decoder vocabulary size x length(decoder output sequence))
 /* encoder */
 z_e ← encoder.tokenizer(t_e)
 '''for''' '''each''' t '''in''' 1:length(z_e) '''do'''
     z_e[t] ← encoder.embedding(z_e[t]) + encoder.positional_embedding(t)
 '''for''' '''each''' l '''in''' 1:length(encoder.layers) '''do'''
     layer ← encoder.layers[l]
     /* first sublayer */
     z_e_copy ← copy(z_e)
     '''for each''' t '''in''' 1:length(z_e) '''do'''
         z_e[t] ← layer.layer_norm(z_e[t])
     z_e ← layer.multihead_attention(z_e, z_e, z_e)
     '''for each''' t '''in''' 1:length(z_e) '''do'''
         z_e[t] ← z_e[t] + z_e_copy[t]
     /* second sublayer */
     z_e_copy ← copy(z_e)
     '''for each''' t '''in''' 1:length(z_e) '''do'''
         z_e[t] ← layer.layer_norm(z_e[t])
     z_e ← layer.feedforward(z_e)
     '''for each''' t '''in''' 1:length(z_e) '''do'''
         z_e[t] ← z_e[t] + z_e_copy[t]
 '''for each''' t '''in''' 1:length(z_e) '''do'''
     z_e[t] ← encoder.final_layer_norm(z_e[t])
 /* decoder */
 z_d ← decoder.tokenizer(t_d)
 '''for''' '''each''' t '''in''' 1:length(z_d) '''do'''
     z_d[t] ← decoder.embedding(z_d[t]) + decoder.positional_embedding(t)
 '''for''' '''each''' l '''in''' 1:length(decoder.layers) '''do'''
     layer ← decoder.layers[l]
     /* first sublayer */
     z_d_copy ← copy(z_d)
     '''for each''' t '''in''' 1:length(z_d) '''do'''
         z_d[t] ← layer.layer_norm(z_d[t])
     z_d ← layer.masked_multihead_attention(z_d, z_d, z_d)
     '''for each''' t '''in''' 1:length(z_d) '''do'''
         z_d[t] ← z_d[t] + z_d_copy[t]
     /* second sublayer */
     z_d_copy ← copy(z_d)
     '''for each''' t '''in''' 1:length(z_d) '''do'''
         z_d[t] ← layer.layer_norm(z_d[t])
     z_d ← layer.multihead_attention(z_d, z_e, z_e)
     '''for each''' t '''in''' 1:length(z_d) '''do'''
         z_d[t] ← z_d[t] + z_d_copy[t]
     /* third sublayer */
     z_d_copy ← copy(z_d)
     '''for each''' t '''in''' 1:length(z_d) '''do'''
         z_d[t] ← layer.layer_norm(z_d[t])
     z_d ← layer.feedforward(z_d)
     '''for each''' t '''in''' 1:length(z_d) '''do'''
         z_d[t] ← z_d[t] + z_d_copy[t]
 z_d ← decoder.final_layer_norm(z_d)
 output_distributions ← []
 '''for each''' t '''in''' 1:length(z_d) '''do'''
     output_distributions.append(decoder.unembed(z_d[t]))
 '''return''' output_distributions

=== Terminology === The transformer architecture, being modular, allows variations. Several common variations are described here.<ref name="Raffel Shazeer 2020 Exploring the Limits"/>

{{Anchor|encoder-only}}An "encoder-only" transformer applies the encoder to map an input text into a sequence of vectors that represent the input text. This is usually used for text embedding and representation learning for downstream applications. BERT is encoder-only. Encoder-only models are less often used currently, as they were found to be not significantly better than training an encoder–decoder transformer and then taking just the encoder.<ref name=":4" /> They are also referred to as "all-to-all" or "BERT-like".

{{Anchor|decoder-only}}A "decoder-only" transformer is not literally decoder-only, since without an encoder, the cross-attention mechanism has nothing to attend to. Thus, each decoder layer in a decoder-only transformer is composed of just two sublayers: the causally masked self-attention and the feedforward network. This is usually used for text generation and instruction following. The models in the GPT series and Chinchilla series are decoder-only. They are also referred to as "autoregressive" or "causal".

{{Anchor|encoder-decoder}}An "encoder–decoder" transformer is generally the same as the original transformer, with 2 sublayers per encoder layer and 3 sublayers per decoder layer, etc. They might have minor architectural improvements, such as alternative activation functions, changing the location of normalization, etc. This is also usually used for text generation and instruction following. The models in the T5 series are encoder–decoder.<ref name="Raffel Shazeer 2020 Exploring the Limits"/>

{{Anchor|prefixLM}}A "prefixLM" (prefix language model) is a decoder-only architecture, but with prefix masking, which is different from causal masking. Specifically, it has a mask of the form<ref name="Raffel Shazeer 2020 Exploring the Limits"/>{{Pg|location=Figure 3}}<math display="block">M_{\text{prefixLM}} = \begin{bmatrix} \mathbf{0} & -\infty \\ \mathbf{0} & M_{\text{causal}} \end{bmatrix} </math>where the first columns correspond to the "prefix", and the subsequent columns correspond to the autoregressively generated text based on the prefix. Such models resemble encoder–decoder models, but have less "sparsity". They are rarely used, though they are cited as theoretical possibilities and benchmarked comparisons.<ref name=":4">{{Citation |last1=Tay |first1=Yi |title=UL2: Unifying Language Learning Paradigms |date=2023-02-28 |arxiv=2205.05131 |last2=Dehghani |first2=Mostafa |last3=Tran |first3=Vinh Q. |last4=Garcia |first4=Xavier |last5=Wei |first5=Jason |last6=Wang |first6=Xuezhi |last7=Chung |first7=Hyung Won |last8=Shakeri |first8=Siamak |last9=Bahri |first9=Dara}}</ref>
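The block structure of the prefix mask can be made concrete with a small sketch that builds it as an additive mask (0 for "may attend", −∞ for "may not attend"). The function names are illustrative assumptions, not part of any library.

```python
NEG_INF = float("-inf")

def causal_mask(n):
    """n x n additive causal mask: 0 where column j <= row i, -inf otherwise."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def prefix_lm_mask(prefix_len, total_len):
    """Additive prefixLM mask.

    Prefix positions attend to the whole prefix (all-to-all), and positions
    after the prefix attend to the prefix plus earlier generated positions.
    """
    mask = causal_mask(total_len)
    for i in range(prefix_len):
        for j in range(prefix_len):
            mask[i][j] = 0.0  # open up the upper triangle inside the prefix block
    return mask

# A prefix of 2 tokens followed by 2 generated tokens.
mask = prefix_lm_mask(prefix_len=2, total_len=4)
```

The top-left block of <code>mask</code> is all zeros, the top-right block is all −∞, and the bottom-right block is an ordinary causal mask, matching the block matrix above.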

There are also mixed seq2seq models. For example, in 2020, Google Translate replaced the previous RNN-encoder–RNN-decoder model with a transformer-encoder–RNN-decoder model, as transformer-based decoders did not appear to significantly increase quality unlike the encoder, while the RNN decoder was much faster.<ref name="gtrans" />

== Subsequent work == === Alternative activation functions === The original transformer uses the ReLU activation function. Other activation functions have since been adopted. The Llama series and PaLM used SwiGLU;<ref name=":14">{{Cite arXiv |eprint=2002.05202 |class=cs.LG |first=Noam |last=Shazeer |title=GLU Variants Improve Transformer |date=2020-02-01}}</ref> both GPT-1 and BERT<ref name=":03" /> used GELU.<ref>{{Cite arXiv |last1=Hendrycks |first1=Dan |last2=Gimpel |first2=Kevin |date=2016-06-27 |title=Gaussian Error Linear Units (GELUs) |class=cs.LG |eprint=1606.08415v5 |language=en}}</ref>

Alternative activation functions are often used in combination with Gated Linear Units in the feedforward module.<ref name=":14" />
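As an illustration, the sketch below implements GELU, Swish, and a SwiGLU-style gated feedforward module for a single vector in plain Python. It assumes no bias terms; the weight names <code>W_gate</code>, <code>W_up</code> and <code>W_down</code> are labels chosen here for readability, and the small matrices are made-up values.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    """Swish (SiLU) with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def ffn_swiglu(x, W_gate, W_up, W_down):
    """Gated feedforward module: W_down @ (swish(W_gate @ x) * (W_up @ x))."""
    gate = [swish(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(W_down, hidden)

# Illustrative weights: model width 2, hidden width 3.
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W_down = [[1.0, 1.0, 1.0], [1.0, 0.0, -1.0]]
y = ffn_swiglu([1.0, -2.0], W_gate, W_up, W_down)
```

Compared with an ungated feedforward module, the gated form uses three weight matrices instead of two, so the hidden width is often reduced to keep the parameter count comparable.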

=== Alternative normalizations === The normalization used in the transformer can be different from LayerNorm. One example is RMSNorm<ref>{{Cite journal |last1=Zhang |first1=Biao |last2=Sennrich |first2=Rico |date=2019 |title=Root Mean Square Layer Normalization |url=https://proceedings.neurips.cc/paper/2019/hash/1e8a19426224ca89e83cef47f1e7f53b-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=32|arxiv=1910.07467 }}</ref> which is used in the Llama series. Other examples include ScaleNorm<ref name=":9">{{Cite journal |last1=Nguyen |first1=Toan Q. |last2=Salazar |first2=Julian |date=2019-11-02 |editor-last=Niehues |editor-first=Jan |editor2-last=Cattoni |editor2-first=Rolando |editor3-last=Stüker |editor3-first=Sebastian |editor4-last=Negri |editor4-first=Matteo |editor5-last=Turchi |editor5-first=Marco |editor6-last=Ha |editor6-first=Thanh-Le |editor7-last=Salesky |editor7-first=Elizabeth |editor8-last=Sanabria |editor8-first=Ramon |editor9-last=Barrault |editor9-first=Loic |title=Transformers without Tears: Improving the Normalization of Self-Attention |url=https://aclanthology.org/2019.iwslt-1.17 |journal=Proceedings of the 16th International Conference on Spoken Language Translation |location=Hong Kong |publisher=Association for Computational Linguistics|doi=10.5281/zenodo.3525484 |arxiv=1910.05895 }}</ref> and FixNorm.<ref name=":9" />
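A minimal sketch of RMSNorm for a single vector is given below. Unlike LayerNorm, it does not subtract the mean and only rescales by the root mean square; the <code>eps</code> value and the placement of <code>eps</code> inside the square root are common choices assumed here rather than taken from the text.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-coordinate gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

x = [3.0, 4.0]
y = rms_norm(x, gain=[1.0, 1.0])
y_scaled = rms_norm([10.0 * v for v in x], gain=[1.0, 1.0])  # rescaled input
```

The output has a root mean square close to 1, and rescaling the input leaves the output essentially unchanged, while the mean of the output is generally nonzero.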

=== Alternative positional encodings === Transformers may use other positional encoding methods than sinusoidal.<ref>{{cite journal |last1=Dufter |first1=Philipp |last2=Schmitt |first2=Martin |last3=Schütze |first3=Hinrich |title=Position Information in Transformers: An Overview |journal=Computational Linguistics |date=September 2022 |volume=48 |issue=3 |pages=733–763 |doi=10.1162/coli_a_00445 |doi-access=free |arxiv=2102.11090 }}</ref>

The original transformer paper reported using a learned positional encoding,<ref>{{Cite journal |last1=Gehring |first1=Jonas |last2=Auli |first2=Michael |last3=Grangier |first3=David |last4=Yarats |first4=Denis |last5=Dauphin |first5=Yann N. |date=2017-07-17 |title=Convolutional Sequence to Sequence Learning |url=https://proceedings.mlr.press/v70/gehring17a.html |journal=Proceedings of the 34th International Conference on Machine Learning |language=en |publisher=PMLR |pages=1243–1252}}</ref> but found it not superior to the sinusoidal one.<ref name="2017_Attention_Is_All_You_Need" /> Later work<ref>{{Citation |last1=Haviv |first1=Adi |title=Transformer Language Models without Positional Encodings Still Learn Positional Information |date=2022-12-05 |arxiv=2203.16634 |last2=Ram |first2=Ori |last3=Press |first3=Ofir |last4=Izsak |first4=Peter |last5=Levy |first5=Omer}}</ref> found that causal masking itself provides enough signal to a transformer decoder that it can learn to implicitly perform absolute positional encoding without the positional encoding module.

==== RoPE ==== {{Anchor|Rotary positional embedding}}RoPE (rotary positional embedding)<ref>{{Cite arXiv|last1=Su |first1=Jianlin |last2=Lu |first2=Yu |last3=Pan |first3=Shengfeng |last4=Murtadha |first4=Ahmed |last5=Wen |first5=Bo |last6=Liu |first6=Yunfeng |date=2021-04-01 |title=RoFormer: Enhanced Transformer with Rotary Position Embedding |class=cs.CL |eprint=2104.09864 }}</ref> is best explained by considering a list of 2-dimensional vectors <math>[(x^{(1)}_1, x^{(2)}_1), (x^{(1)}_2, x^{(2)}_2), (x^{(1)}_3, x^{(2)}_3), ...]</math>. Now pick some angle <math>\theta</math>. Then the RoPE encoding is<math display="block">\text{RoPE}\big(x^{(1)}_m, x^{(2)}_m, m\big) = \begin{pmatrix} \cos m \theta & - \sin m \theta \\ \sin m \theta & \cos m \theta \end{pmatrix} \begin{pmatrix} x^{(1)}_m \\ x^{(2)}_m \\ \end{pmatrix} = \begin{pmatrix} x^{(1)}_m \cos m\theta - x^{(2)}_m \sin m \theta \\ x^{(2)}_m \cos m\theta + x^{(1)}_m \sin m \theta \\ \end{pmatrix} </math>Equivalently, if we write the 2-dimensional vectors as complex numbers <math>z_m := x^{(1)}_m + i x^{(2)}_m</math>, then the RoPE encoding is just multiplication by a unit complex number, that is, a rotation by the angle <math>m\theta</math>:<math display="block">\text{RoPE}\big(z_m, m\big) = e^{i m\theta} z_m </math>For a list of <math>2n</math>-dimensional vectors, a RoPE encoder is defined by a sequence of angles <math>\theta^{(1)}, ..., \theta^{(n)}</math>. The RoPE encoding with angle <math>\theta^{(j)}</math> is then applied to the <math>j</math>-th pair of coordinates.

The benefit of RoPE is that the dot-product between two vectors depends on their relative location only:<math display="block"> \text{RoPE}\big(x, m\big)^T\text{RoPE}\big(y, n\big) = \text{RoPE}\big(x, m+k\big)^T\text{RoPE}\big(y, n+k\big) </math> for any integer <math>k</math>.
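The relative-position property can be checked numerically. The sketch below applies the pairwise rotation defined above to two 4-dimensional vectors and compares the dot product before and after shifting both positions by the same offset. The angles and vectors are arbitrary illustrative values.

```python
import math

def rope(x, position, thetas):
    """Rotate the i-th coordinate pair (x[2i], x[2i+1]) by the angle position * thetas[i]."""
    out = []
    for i, theta in enumerate(thetas):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(position * theta), math.sin(position * theta)
        out.extend([a * c - b * s, b * c + a * s])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

thetas = [1.0, 0.1]          # one angle per coordinate pair (illustrative)
x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, -1.0, 2.0, 0.0]

d_original = dot(rope(x, 3, thetas), rope(y, 7, thetas))
d_shifted = dot(rope(x, 3 + 10, thetas), rope(y, 7 + 10, thetas))  # both positions shifted by k = 10
```

The two dot products agree up to floating-point error, because rotating both vectors by the same extra angle leaves their inner product unchanged.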

==== ALiBi ==== ALiBi (Attention with Linear Biases)<ref>{{Cite arXiv|last1=Press |first1=Ofir |last2=Smith |first2=Noah A. |last3=Lewis |first3=Mike |date=2021-08-01 |title=Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation |class=cs.CL |eprint=2108.12409 }}</ref> is not a ''replacement'' for the positional encoder on the original transformer. Instead, it is an ''additional'' positional encoder that is directly plugged into the attention mechanism. Specifically, the ALiBi attention mechanism is<math display="block">\begin{align} \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}} + s B\right)V \end{align}</math>Here, <math>s</math> is a real number ("scalar"), and <math>B</math> is the ''linear bias'' matrix defined by<math display="block">B = \begin{pmatrix} 0 & 1 & 2 & 3 & \cdots \\ -1 & 0 & 1 & 2 & \cdots \\ -2 & -1 & 0 & 1 & \cdots \\ -3 & -2 & -1 & 0 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \\ \end{pmatrix} </math>that is, <math>B_{i, j} = j - i</math>. The idea is that the linear bias matrix is a softened mask. Just as <math>0</math> represents full attention paid and <math>-\infty</math> represents no attention paid, the linear bias matrix increases attention paid in one direction and decreases attention paid in the other direction.

ALiBi allows pretraining on short context windows, then fine-tuning on longer context windows. Since it is directly plugged into the attention mechanism, it can be combined with any positional encoder that is plugged into the "bottom" of the entire network (which is where the sinusoidal encoder of the original transformer and many others are located; RoPE, by contrast, is applied to the queries and keys inside each attention layer).
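The effect of the bias term can be seen in a small single-head sketch that follows the formula above, with <math>B_{i,j} = j - i</math> and one scalar <math>s</math>. The sign convention is the one used in this section; with a causal mask and a suitable sign of <math>s</math>, the same term penalizes distant past positions. Function names are illustrative.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def alibi_bias(n):
    """Linear bias matrix with B[i][j] = j - i."""
    return [[j - i for j in range(n)] for i in range(n)]

def alibi_attention(Q, K, V, s):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k) + s * B) V."""
    d_k = len(K[0])
    B = alibi_bias(len(Q))
    out = []
    for i, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + s * B[i][j]
            for j, k in enumerate(K)
        ]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out

# With zero queries and keys, the content scores vanish and only the bias acts.
Q = K = [[0.0, 0.0]] * 3
V = [[1.0], [2.0], [3.0]]
unbiased = alibi_attention(Q, K, V, s=0.0)  # uniform attention over the values
biased = alibi_attention(Q, K, V, s=1.0)    # attention shifted toward larger j
```

With <code>s = 0</code> the output is the plain average of the values, while a positive <code>s</code> shifts weight toward later positions, which is the "softened mask" behaviour described above.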

==== Relative Position Encodings ==== Relative Position Encodings<ref>{{Cite arXiv |last1=Shaw |first1=Peter |last2=Uszkoreit |first2=Jakob |last3=Vaswani |first3=Ashish |date=2018 |title=Self-Attention with Relative Position Representations |class=cs.CL |eprint=1803.02155}}</ref> is similar to ALiBi, but more generic:<math display="block">\begin{align} \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}} + B\right)V \end{align}</math>where <math>B</math> is a Toeplitz matrix, that is, <math>B_{i, j} = B_{i', j'}</math> whenever <math>i-j = i'-j'</math>. This contrasts with the original sinusoidal positional encoding, which is an "absolute positional encoding".<ref>{{Citation |last1=Ke |first1=Guolin |title=Rethinking Positional Encoding in Language Pre-training |date=2021-03-15 |arxiv=2006.15595 |last2=He |first2=Di |last3=Liu |first3=Tie-Yan}}</ref>

=== Efficient implementation === The transformer model has been implemented in standard deep learning frameworks such as TensorFlow and PyTorch. ''Transformers'' is a library produced by Hugging Face that supplies transformer-based architectures and pretrained models.<ref name="wolf2020" />

==== KV caching ==== When an autoregressive transformer is used for inference, such as generating text, the query vector is different at each step, but the already-computed key and value vectors are always the same. The '''KV caching''' method saves the computed key and value vectors at each attention block, so that they are not recomputed at each new token. PagedAttention applies memory paging to KV caching.<ref>{{Cite book |last1=Kwon |first1=Woosuk |title=Proceedings of the 29th Symposium on Operating Systems Principles |last2=Li |first2=Zhuohan |last3=Zhuang |first3=Siyuan |last4=Sheng |first4=Ying |last5=Zheng |first5=Lianmin |last6=Yu |first6=Cody Hao |last7=Gonzalez |first7=Joseph |last8=Zhang |first8=Hao |last9=Stoica |first9=Ion |date=2023-10-23 |publisher=Association for Computing Machinery |isbn=979-8-4007-0229-7 |series=SOSP '23 |location=New York, NY, USA |pages=611–626 |chapter=Efficient Memory Management for Large Language Model Serving with PagedAttention |doi=10.1145/3600006.3613165 |chapter-url=https://dl.acm.org/doi/10.1145/3600006.3613165 |arxiv=2309.06180}}</ref><ref>{{Citation |title=vllm-project/vllm |date=2024-06-20 |url=https://github.com/vllm-project/vllm |access-date=2024-06-20 |publisher=vLLM}}</ref><ref>{{Cite web |first1=Woosuk Kwon |last1=Zhuohan Li |first2=Siyuan |last2=Zhuang |first3=Ying |last3=Sheng |first4=Lianmin |last4=Zheng |first5=Cody |last5=Yu |first6=Joey |last6=Gonzalez |first7=Hao |last7=Zhang |first8=Ion |last8=Stoica |date=2023-06-20 |title=vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention |url=https://blog.vllm.ai/2023/06/20/vllm.html |access-date=2024-06-20 |website=vLLM Blog |language=en}}</ref>

If a transformer is used with a baked-in prompt, such as ["You are a customer support agent..."], then the key and value vectors can be computed for the prompt, and saved on disk. The saving in compute is significant when the model is used for many short real-time interactions, such as in online chatbots.

In general, when a user uses an autoregressive transformer to generate a continuation to a sequence of tokens, the model would first perform a forward-pass on this sequence, whereby the KV caches over this sequence are computed. This is called '''prefilling'''. Hyperscalers serving large Transformer models may use '''disaggregated inference''', wherein prefilling and decoding are performed on separately specialized hardware.<ref>{{Citation |last1=Hu |first1=Cunchen |title=Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads |date=2024-01-20 |arxiv=2401.11181 |last2=Huang |first2=Heyang |last3=Xu |first3=Liangliang |last4=Chen |first4=Xusheng |last5=Xu |first5=Jiang |last6=Chen |first6=Shuang |last7=Feng |first7=Hao |last8=Wang |first8=Chenxi |last9=Wang |first9=Sa}}</ref>
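The following toy sketch illustrates prefilling and cached decoding for a single attention head. It assumes small hand-written projection matrices and works on plain lists; the class and function names are hypothetical, and a real implementation would keep one such cache per attention block and per head.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(len(values[0]))]

class CachedAttention:
    """Single attention head that stores keys and values of all tokens seen so far."""

    def __init__(self, W_q, W_k, W_v):
        self.W_q, self.W_k, self.W_v = W_q, W_k, W_v
        self.keys, self.values = [], []

    def step(self, x):
        """Process one new token vector; only its own key and value are computed."""
        self.keys.append(matvec(self.W_k, x))
        self.values.append(matvec(self.W_v, x))
        return attend(matvec(self.W_q, x), self.keys, self.values)

def full_recompute(xs, W_q, W_k, W_v):
    """Output at the last position, recomputing every key and value from scratch."""
    keys = [matvec(W_k, x) for x in xs]
    values = [matvec(W_v, x) for x in xs]
    return attend(matvec(W_q, xs[-1]), keys, values)

W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.1], [0.0, 1.0]]
W_v = [[1.0, 1.0], [0.0, 2.0]]
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

head = CachedAttention(W_q, W_k, W_v)
for x in prompt:              # prefilling: fill the cache from the prompt
    head.step(x)
cached_out = head.step(new_token)  # decoding: one new key/value pair only
reference_out = full_recompute(prompt + [new_token], W_q, W_k, W_v)
```

Both paths produce the same output for the new token, but the cached path computes one key and one value per step instead of recomputing them for the whole sequence.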

==== FlashAttention ==== {{Anchor|FlashAttention|MQA|GQA|Multihead attention|Multi-query attention|Grouped-query attention}}FlashAttention<ref>{{cite book |last1=Dao |first1=Tri |last2=Ermon |first2=Stefano |last3=Fu |first3=Dan |last4=Ré |first4=Christopher |last5=Rudra |first5=Atri |title=Advances in Neural Information Processing Systems 35 |chapter=FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness |date=2022 |pages=16344–16359 |doi=10.52202/068431-1189 |isbn=978-1-7138-7108-8 }}</ref> is an algorithm that implements the transformer attention mechanism efficiently on a GPU. It is a communication-avoiding algorithm that performs matrix multiplications in blocks, such that each block fits within the cache of a GPU, and by careful management of the blocks it minimizes data copying between GPU caches (as data movement is slow). See the page on softmax for details.
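Processing attention in blocks is possible because a softmax-weighted sum can be accumulated one block at a time, provided a running maximum and a running normalizer are maintained. The sketch below shows this numerical idea for a single query in plain Python. It is not a GPU kernel and does not model memory traffic; the function names and block sizes are illustrative assumptions.

```python
import math

def attention_reference(q, K, V):
    """Standard softmax attention for one query, materializing all scores at once."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, V)) / total for c in range(len(V[0]))]

def attention_blockwise(q, K, V, block_size):
    """Same result, visiting keys and values one block at a time."""
    running_max = float("-inf")
    denom = 0.0                 # running softmax normalizer
    acc = [0.0] * len(V[0])     # running unnormalized output
    for start in range(0, len(K), block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in K_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, V_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vc for a, vc in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [0.2, -0.4, 1.0]
K = [[1.0, 0.0, 2.0], [0.5, 0.5, -1.0], [3.0, -2.0, 0.1], [0.0, 1.0, 1.0], [-1.0, 0.3, 0.7]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, -0.3]]
reference = attention_reference(q, K, V)
blocked = attention_blockwise(q, K, V, block_size=2)
```

Only one block of scores exists at any time, so the full score row (and, for many queries, the full <math>N \times N</math> score matrix) never has to be stored.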

An improved version, FlashAttention-2,<ref>{{cite web |title=Stanford CRFM |url=https://crfm.stanford.edu/2023/07/17/flash2.html |access-date=2023-07-18 |website=crfm.stanford.edu}}</ref><ref>{{cite web |date=2023-06-17 |title=FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning |url=https://princeton-nlp.github.io/flash-atttention-2/ |access-date=2023-07-18 |website=Princeton NLP }}</ref><ref>{{cite web |title=Introducing Together AI Chief Scientist Tri Dao, as he releases FlashAttention-2 to speed up model training and inference |url=https://together.ai/blog/tri-dao-flash-attention |access-date=2023-07-18 |website=TOGETHER }}</ref> was developed to cater to the rising demand for language models capable of handling longer context lengths. It offers enhancements in work partitioning and parallelism, enabling it to achieve up to 230 TFLOPs/s on A100 GPUs (FP16/BF16), a 2x speed increase over the original FlashAttention.

Key advancements in FlashAttention-2 include the reduction of non-matmul FLOPs, improved parallelism over the sequence length dimension, better work partitioning between GPU warps, and added support for head dimensions up to 256 and multi-query attention (MQA) and grouped-query attention (GQA).<ref>{{cite arXiv |last1=Ainslie |first1=Joshua |title=GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints |date=2023-12-23 |eprint=2305.13245 |last2=Lee-Thorp |first2=James |last3=de Jong |first3=Michiel |last4=Zemlyanskiy |first4=Yury |last5=Lebrón |first5=Federico |last6=Sanghai |first6=Sumit|class=cs.CL }}</ref>

Benchmarks revealed FlashAttention-2 to be up to 2x faster than FlashAttention and up to 9x faster than a standard attention implementation in PyTorch. Future developments include optimization for new hardware like H100 GPUs and new data types like FP8.

FlashAttention-4 focuses on pipelining to increase instruction throughput, and was developed to perform particularly well on Blackwell GPUs.<ref>{{Cite web |title=We reverse-engineered Flash Attention 4 |url=https://modal.com/blog/reverse-engineer-flash-attention-4 |access-date=2025-09-26 |website=Modal |language=en}}</ref>

==== Multi-Query Attention ==== {{Anchor|MHA|MQA|GQA|Multihead attention|Multi-query attention|Grouped-query attention}} thumb|Comparison between several different forms of attention mechanism and the amount of KV caching necessary for each
Multi-Query Attention changes the Multihead Attention mechanism.<ref>{{Cite arXiv|last1=Chowdhery |first1=Aakanksha |last2=Narang |first2=Sharan |last3=Devlin |first3=Jacob |last4=Bosma |first4=Maarten |last5=Mishra |first5=Gaurav |last6=Roberts |first6=Adam |last7=Barham |first7=Paul |last8=Chung |first8=Hyung Won |last9=Sutton |first9=Charles |last10=Gehrmann |first10=Sebastian |last11=Schuh |first11=Parker |last12=Shi |first12=Kensen |last13=Tsvyashchenko |first13=Sasha |last14=Maynez |first14=Joshua |last15=Rao |first15=Abhishek |date=2022-04-01 |title=PaLM: Scaling Language Modeling with Pathways |class=cs.CL |eprint=2204.02311 }}</ref> Whereas normally,

<math display="block">\text{MultiheadAttention}(Q, K, V) = \text{Concat}_{i \in [n_{\text{heads}}]}\left(\text{Attention}(XW^Q_i, XW^K_i, XW^V_i)\right) W^O</math>with Multi-Query Attention, there is just one <math>W^K, W^V</math>, thus:

<math display="block">\text{MultiQueryAttention}(Q, K, V) = \text{Concat}_{i \in [n_{\text{heads}}]}\left(\text{Attention}(XW^Q_i, XW^K, XW^V)\right) W^O</math>

This has a neutral effect on model quality and training speed, but increases inference speed.

More generally, grouped-query attention (GQA) partitions attention heads into groups, each of which shares the key-value pair. MQA is GQA with one group, while standard Multihead Attention is GQA with the maximal number of groups.<ref>{{Citation |last1=Ainslie |first1=Joshua |title=GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints |date=2023-12-23 |arxiv=2305.13245 |last2=Lee-Thorp |first2=James |last3=de Jong |first3=Michiel |last4=Zemlyanskiy |first4=Yury |last5=Lebrón |first5=Federico |last6=Sanghai |first6=Sumit}}</ref> [[File:DeepSeek_MoE_and_MLA_(DeepSeek-V2).svg|thumb|The architecture of V2, showing both MLA and a variant of mixture of experts<ref name=":73">{{Citation |author1=DeepSeek-AI |title=DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model |date=19 June 2024 |arxiv=2405.04434 |last2=Liu |first2=Aixin |last3=Feng |first3=Bei |last4=Wang |first4=Bin |last5=Wang |first5=Bingxuan |last6=Liu |first6=Bo |last7=Zhao |first7=Chenggang |last8=Dengr |first8=Chengqi |last9=Ruan |first9=Chong}}.</ref>{{Pg|location=Figure 2}}]] {{Anchor|MLA|Multihead Latent Attention}} Multihead Latent Attention (MLA) is a low-rank approximation to standard MHA. Specifically, each hidden vector, before entering the attention mechanism, is first projected to two low-dimensional spaces ("latent space"), one for query and one for key-value (KV vector). This design minimizes the KV cache, as only the low-dimensional KV vector needs to be cached.<ref name=":73" />
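The relationship between multihead, grouped-query and multi-query attention can be sketched by the mapping from query heads to shared key-value heads, together with the resulting KV cache size. The helper names and the example sizes (8 heads, head dimension 64, 1024 tokens) are illustrative assumptions.

```python
def kv_head_for_query_head(query_head, n_heads, n_groups):
    """Index of the shared key-value head used by a given query head."""
    if n_heads % n_groups != 0:
        raise ValueError("n_heads must be divisible by n_groups")
    heads_per_group = n_heads // n_groups
    return query_head // heads_per_group

def kv_cache_entries(n_tokens, n_kv_heads, head_dim):
    """Number of cached scalars per layer (keys plus values)."""
    return 2 * n_tokens * n_kv_heads * head_dim

n_heads = 8
mha_map = [kv_head_for_query_head(h, n_heads, n_groups=8) for h in range(n_heads)]  # one KV head per query head
gqa_map = [kv_head_for_query_head(h, n_heads, n_groups=2) for h in range(n_heads)]  # 4 query heads per KV head
mqa_map = [kv_head_for_query_head(h, n_heads, n_groups=1) for h in range(n_heads)]  # a single shared KV head

mha_cache = kv_cache_entries(n_tokens=1024, n_kv_heads=8, head_dim=64)
gqa_cache = kv_cache_entries(n_tokens=1024, n_kv_heads=2, head_dim=64)
mqa_cache = kv_cache_entries(n_tokens=1024, n_kv_heads=1, head_dim=64)
```

The KV cache shrinks in proportion to the number of key-value heads, which is the main source of the inference speed-up.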

==== Speculative decoding ==== {{Main|Speculative decoding}} Speculative decoding<ref name=":2">{{Citation |last1=Leviathan |first1=Yaniv |title=Fast Inference from Transformers via Speculative Decoding |date=2023-05-18 |arxiv=2211.17192 |last2=Kalman |first2=Matan |last3=Matias |first3=Yossi}}</ref><ref>{{cite web|url=https://yaofu.notion.site/Towards-100x-Speedup-Full-Stack-Transformer-Inference-Optimization-43124c3688e14cffaf2f1d6cbdf26c6c|title=Towards 100x Speedup: Full Stack Transformer Inference Optimization|first=Yao|last=Fu|date=2023-12-11|website=yaofu.notion.site}}</ref> is a method to accelerate token decoding. Similarly to speculative execution in CPUs, future tokens are computed quickly, then verified. If the quickly computed tokens are incorrect, they are discarded and recomputed slowly.

The key factor in speculative decoding is that a transformer decoder can verify faster than it can decode, in the following sense.

Suppose we have two transformer models like GPT-3 and GPT-3-small, both with a context window size of 512. To generate an entire context window autoregressively with greedy decoding, GPT-3 must be run 512 times, each run generating one of the tokens <math>x_1, x_2, ..., x_{512}</math>, taking time <math>512 T_{\text{GPT-3}}</math>. However, if we had some educated guess for the values of these tokens, we could verify all of them in parallel, in one run of the model, by checking that each <math>x_t</math> is indeed the token with the largest log-likelihood in the <math>t</math>-th output.

In speculative decoding, a smaller model or some other simple heuristic is used to generate a few speculative tokens that are subsequently verified by the larger model. For example, suppose we use GPT-3-small to generate four speculative tokens: <math>\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_4</math>. This only takes <math>4 T_{\text{GPT-3-small}}</math>. These tokens are then run through the larger GPT-3 in one go. Suppose that <math>\tilde{x}_1</math> and <math>\tilde{x}_2</math> are verified by GPT-3 as what it would have picked, then those are kept, but <math>\tilde{x}_3</math> is not, so <math>\tilde{x}_3, \tilde{x}_4</math> are discarded, and GPT-3 is run on those. This would take <math>4 T_{\text{GPT-3-small}} + 3 T_{\text{GPT-3}}</math>, which might be shorter than <math>4 T_{\text{GPT-3}}</math>.
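The greedy case can be sketched with two toy "models", each a plain function from a token prefix to its next token. These functions are hypothetical stand-ins, not real language models, and the verification pass is simulated by a loop, whereas a transformer would score all drafted positions in one parallel run. As a simplification that differs slightly from the description above, the sketch takes the large model's own token at the first mismatch directly from the verification pass instead of rerunning the large model.

```python
def greedy_decode(next_token, prompt, n_new):
    """Plain autoregressive decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(next_token(tokens))
    return tokens

def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding; returns the tokens and the number of target passes."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The small model drafts k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The large model checks every drafted position (one parallel pass in practice).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)
            if draft[i] != expected:
                break  # keep the large model's token, discard the rest of the draft
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new], target_passes

def target_next(prefix):
    """Hypothetical large model: a deterministic toy rule."""
    return (3 * prefix[-1] + len(prefix)) % 11

def draft_next(prefix):
    """Hypothetical small model: agrees with the large one except at every fifth position."""
    guess = target_next(prefix)
    return guess if len(prefix) % 5 else (guess + 1) % 11

prompt = [1, 2]
n_new = 12
expected_tokens = greedy_decode(target_next, prompt, n_new)
tokens, target_passes = speculative_decode(target_next, draft_next, prompt, n_new, k=4)
```

The speculative output is identical to ordinary greedy decoding with the large model, while the number of large-model passes is smaller than the number of generated tokens whenever the draft is often right.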

For non-greedy decoding, similar ideas apply, except the speculative tokens are accepted or rejected stochastically, in a way that guarantees the final output distribution is the same as if speculative decoding was not used.<ref name=":2" /><ref>{{Citation |last1=Chen |first1=Charlie |title=Accelerating Large Language Model Decoding with Speculative Sampling |date=2023-02-02 |arxiv=2302.01318 |last2=Borgeaud |first2=Sebastian |last3=Irving |first3=Geoffrey |last4=Lespiau |first4=Jean-Baptiste |last5=Sifre |first5=Laurent |last6=Jumper |first6=John}}</ref> thumb|Multi-token prediction {{Anchor|Multi-Token Prediction}}In Multi-Token Prediction, a single forward pass creates a final embedding vector, which then is un-embedded into a token probability. However, that vector can then be further processed by another transformer block to predict the ''next'' token, and so on for arbitrarily many steps into the future. This trades off accuracy for speed, since each new token costs just one more transformer block, rather than the entire stack.<ref>{{cite arXiv |eprint=2404.19737 |last1=Gloeckle |first1=Fabian |author2=Badr Youbi Idrissi |last3=Rozière |first3=Baptiste |last4=Lopez-Paz |first4=David |last5=Synnaeve |first5=Gabriel |title=Better & Faster Large Language Models via Multi-token Prediction |date=2024 |class=cs.CL }}</ref><ref>{{cite arXiv |eprint=2412.19437 |author1=DeepSeek-AI |last2=Liu |first2=Aixin |last3=Feng |first3=Bei |last4=Xue |first4=Bing |last5=Wang |first5=Bingxuan |last6=Wu |first6=Bochao |last7=Lu |first7=Chengda |last8=Zhao |first8=Chenggang |last9=Deng |first9=Chengqi |last10=Zhang |first10=Chenyu |last11=Ruan |first11=Chong |last12=Dai |first12=Damai |last13=Guo |first13=Daya |last14=Yang |first14=Dejian |last15=Chen |first15=Deli |last16=Ji |first16=Dongjie |last17=Li |first17=Erhang |last18=Lin |first18=Fangyun |last19=Dai |first19=Fucong |last20=Luo |first20=Fuli |last21=Hao |first21=Guangbo |last22=Chen |first22=Guanting |last23=Li 
|first23=Guowei |last24=Zhang |first24=H. |last25=Bao |first25=Han |last26=Xu |first26=Hanwei |last27=Wang |first27=Haocheng |last28=Zhang |first28=Haowei |last29=Ding |first29=Honghui |last30=Xin |first30=Huajian |title=DeepSeek-V3 Technical Report |date=2024 |class=cs.CL |display-authors=1 }}</ref>

=== Sub-quadratic transformers === Training transformer-based architectures can be expensive, especially for long inputs.<ref name="reformer">{{cite arXiv |eprint=2001.04451 |class=cs.LG |first1=Nikita |last1=Kitaev |first2=Łukasz |last2=Kaiser |title=Reformer: The Efficient Transformer |last3=Levskaya |first3=Anselm |year=2020}}</ref> Many methods have been developed to attempt to address the issue. In the image domain, Swin transformer is an efficient architecture that performs attention inside shifting windows.<ref>{{Cite book |last1=Liu |first1=Ze |last2=Lin |first2=Yutong |last3=Cao |first3=Yue |last4=Hu |first4=Han |last5=Wei |first5=Yixuan |last6=Zhang |first6=Zheng |last7=Lin |first7=Stephen |last8=Guo |first8=Baining |chapter=Swin Transformer: Hierarchical Vision Transformer using Shifted Windows |year=2021 |title=2021 IEEE/CVF International Conference on Computer Vision (ICCV) |publisher=IEEE |pages=9992–10002 |doi=10.1109/ICCV48922.2021.00986 |isbn=978-1-6654-2812-5|arxiv=2103.14030 }}</ref> In the audio domain, SepTr decouples the attention in time and frequency domains.<ref>{{Cite journal |last1=Ristea |first1=Nicolaea Catalin |last2=Ionescu |first2=Radu Tudor |last3=Khan |first3=Fahad Shahbaz |date=2022-09-18 |title=SepTr: Separable Transformer for Audio Spectrogram Processing |url=https://www.isca-archive.org/interspeech_2022/ristea22_interspeech.html |journal=Interspeech |language=en |publisher=ISCA |pages=4103–4107 |doi=10.21437/Interspeech.2022-249|arxiv=2203.09581 }}</ref> ''Long Range Arena'' (2020)<ref>{{cite arXiv |eprint=2011.04006 |class=cs.LG |first1=Yi |last1=Tay |first2=Mostafa |last2=Dehghani |title=Long Range Arena: A Benchmark for Efficient Transformers |date=2020-11-08 |last3=Abnar |first3=Samira |last4=Shen |first4=Yikang |last5=Bahri |first5=Dara |last6=Pham |first6=Philip |last7=Rao |first7=Jinfeng |last8=Yang |first8=Liu |last9=Ruder |first9=Sebastian |last10=Metzler |first10=Donald}}</ref> is a standard benchmark for comparing 
the behavior of transformer architectures over long inputs.

==== Alternative attention graphs ==== The standard attention graph is either all-to-all or causal, both of which scale as <math>O(N^2)</math> where <math>N</math> is the number of tokens in a sequence.

Reformer (2020)<ref name="reformer" /><ref>{{cite web |date=16 January 2020 |title=Reformer: The Efficient Transformer |url=http://ai.googleblog.com/2020/01/reformer-efficient-transformer.html |url-status=live |archive-url=https://web.archive.org/web/20201022210019/https://ai.googleblog.com/2020/01/reformer-efficient-transformer.html |archive-date=2020-10-22 |access-date=2020-10-22 |website=Google AI Blog}}</ref> reduces the computational load from <math>O(N^2)</math> to <math>O(N\ln N)</math> by using locality-sensitive hashing and reversible layers.<ref>{{Cite journal |last1=Gomez |first1=Aidan N |last2=Ren |first2=Mengye |last3=Urtasun |first3=Raquel |last4=Grosse |first4=Roger B |date=2017 |title=The Reversible Residual Network: Backpropagation Without Storing Activations |url=https://proceedings.neurips.cc/paper/2017/hash/f9be311e65d81a9ad8150a60844bb94c-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=30|arxiv=1707.04585 }}</ref>

Sparse attention<ref>{{Citation |last1=Child |first1=Rewon |title=Generating Long Sequences with Sparse Transformers |date=2019-04-23 |arxiv=1904.10509 |last2=Gray |first2=Scott |last3=Radford |first3=Alec |last4=Sutskever |first4=Ilya}}</ref> uses attention graphs that grow more slowly than <math>O(N^2)</math>. For example, BigBird (2020)<ref>{{cite web |date=25 March 2021 |title=Constructing Transformers For Longer Sequences with Sparse Attention Methods |url=https://ai.googleblog.com/2021/03/constructing-transformers-for-longer.html |url-status=live |archive-url=https://web.archive.org/web/20210918150757/https://ai.googleblog.com/2021/03/constructing-transformers-for-longer.html |archive-date=2021-09-18 |access-date=2021-05-28 |website=Google AI Blog}}</ref> uses random small-world networks, which grow as <math>O(N)</math>.

Ordinary transformers require a memory size that is quadratic in the size of the context window. Attention-free transformers<ref>{{cite arXiv |eprint=2105.14103 |class=cs.LG |first1=Shuangfei |last1=Zhai |first2=Walter |last2=Talbott |title=An Attention Free Transformer |date=2021-09-21 |last3=Srivastava |first3=Nitish |last4=Huang |first4=Chen |last5=Goh |first5=Hanlin |last6=Zhang |first6=Ruixiang |last7=Susskind |first7=Josh}}</ref> reduce this to a linear dependence while still retaining the advantages of a transformer by linking the key to the value.

==== Random Feature Attention ====
Random Feature Attention (2021)<ref>{{cite arXiv |last1=Peng |first1=Hao |last2=Pappas |first2=Nikolaos |last3=Yogatama |first3=Dani |last4=Schwartz |first4=Roy |last5=Smith |first5=Noah A. |last6=Kong |first6=Lingpeng |date=2021-03-19 |title=Random Feature Attention |class=cs.CL |eprint=2103.02143}}</ref> uses Fourier random features:<math display="block">\varphi(x) = \frac{1}{\sqrt D}[\cos\langle w_1, x\rangle, \sin\langle w_1, x\rangle, \ldots, \cos\langle w_D, x\rangle, \sin\langle w_D, x\rangle]^T</math>where <math>w_1, ..., w_D</math> are independent samples from the normal distribution <math>N(0, \sigma^{-2} I)</math>. This choice of parameters satisfies <math>\mathbb E[\langle \varphi(x), \varphi(y)\rangle] = e^{-\frac{\|x-y\|^2}{2\sigma^2}}</math>, or <math display="block">e^{\langle x, y\rangle/\sigma^2} = \mathbb E[\langle e^{\|x\|^2/2\sigma^2} \varphi(x), e^{\|y\|^2/2\sigma^2}\varphi(y)\rangle] \approx \langle e^{\|x\|^2/2\sigma^2} \varphi(x), e^{\|y\|^2/2\sigma^2}\varphi(y)\rangle </math>Consequently, the one-headed attention, with one query, can be written as <math display="block"> \text{Attention}(q, K, V) = \text{softmax}\left(\frac{qK^\mathrm{T}}{\sqrt{d_k}}\right)V \approx \frac{\varphi(q)^T \sum_i e^{\|k_i\|^2/2\sigma^2}\varphi(k_i) v_i^T}{\varphi(q)^T \sum_i e^{\|k_i\|^2/2\sigma^2}\varphi(k_i)}</math>where <math>\sigma = d_k^{1/4}</math>. Similarly for multiple queries, and for multihead attention.
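The kernel identity behind the feature map can be checked numerically. The sketch below is an illustration under stated assumptions: the frequencies are drawn with standard deviation <math>1/\sigma</math>, the scaling under which <math>\mathbb E[\langle \varphi(x), \varphi(y)\rangle] = e^{-\|x-y\|^2/2\sigma^2}</math> holds; the cosine and sine entries are concatenated rather than interleaved, which does not change inner products; and the dimensions and helper name are hypothetical.

```python
import numpy as np

def random_features(x, w):
    """phi(x) for rows of x with shape (n, d) and frequencies w with shape (D, d)."""
    proj = x @ w.T
    # Cosine and sine parts are concatenated; the ordering does not affect <phi(x), phi(y)>.
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1) / np.sqrt(w.shape[0])

rng = np.random.default_rng(0)
d, D, sigma = 16, 20000, 2.0
# Assumption of this sketch: frequencies have standard deviation 1 / sigma.
w = rng.standard_normal((D, d)) / sigma
x = 0.5 * rng.standard_normal((1, d))
y = 0.5 * rng.standard_normal((1, d))

approx = (random_features(x, w) @ random_features(y, w).T).item()
exact = float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2)))
print(approx, exact)  # the two values should agree closely for large D
```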

This approximation can be computed in linear time, as we can compute the matrix <math>\sum_i e^{\|k_i\|^2/2\sigma^2}\varphi(k_i) v_i^T</math> first, then multiply it with the query. In essence, we have managed to obtain a more precise version of <math display="block">\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\mathrm{T}}{\sqrt{d_k}}\right)V \approx Q(K^TV/\sqrt{d_k}) </math>

Performer (2020)<ref>{{cite arXiv |last1=Choromanski |first1=Krzysztof |last2=Likhosherstov |first2=Valerii |last3=Dohan |first3=David |last4=Song |first4=Xingyou |last5=Gane |first5=Andreea |last6=Sarlos |first6=Tamas |last7=Hawkins |first7=Peter |last8=Davis |first8=Jared |last9=Belanger |first9=David |last10=Colwell |first10=Lucy |last11=Weller |first11=Adrian |date=2020-09-30 |title=Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers |class=cs.LG |eprint=2006.03555}}</ref> uses the same Random Feature Attention, but <math>w_1, ..., w_D</math> are first independently sampled from the same normal distribution and then orthogonalized by the Gram–Schmidt process.
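The linear-time evaluation can be sketched as follows and compared against exact softmax attention. This is a minimal illustration, not a reference implementation: the helper names, sizes, and input scaling are assumptions, the frequencies are drawn with standard deviation <math>1/\sigma</math> so that the kernel identity holds, and the Gram–Schmidt step used by Performer is omitted.

```python
import numpy as np

def random_features(x, w):
    proj = x @ w.T
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1) / np.sqrt(w.shape[0])

def softmax_attention(q, k, v):
    """Exact attention; forms an (n, n) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def rfa_attention(q, k, v, w, sigma):
    """Random-feature approximation; never forms an (n, n) matrix."""
    phi_q = random_features(q, w)
    # Rescale the key features by exp(|k_i|^2 / (2 sigma^2)).
    scale = np.exp(np.sum(k**2, axis=-1, keepdims=True) / (2 * sigma**2))
    phi_k = scale * random_features(k, w)
    s = phi_k.T @ v          # sum_i phi(k_i) v_i^T, computed once for all queries
    z = phi_k.sum(axis=0)    # sum_i phi(k_i), the normaliser
    return (phi_q @ s) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
n, d, D = 64, 8, 20000
sigma = d ** 0.25
w = rng.standard_normal((D, d)) / sigma  # assumption: standard deviation 1 / sigma
q = 0.3 * rng.standard_normal((n, d))
k = 0.3 * rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

err = np.abs(rfa_attention(q, k, v, w, sigma) - softmax_attention(q, k, v)).max()
print("largest absolute difference:", err)
```

The sums <code>s</code> and <code>z</code> cost time proportional to the number of keys and are reused for every query, which is where the linear scaling comes from.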

=== Multimodality ===
Transformers can also be used or adapted for modalities (input or output) beyond just text, usually by finding a way to "tokenize" the modality.

Multimodal models can be either trained from scratch or obtained by finetuning pretrained models. A 2022 study found that transformers pretrained only on natural language can be finetuned on only 0.03% of parameters and become competitive with LSTMs on a variety of logical and visual tasks, demonstrating transfer learning.<ref>{{Cite journal |last1=Lu |first1=Kevin |last2=Grover |first2=Aditya |last3=Abbeel |first3=Pieter |last4=Mordatch |first4=Igor |date=2022-06-28 |title=Frozen Pretrained Transformers as Universal Computation Engines |url=https://ojs.aaai.org/index.php/AAAI/article/view/20729 |journal=Proceedings of the AAAI Conference on Artificial Intelligence |language=en |volume=36 |issue=7 |pages=7628–7636 |doi=10.1609/aaai.v36i7.20729 |issn=2374-3468|doi-access=free }}</ref> LLaVA was a vision-language model composed of a language model (Vicuna-13B)<ref>{{Cite web |title=Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality {{!}} LMSYS Org |url=https://lmsys.org/blog/2023-03-30-vicuna |access-date=2024-08-11 |website=lmsys.org |date=30 March 2023 |language=en}}</ref> and a vision model (ViT-L/14), connected by a linear layer. Only the linear layer is finetuned.<ref>{{Cite journal |last1=Liu |first1=Haotian |last2=Li |first2=Chunyuan |last3=Wu |first3=Qingyang |last4=Lee |first4=Yong Jae |date=2023-12-15 |title=Visual Instruction Tuning |url=https://proceedings.neurips.cc/paper_files/paper/2023/hash/6dcf277ea32ce3288914faf369fe6de0-Abstract-Conference.html |journal=Advances in Neural Information Processing Systems |language=en |volume=36 |pages=34892–34916}}</ref>

Vision transformers<ref name="auto2" /> adapt the transformer to computer vision by breaking down input images into a series of patches, turning them into vectors, and treating them like the embedding vectors of tokens in a standard transformer.
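The patch-based "tokenization" of an image can be sketched as a reshape followed by a linear projection. The image size, patch size, embedding width, and the random projection matrix below are illustrative assumptions; a trained model would learn the projection and would also add positional information.

```python
import numpy as np

def image_to_patch_tokens(image, patch, w_embed):
    """Split an (H, W, C) image into non-overlapping patches and embed each one."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be a multiple of the patch size"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p * p * C)
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    # Linear projection of each flattened patch to the model dimension.
    return patches @ w_embed

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patch, d_model = 16, 64
# Hypothetical embedding matrix standing in for learned weights.
w_embed = 0.02 * rng.standard_normal((patch * patch * 3, d_model))
tokens = image_to_patch_tokens(image, patch, w_embed)
print(tokens.shape)  # (196, 64): a 14 x 14 grid of patches, each a 64-dimensional vector
```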

Conformer<ref name="Gulati2020">{{cite arXiv |eprint=2005.08100 |first1=Anmol |last1=Gulati |first2=James |last2=Qin |title=Conformer: Convolution-augmented Transformer for Speech Recognition |last3=Chiu |first3=Chung-Cheng |last4=Parmar |first4=Niki |last5=Zhang |first5=Yu |last6=Yu |first6=Jiahui |last7=Han |first7=Wei |last8=Wang |first8=Shibo |last9=Zhang |first9=Zhengdong |last10=Wu |first10=Yonghui |last11=Pang |first11=Ruoming |year=2020 |page=|class=eess.AS }}</ref> and later Whisper<ref name="Radford Kim Xu Brockman p.">{{cite arXiv |eprint=2212.04356 |first1=Alec |last1=Radford |first2=Jong Wook |last2=Kim |title=Robust Speech Recognition via Large-Scale Weak Supervision |last3=Xu |first3=Tao |last4=Brockman |first4=Greg |last5=McLeavey |first5=Christine |last6=Sutskever |first6=Ilya |year=2022 |page=|class=eess.AS }}</ref> follow the same pattern for speech recognition, first turning the speech signal into a spectrogram, which is then treated like an image: it is broken down into a series of patches, which are turned into vectors and treated like the embedding vectors of tokens in a standard transformer.
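The first step, turning a waveform into a spectrogram, can be sketched with a short-time Fourier transform. This is a simplified illustration, not the preprocessing of either model: the sample rate, frame length, hop size, and test tone are assumptions, and steps such as mel filterbanks are omitted.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160):
    """Log-power spectrogram of shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Power in each frequency bin of each frame.
    power = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
    return np.log(power + 1e-10)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz test tone
spec = log_spectrogram(signal)
print(spec.shape)  # (98, 201): 98 time frames, 201 frequency bins
```

The resulting two-dimensional array plays the role of the image: its rows, or patches cut from it, can be projected to vectors in the same way as image patches.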

Perceivers<ref name="perceiver2021">{{cite arXiv |eprint=2103.03206 |class=cs.CV |first1=Andrew |last1=Jaegle |first2=Felix |last2=Gimeno |title=Perceiver: General Perception with Iterative Attention |date=2021-06-22 |last3=Brock |first3=Andrew |last4=Zisserman |first4=Andrew |last5=Vinyals |first5=Oriol |last6=Carreira |first6=Joao}}</ref><ref name="jaegle2021b">{{cite arXiv |eprint=2107.14795 |class=cs.LG |first1=Andrew |last1=Jaegle |first2=Sebastian |last2=Borgeaud |title=Perceiver IO: A General Architecture for Structured Inputs & Outputs |date=2021-08-02 |last3=Alayrac |first3=Jean-Baptiste |last4=Doersch |first4=Carl |last5=Ionescu |first5=Catalin |last6=Ding |first6=David |last7=Koppula |first7=Skanda |last8=Zoran |first8=Daniel |last9=Brock |first9=Andrew |last10=Shelhamer |first10=Evan |last11=Hénaff |first11=Olivier}}</ref> are a variant of transformers designed for multimodality.

For image generation, notable architectures are DALL-E 1 (2021), Parti (2022),<ref>{{Cite web |title=Parti: Pathways Autoregressive Text-to-Image Model |url=https://sites.research.google/parti/ |access-date=2024-08-09 |website=sites.research.google}}</ref> Phenaki (2023),<ref name=":13">{{Cite arXiv |last1=Villegas |first1=Ruben |last2=Babaeizadeh |first2=Mohammad |last3=Kindermans |first3=Pieter-Jan |last4=Moraldo |first4=Hernan |last5=Zhang |first5=Han |last6=Saffar |first6=Mohammad Taghi |last7=Castro |first7=Santiago |last8=Kunze |first8=Julius |last9=Erhan |first9=Dumitru |date=2022-09-29 |title=Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions |language=en |eprint=2210.02399 |class=cs.CV}}</ref> and Muse (2023).<ref name=":12">{{cite arXiv |last1=Chang |first1=Huiwen |title=Muse: Text-To-Image Generation via Masked Generative Transformers |date=2023-01-02 |eprint=2301.00704 |last2=Zhang |first2=Han |last3=Barber |first3=Jarred |last4=Maschinot |first4=A. J. |last5=Lezama |first5=Jose |last6=Jiang |first6=Lu |last7=Yang |first7=Ming-Hsuan |author7-link=Ming-Hsuan Yang |last8=Murphy |first8=Kevin |last9=Freeman |first9=William T.|class=cs.CV }}</ref> Unlike later models, DALL-E is not a diffusion model. 
Instead, it uses a decoder-only transformer that autoregressively generates a text, followed by the token representation of an image, which is then converted by a variational autoencoder to an image.<ref>{{Citation |last1=Ramesh |first1=Aditya |title=Zero-Shot Text-to-Image Generation |date=2021-02-26 |arxiv=2102.12092 |last2=Pavlov |first2=Mikhail |last3=Goh |first3=Gabriel |last4=Gray |first4=Scott |last5=Voss |first5=Chelsea |last6=Radford |first6=Alec |last7=Chen |first7=Mark |last8=Sutskever |first8=Ilya}}</ref> Parti is an encoder–decoder transformer, where the encoder processes a text prompt, and the decoder generates a token representation of an image.<ref>{{Citation |last1=Yu |first1=Jiahui |title=Scaling Autoregressive Models for Content-Rich Text-to-Image Generation |date=2022-06-21 |arxiv=2206.10789 |last2=Xu |first2=Yuanzhong |last3=Koh |first3=Jing Yu |last4=Luong |first4=Thang |last5=Baid |first5=Gunjan |last6=Wang |first6=Zirui |last7=Vasudevan |first7=Vijay |last8=Ku |first8=Alexander |last9=Yang |first9=Yinfei}}</ref> Muse is an encoder-only transformer that is trained to predict masked image tokens from unmasked image tokens. During generation, all input tokens are masked, and the highest-confidence predictions are included for the next iteration, until all tokens are predicted.<ref name=":12" /> Phenaki is a text-to-video model. It is a bidirectional masked transformer conditioned on pre-computed text tokens. The generated tokens are then decoded to a video.<ref name=":13" />
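The iterative masked decoding described for Muse can be sketched with a toy loop. Everything model-specific here is an assumption: <code>toy_predictor</code> is a hypothetical stand-in that returns random probabilities instead of a trained transformer, and the even unmasking schedule is chosen for simplicity rather than taken from the cited paper.

```python
import numpy as np

MASK = -1  # sentinel value marking a position that has not been decided yet

def toy_predictor(tokens, vocab_size, rng):
    """Hypothetical stand-in for the transformer: random probabilities per position."""
    logits = rng.standard_normal((len(tokens), vocab_size))
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)

def iterative_masked_decode(n_tokens, vocab_size, n_steps, rng):
    tokens = np.full(n_tokens, MASK)  # generation starts with every token masked
    for step in range(n_steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = toy_predictor(tokens, vocab_size, rng)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        # Commit an even share of the remaining positions at each step (assumed schedule).
        n_keep = int(np.ceil(masked.size / (n_steps - step)))
        # Keep only the highest-confidence predictions among masked positions.
        keep = masked[np.argsort(-conf[masked])[:n_keep]]
        tokens[keep] = pred[keep]
    return tokens

rng = np.random.default_rng(0)
image_tokens = iterative_masked_decode(n_tokens=16, vocab_size=32, n_steps=4, rng=rng)
print(image_tokens)
```

Positions committed in earlier steps are never overwritten, so the set of masked positions shrinks until every token has been predicted.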

== Applications ==
The transformer has had great success in natural language processing (NLP). Many large language models such as GPT-2, GPT-3, GPT-4, Gemini, AlbertAGPT, Claude, BERT, Grok, XLNet, RoBERTa and ChatGPT demonstrate the ability of transformers to perform a wide variety of NLP-related subtasks and their related real-world applications, including:
* machine translation
* time series prediction
* document summarization
* document generation
* named entity recognition (NER)<ref>{{cite journal |last1=Kariampuzha |first1=William |last2=Alyea |first2=Gioconda |last3=Qu |first3=Sue |last4=Sanjak |first4=Jaleal |last5=Mathé |first5=Ewy |last6=Sid |first6=Eric |last7=Chatelaine |first7=Haley |last8=Yadaw |first8=Arjun |last9=Xu |first9=Yanji |last10=Zhu |first10=Qian |date=2023 |title=Precision information extraction for rare disease epidemiology at scale |journal=Journal of Translational Medicine |volume=21 |issue=1 |page=157 |doi=10.1186/s12967-023-04011-y |pmc=9972634 |pmid=36855134 |doi-access=free}}</ref>
* writing computer code based on requirements expressed in natural language
* speech-to-text

Beyond traditional NLP, the transformer architecture has had success in other applications, such as:
* disaster response<ref name="disaster"/>
* biological sequence analysis
* video understanding
* protein folding (such as AlphaFold)
* evaluating chess board positions. Using static evaluation alone (that is, with no minimax search), a transformer achieved an Elo of 2895, putting it at grandmaster level.<ref name="grandmaster" />

== See also ==
* {{annotated link|seq2seq}}
* {{annotated link|Circuit (neural network)}}
* {{annotated link|Perceiver}}
* {{annotated link|Vision transformer}}
* {{annotated link|Large language model}}
* {{annotated link|BERT (language model)}}
* {{annotated link|Generative pre-trained transformer}}
* {{annotated link|T5 (language model)}}

== Notes ==
{{reflist|group=note}}

== References ==
{{Reflist}}

== Further reading ==
{{refbegin}}
* Alexander Rush, [https://nlp.seas.harvard.edu/2018/04/03/attention.html The Annotated Transformer] {{Webarchive|url=https://web.archive.org/web/20210922093841/https://nlp.seas.harvard.edu/2018/04/03/attention.html |date=2021-09-22 }}, Harvard NLP group, 3 April 2018
* {{cite arXiv |last1=Phuong |first1=Mary |last2=Hutter |first2=Marcus |title=Formal Algorithms for Transformers |date=2022 |class=cs.LG |eprint=2207.09238 }}
* {{cite arXiv |last1=Ferrando |first1=Javier |title=A Primer on the Inner Workings of Transformer-based Language Models |date=2024-05-01 |eprint=2405.00208 |last2=Sarti |first2=Gabriele |last3=Bisazza |first3=Arianna |last4=Costa-jussà |first4=Marta R.|class=cs.CL }}
* {{Cite web |title=Transformer++ |first=Gavin|last=Leech|url=https://www.gleech.org/tplus |date=2024-11-06|archive-url=https://web.archive.org/web/20250226110336/https://www.gleech.org/tplus|archive-date=2025-02-26|access-date=2025-05-08 |website=argmin gravitas}}
* {{cite patent | country = US | number = 10452978 | status = patent | title = Attention-based sequence transduction neural networks | gdate = 2019-10-22 | fdate = 2018-06-28 | pridate = 2018-06-28 | inventor = Noam M. Shazeer; Aidan Nicholas Gomez; Lukasz Mieczyslaw Kaiser; Jakob D. Uszkoreit; Llion Owen Jones; Niki J. Parmar; Illia Polosukhin; Ashish Teku Vaswani | assign1 = Google LLC }}
* {{cite web | last=Raschka | first=Sebastian | title=The Big LLM Architecture Comparison: From DeepSeek V3 to GLM-5: A Look At Modern LLM Architecture Design | website=Sebastian Raschka’s AI Magazine | date=2026-03-11 | url=https://magazine.sebastianraschka.com/p/the-big-llm-architecture-comparison | access-date=2026-03-25 }}
{{refend}}

[[Category:Google software]]
[[Category:Neural network architectures]]
[[Category:2017 in artificial intelligence]]