# [the discoveries which led to the invention of the Transformer, the architecture underlying today's generative AI models](https://twitter.com/tsarnick/status/1807883460847325235)

***

<iframe title="Aravind Srinivas: Perplexity CEO on Future of AI, Search &amp; the Internet | Lex Fridman Podcast #434 - YouTube" src="https://www.youtube.com/embed/e-gwvmhyU7A?feature=oembed" height="113" width="200" allowfullscreen="" allow="fullscreen" style="aspect-ratio: 1.76991 / 1; width: 100%; height: 100%;"></iframe>

#### [[👤Aravind Srinivas]]:

Yoshua Bengio and Dzmitry Bahdanau wrote a groundbreaking paper on soft attention, first applied in the "Align and Translate" paper on neural machine translation. This marked a significant development in AI. Before this, Ilya Sutskever had demonstrated that you could train a large sequence-to-sequence RNN, scale it up, and surpass all phrase-based machine translation systems. However, that approach was brute force and lacked an attention mechanism, requiring substantial compute (a roughly 400-million-parameter model at the time). Bahdanau, then a grad student in Bengio's lab, identified the potential of attention mechanisms and achieved superior results with less compute, highlighting the efficacy of attention.

DeepMind's work on PixelRNN, which later evolved into WaveNet, showed that even recurrence was not necessary. They discovered that a fully convolutional model could handle autoregressive modeling using masked convolutions, allowing training to be parallelized across positions instead of relying on sequential backpropagation through time. This used GPU compute more efficiently, since training reduced to large matrix multiplications.

Building on these insights, the Google Brain team behind the Transformer paper (Vaswani et al., 2017) combined attention mechanisms with the training parallelism of WaveNet-style convolutional models. They realized that attention is more powerful than convolution for learning higher-order dependencies, because it applies multiplicative, content-dependent interactions between positions. The Transformer took the best elements of both approaches: attention mechanisms and fully parallelizable training.

Since the introduction of the Transformer in 2017, there have not been fundamental changes to the architecture. It remains the state-of-the-art backbone in many areas of AI, proving the strength and adaptability of the design.

#🤖Aravind_Srinivas
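
***

Not part of the transcript, but as a reference for the mechanism discussed above: a minimal NumPy sketch of scaled dot-product self-attention with a causal mask, showing how every position of an autoregressive model can be computed with a few matrix multiplications in parallel (the property Srinivas attributes to WaveNet-style masked convolutions and to the Transformer). All names, shapes, and initializations here are illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention with a causal mask (illustrative sketch).

    X          : (T, d) sequence of token representations
    Wq, Wk, Wv : (d, d) projection matrices (randomly initialized below)

    Every output position is produced by the same batch of matrix
    multiplications, rather than step by step as in an RNN.
    """
    T, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # (T, d) each
    scores = Q @ K.T / np.sqrt(d)                # (T, T) pairwise similarities
    # Causal mask: position t may only attend to positions <= t,
    # the same constraint that masked convolutions enforce in WaveNet.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (T, d)

# Toy usage: 5 tokens with 8-dimensional embeddings (hypothetical values)
rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

The multiplicative interaction Srinivas mentions is visible in the `Q @ K.T` term: how much one position contributes to another depends on the content of both, whereas a convolution applies fixed weights determined only by relative position.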