Transformers are a deep learning architecture that **revolutionized [[Natural Language Processing]] (NLP)** by replacing recurrent networks (RNNs) with a fully attention-based mechanism. They power models such as **BERT, GPT, and Vision Transformers (ViTs)**.

### **Key Concept: Attention Mechanism**

The **attention mechanism** lets the model **focus on the most relevant parts** of the input, improving its grasp of context. Instead of processing tokens sequentially (as RNNs do), transformers **attend to the entire sequence at once**.

![[Transformer.png | 600]]

**Self-Attention Formula:**

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Where:

- $Q$ (Query), $K$ (Key), $V$ (Value) – learned representations of the input tokens.
- $d_k$ – dimension of the key vectors; dividing by $\sqrt{d_k}$ scales the dot products to keep gradients stable.
- $\text{softmax}$ – converts the scaled scores into attention weights that sum to 1.

(A minimal NumPy sketch of this formula appears at the end of this note.)

### **Transformer Architecture**

1. **Input Embeddings** – converts tokens into numerical vectors.
2. **Positional Encoding** – adds sequence-order information (see the sinusoidal sketch at the end of this note).
3. **Self-Attention Layers** – determine relationships between tokens.
4. **Feedforward Layers** – process the attention outputs.
5. **Final Output Layer** – produces predictions (e.g., the next word in a sentence).

### **Why Transformers?**

- **Handle long-range dependencies** better than RNNs.
- **Process the entire input in parallel**, making training faster.
- **Scale** to large datasets and complex models.

### **Applications of Transformers**

- **Language Models** – GPT, BERT, T5.
- **Machine Translation** – Google Translate.
- **Image Processing** – Vision Transformers (ViTs).
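
### **Sketch: Self-Attention in NumPy**

To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The random `Q`, `K`, `V` matrices are stand-ins for the learned projections a real model would compute, and the sequence length and $d_k$ used below are illustrative choices, not values from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of value vectors

# Toy example: 4 tokens, d_k = 8 (illustrative sizes).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # (4, 8): one context-aware vector per token
```

Each output row is a mixture of all value vectors, weighted by how strongly that token's query matches every key, which is exactly how a token "sees" the whole sequence at once.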
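
### **Sketch: Sinusoidal Positional Encoding**

The positional-encoding step can be sketched the same way. This follows one common choice, the sinusoidal scheme from the original transformer paper ("Attention Is All You Need"); the function name and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: even dimensions use sin, odd dimensions use cos."""
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(d_model)[None, :]        # (1, d_model)
    # Each pair of dimensions gets a different wavelength, from 2*pi up to 10000*2*pi.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))
    return pe  # added elementwise to the input embeddings

print(positional_encoding(4, 8).shape)  # (4, 8): one encoding vector per position
```

Because self-attention is order-agnostic, adding these position-dependent vectors to the embeddings is what lets the model distinguish "dog bites man" from "man bites dog".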