Transformers are a deep learning architecture that **revolutionized [[Natural Language Processing]] (NLP)** by replacing recurrent networks (RNNs) with a purely attention-based mechanism. They power models such as **BERT, GPT, and Vision Transformers (ViTs)**.
### **Key Concept: Attention Mechanism**
The **attention mechanism** allows the model to **focus on important parts** of the input, improving context understanding. Instead of processing data sequentially (like RNNs), transformers **analyze the entire sequence at once**.
![[Transformer.png | 600]]
**Self-Attention Formula:**
$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$
Where:
- $Q$ (Query), $K$ (Key), $V$ (Value) – Matrix representations of the input tokens.
- $d_k$ – Dimension of the key vectors; dividing by $\sqrt{d_k}$ keeps the dot products from growing too large.
- $\text{softmax}$ – Normalizes the scores into attention weights that sum to 1.
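To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function names and toy shapes are illustrative assumptions, not from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (seq_len_q, d_k), K: (seq_len_k, d_k), V: (seq_len_k, d_v).
    Returns the attended output and the weight matrix.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1: importance of each position
    return weights @ V, weights

# Toy self-attention: Q = K = V, a sequence of 3 tokens with d_k = d_v = 4
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape, w.shape)  # (3, 4) (3, 3)
```

Because every token attends to every other token in one matrix product, the whole sequence is processed at once rather than step by step as in an RNN.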
### **Transformer Architecture**
1. **Input Embeddings** – Converts words into numerical vectors.
2. **Positional Encoding** – Adds sequence-order information, since attention alone is order-agnostic (see the sketch after this list).
3. **Self-Attention Layers** – Determine relationships between words.
4. **Feedforward Layers** – Process the attention outputs position by position.
5. **Final Output Layer** – Produces predictions (e.g., the next word in a sentence).
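The snippet below ties steps 2–4 together, reusing `scaled_dot_product_attention` from the sketch above. It is a simplified single-head encoder step under stated assumptions: the weight names are placeholders, and residual connections, layer normalization, and multi-head splitting are omitted for brevity.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sin, odd dimensions use cos."""
    pos = np.arange(seq_len)[:, None]    # (seq_len, 1)
    i = np.arange(d_model)[None, :]      # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def encoder_step(x, Wq, Wk, Wv, W1, W2):
    """One simplified encoder step: self-attention, then a position-wise feedforward layer."""
    attn_out, _ = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
    return np.maximum(0.0, attn_out @ W1) @ W2  # ReLU feedforward (biases omitted)

# Toy forward pass: 5 tokens, model width 8; random vectors stand in for learned embeddings
rng = np.random.default_rng(1)
d = 8
x = rng.normal(size=(5, d)) + positional_encoding(5, d)   # steps 1 + 2
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
out = encoder_step(x, *params)                             # steps 3 + 4
print(out.shape)  # (5, 8)
```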
### **Why Transformers?**
- **Handle long-range dependencies** better than RNNs.
- **Process the entire input in parallel**, making training faster.
- **Scale** to large datasets and complex models.
### **Applications of Transformers**
- **Language Models** – GPT, BERT, T5.
- **Machine Translation** – Google Translate.
- **Image Processing** – Vision Transformers (ViTs).