Based on the restructured project outline, here's what each student will need to complete and how they should approach their tasks:
## Student 1: Word Representations and Embeddings
**What needs to be completed:**
1. A written explanation of how words are represented as vectors
2. A practical demonstration of word embeddings with similarity calculations
3. Visual representation of these embeddings
**How to complete it:**
1. Study Chapter 6 of [Jurafsky & Martin's textbook](https://web.stanford.edu/~jurafsky/slp3/)
2. Choose 8-10 semantically related words (e.g., "king," "queen," "man," "woman")
3. Use Python with libraries like NumPy (or MATLAB's built-in equivalents) to do the following; a sketch appears after this list:
- Create or extract simple word vectors (could use a small pre-trained set)
- Implement cosine similarity calculations between word pairs
- Generate a 2D visualization using dimensionality reduction (PCA or t-SNE)
4. Document the process, explaining the mathematical concepts involved
5. Show examples of semantic relationships captured in the vector space
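A minimal sketch of step 3, assuming Python with NumPy plus scikit-learn and matplotlib (the latter two are suggestions beyond what the outline names). The four-dimensional vectors are invented for illustration; a small pre-trained set such as a GloVe subset could be substituted.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Toy 4-dimensional word vectors, invented for illustration only.
# In practice, extract these from a small pre-trained set (e.g., GloVe).
vectors = {
    "king":  np.array([0.8, 0.7, 0.1, 0.2]),
    "queen": np.array([0.8, 0.1, 0.7, 0.2]),
    "man":   np.array([0.3, 0.8, 0.1, 0.1]),
    "woman": np.array([0.3, 0.1, 0.8, 0.1]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pairwise similarities between all word pairs.
words = list(vectors)
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        print(f"sim({w1}, {w2}) = {cosine_similarity(vectors[w1], vectors[w2]):.3f}")

# Project the 4-D vectors down to 2-D with PCA and plot.
matrix = np.stack([vectors[w] for w in words])
points = PCA(n_components=2).fit_transform(matrix)
plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.title("Toy word vectors after PCA")
plt.show()
```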
## Student 2: Basic Attention Mechanisms
**What needs to be completed:**
1. An explanation of attention mechanism mathematics
2. A toy example calculation of attention scores and weights
3. Demonstration of how attention creates contextual representations
**How to complete it:**
1. Study "[The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)" blog by Jay Alammar
2. Create a minimal example (sketched in code after this list) with:
- 3-4 artificial "tokens" with simple vector representations
- Defined query, key, and value vectors (can be randomly generated)
- Step-by-step calculation of attention scores (QK^T, conventionally scaled by 1/√d_k)
- Application of softmax to normalize scores into weights
- Final weighted sum to produce attention output
3. Include all mathematical steps with clear notation
4. Visualize the attention weights (e.g., as a heatmap)
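A minimal sketch of the calculation in item 2, assuming randomly generated Q, K, and V vectors (as the outline allows); the 1/√d_k scaling is the standard convention and can be dropped for an even simpler walkthrough.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the walkthrough is reproducible

# 4 artificial "tokens", each with a 3-dimensional Q, K, and V vector.
# Values are random placeholders; real models compute them by applying
# learned weight matrices to the token embeddings.
n_tokens, d_k = 4, 3
Q = rng.normal(size=(n_tokens, d_k))
K = rng.normal(size=(n_tokens, d_k))
V = rng.normal(size=(n_tokens, d_k))

# Step 1: raw attention scores QK^T, scaled by 1/sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)

# Step 2: softmax each row so the weights for one query sum to 1
# (subtracting the row max keeps the exponentials numerically stable).
exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Step 3: each output is a weighted sum of the value vectors.
output = weights @ V

print("scores:\n", scores.round(3))
print("weights:\n", weights.round(3))
print("output:\n", output.round(3))
```

For item 4, `matplotlib.pyplot.imshow(weights)` gives a quick heatmap of the attention weights.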
## Student 3: Model Architecture and Information Flow
**What needs to be completed:**
1. Description of how information flows through transformer layers
2. Step-by-step computation through a simplified network
3. Explanation of how different components connect
**How to complete it:**
1. Study relevant sections of "[Deep Learning](https://www.deeplearningbook.org/)" by Goodfellow, Bengio, and Courville
2. Create a simplified transformer "layer" (see the sketch after this list) with:
- A small input vector (e.g., representing a token)
- Pre-defined weight matrices for transformations
- ReLU activation function implementation
- Forward propagation through multiple layers
3. Trace and document all values at each step of computation
4. Create a flow diagram showing how data transforms through the network
5. Explain the mathematical purpose of each transformation
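One possible shape for the simplified "layer" in step 2, as a sketch: it keeps only the feed-forward path so every number can be traced by hand, whereas real transformer blocks also include attention, residual connections, and layer normalization. All sizes and weight values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    """ReLU activation: zero out negative entries."""
    return np.maximum(0, x)

# A single token represented as a small input vector (values invented).
x = np.array([0.5, -0.2, 0.1, 0.9])

# Pre-defined weight matrices for two successive layers.
W1 = rng.normal(scale=0.5, size=(4, 4))
W2 = rng.normal(scale=0.5, size=(4, 4))

# Forward propagation, printing every intermediate value so each step
# can be documented and checked by hand.
h1_pre = W1 @ x
h1 = relu(h1_pre)
h2_pre = W2 @ h1
h2 = relu(h2_pre)

for name, value in [("input x", x), ("W1 @ x", h1_pre), ("ReLU", h1),
                    ("W2 @ h1", h2_pre), ("output", h2)]:
    print(f"{name}: {value.round(3)}")
```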
## Student 4: Training and Loss Functions
**What needs to be completed:**
1. Explanation of optimization in language models
2. Implementation of basic gradient descent
3. Demonstration of convergence behavior
**How to complete it:**
1. Study relevant sections from "[Mathematics for Machine Learning"](https://mml-book.github.io/)
2. Define a simple loss function (quadratic is suggested)
3. Manually derive the gradient/derivative
4. Implement gradient descent in code (see the sketch after this list):
- Initialize parameters randomly
- Compute loss at each iteration
- Update parameters using the gradient and learning rate
- Track changes over iterations
5. Create a visualization showing how loss decreases over iterations
6. Explain how this connects to training language models
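A sketch covering steps 2 through 5 for a one-dimensional quadratic loss; the target value, learning rate, and iteration count are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Quadratic loss L(w) = (w - 3)^2 with hand-derived gradient dL/dw = 2(w - 3).
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

rng = np.random.default_rng(42)
w = rng.normal()        # random initial parameter
learning_rate = 0.1
history = []

for step in range(50):
    history.append(loss(w))          # track loss at each iteration
    w -= learning_rate * grad(w)     # gradient descent update

print(f"final w = {w:.4f}, final loss = {loss(w):.6f}")

plt.plot(history)
plt.xlabel("iteration")
plt.ylabel("loss")
plt.title("Loss decreasing under gradient descent")
plt.show()
```

The same loop structure (compute loss, compute gradient, update parameters) is what optimizers perform over millions of parameters when training language models, which is the connection item 6 asks for.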
## Student 5: Evaluation and Probability
**What needs to be completed:**
1. Explanation of probabilistic language modeling
2. Calculation of cross-entropy and perplexity metrics
3. Interpretation of these evaluation metrics
**How to complete it:**
1. Study relevant sections from "[Introduction to Statistical Learning](https://www.statlearning.com/)"
2. Create a toy language model example (sketched after this list):
- Define a small vocabulary (5-10 words)
- Assign probability distributions for next-word prediction
- Generate "ground truth" and "predicted" distributions
3. Calculate cross-entropy between distributions
4. Compute perplexity as the exponential of the cross-entropy (using a matching log base)
5. Provide an interpretation of what these values mean
6. Explain how these metrics relate to model performance
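A sketch of steps 2 through 4 with invented probability values; note that perplexity must use the same base as the logarithm in the cross-entropy (natural log and `exp` here).

```python
import numpy as np

# Toy vocabulary and two distributions over the next word. The numbers
# are invented for illustration; any valid probability vectors will do.
vocab = ["the", "cat", "sat", "on", "mat"]
true_dist = np.array([0.1, 0.4, 0.2, 0.1, 0.2])   # "ground truth"
pred_dist = np.array([0.2, 0.3, 0.2, 0.2, 0.1])   # model "prediction"

# Cross-entropy H(p, q) = -sum_i p_i * log q_i (natural log here).
cross_entropy = -np.sum(true_dist * np.log(pred_dist))

# Perplexity is the exponential of cross-entropy (base must match the log).
perplexity = np.exp(cross_entropy)

print(f"cross-entropy: {cross_entropy:.4f} nats")
print(f"perplexity:    {perplexity:.4f}")
```

Lower cross-entropy (and hence lower perplexity) means the predicted distribution sits closer to the ground truth, which is the interpretation items 5 and 6 ask for.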
## Integration Requirements
All students will need to:
1. Participate in weekly check-ins (20 minutes)
2. Contribute to a shared computational notebook
3. Help develop a visual diagram showing how their component connects to others
4. Write their section of the final report (2-3 pages)
5. Review and provide feedback on other sections
The final integrated project should demonstrate how these mathematical concepts work together to form the basis of GPT models, with each student's work flowing naturally into the next component.