%% created:: 2025-02-14 10:02 modified:: 2025-02-14 10:02 %% 2025-02-14 10:02 [summary](https://app.getrecall.ai/share/ed91995b-d63f-5b42-845d-3f9fd8afe3d1) # Deep Dive into LLMs like ChatGPT ![Deep Dive into LLMs like ChatGPT - YouTube](https://youtu.be/7xTGNNLPyMI) SUMMARY This is a comprehensive introduction to large language models (LLMs) like ChatGPT, covering their capabilities, limitations, potential risks, and underlying mechanisms. The content explores the entire pipeline of how LLMs are built, from pre-training through post-training stages, and discusses emerging developments in the field. ### Pre-training and Data - LLM training starts with downloading and processing internet content - Data collection involves filtering through Common Crawl data - Text extraction and cleaning process - Tokenization converts text into sequences of symbols - Neural networks process token sequences ### Model Architecture and Training - Neural networks mix inputs with parameters - Transformer architecture as key component - Inference process generates new text - Models need tokens to process information - Limitations in computational capacity per token ### Post-training and Capabilities - Supervised fine-tuning with conversation data - Reinforcement learning techniques - Tool use and knowledge integration - Emergence of "thinking" capabilities - Handling of various domains ### Current Limitations and Future Directions - Hallucinations and accuracy issues - Jagged intelligence across different tasks - Development of multimodal capabilities - Integration of agents and automation - Ongoing research and improvements ### Development and Access - Various providers and model types - Open source vs proprietary models - Leaderboards and evaluation methods - Resources for staying informed - Local vs cloud deployment options | Stage | Purpose | Key Components | | ---------------------- | ----------------------------- | ---------------------------------- | | Pre-training | Build knowledge base | Internet data, tokenization | | Fine-tuning | Create assistant capabilities | Conversation data, human feedback | | Reinforcement Learning | Improve reasoning | Practice problems, reward modeling | | Deployment | User interaction | Inference, tool integration | --- --- --- ## [](#introduction-\(00%3A00%3A00\))introduction [(00:00:00)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=0s) - This video aims to provide a comprehensive and general audience introduction to large language models like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT), covering their capabilities, limitations, and potential risks [(00:00:00)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=0s). - The goal is to give viewers mental models for understanding what large language models are, how they work, and what they can and cannot do [(00:00:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11s). - Large language models are described as magical and amazing in some respects, but also have limitations and sharp edges that users should be aware of [(00:00:17)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=17s). - The video will explore the entire pipeline of how large language models are built, including what happens when users input text and how the models generate responses [(00:00:40)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=40s). - The explanation will be kept accessible to a general audience, covering topics such as the cognitive and psychological implications of these tools [(00:00:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=45s). 
- The video will specifically look at how to build a [Large language model](https://en.wikipedia.org/wiki/Large_language_model) like ChatGPT, exploring its underlying mechanisms and capabilities [(00:00:59)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=59s). ## [](#pretraining-data-\(internet\)-\(00%3A01%3A00\))pretraining data (internet) [(00:01:00)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=60s) - The process of training large language models (LLMs) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) involves multiple stages arranged sequentially, starting with the pre-training stage, which includes downloading and processing the internet to obtain a large quantity of high-quality documents with diverse content [(00:01:02)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=62s). - The goal of this stage is to collect a huge amount of text from publicly available sources, such as the internet, and to achieve a large diversity of high-quality documents [(00:01:41)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=101s). - Companies like [Hugging Face](https://en.wikipedia.org/wiki/Hugging_Face) have created and curated datasets, such as the Fine Web dataset, which is a collection of high-quality documents from the internet [(00:01:16)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=76s). - The Fine Web dataset is a representative example of what major [Large language model](https://en.wikipedia.org/wiki/Large_language_model) providers like [OpenAI](https://en.wikipedia.org/wiki/OpenAI), [Anthropic](https://en.wikipedia.org/wiki/Anthropic), and [Google](https://en.wikipedia.org/wiki/Google) use internally, and it contains around 44 terabytes of data [(00:02:13)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=133s). - This dataset is not extremely large, and it can fit on a single hard drive or a USB stick, despite being a massive collection of text data [(00:02:20)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=140s). - The data collection process starts with Common Crawl, an organization that has been indexing the internet since 2007 and has crawled over 2.7 billion web pages as of 2024 [(00:02:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=172s). - The Common Crawl data is raw and undergoes multiple filtering stages, including URL filtering, which eliminates unwanted websites such as malware, spam, and adult sites [(00:03:27)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=207s). - The text extraction stage involves removing markup and other computer code from the raw HTML of web pages to obtain clean text data [(00:04:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=246s). - The goal is to extract just the text from web pages, excluding navigation and other unwanted content, which requires filtering and processing to obtain good content from the web pages [(00:04:33)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=273s). - Language filtering is applied to guess the language of each web page, and only pages with more than 65% of a specific language, such as [English language](https://en.wikipedia.org/wiki/English_language), are kept in the dataset [(00:04:50)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=290s). - This is a design decision that different companies can take, deciding what fraction of languages to include in their dataset, which affects the model’s performance in different languages; a minimal sketch of this language-filtering step follows below [(00:04:56)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=296s).
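As a rough illustration, the language-filtering step can be sketched with the fastText language classifier, in the spirit of what FineWeb describes. This is a minimal sketch, assuming the pretrained `lid.176.bin` model file has been downloaded from the fastText site; `keep_page` is an illustrative helper name, not part of any real pipeline:

```python
import fasttext

# Assumed setup: the 176-language fastText ID model has been downloaded locally.
model = fasttext.load_model("lid.176.bin")

def keep_page(text: str, lang: str = "en", threshold: float = 0.65) -> bool:
    """Keep a page only if the top predicted language matches `lang` with
    probability >= threshold (the 65% English cutoff described above)."""
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == f"__label__{lang}" and probs[0] >= threshold
```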
- Fine Web is focused on English, so their language model will be good at English but may not perform well in other languages [(00:05:28)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=328s). - After language filtering, other filtering steps are applied, including duplication removal and personally identifiable information (PII) removal, such as addresses and [Social Security (United States)](https://en.wikipedia.org/wiki/Social_Security_\(United_States\)) numbers [(00:05:39)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=339s). - The pre-processing stages result in a dataset like Fine Web, which can be downloaded from the [Hugging Face](https://en.wikipedia.org/wiki/Hugging_Face) web page and contains examples of final text that ends up in the training set [(00:06:08)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=368s). - The dataset consists of a large amount of text, 44 terabytes, which is the starting point for the next step of training neural networks on this data [(00:06:56)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=416s). - The text data has patterns, and the goal is to train neural networks to internalize and model how the text flows, so that the neural nets can generate text that mimics it [(00:07:28)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=448s). ## [](#tokenization-\(00%3A07%3A47\))tokenization [(00:07:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=467s) - Before feeding text into neural networks, it is necessary to decide how to represent the text and how to feed it in, as the technology expects a one-dimensional sequence of symbols with a finite set of possible symbols [(00:07:48)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=468s). - The text is initially a one-dimensional sequence, and when represented using [UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding, it can be broken down into raw bits that correspond to the text in the computer [(00:08:35)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=515s). - The resulting sequence of bits represents the text exactly, but having only two possible symbols (0 and 1) results in extremely long sequences, which is not desirable [(00:09:04)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=544s). - To address this, the symbol size (vocabulary) and sequence length are traded off, resulting in more symbols and shorter sequences [(00:09:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=572s). - One way to compress the sequence is to group consecutive bits (e.g., eight bits) into a single byte, reducing the sequence length by eight times but increasing the number of possible symbols to 256 [(00:09:51)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=591s). - In production, state-of-the-art language models use the [Byte pair encoding](https://en.wikipedia.org/wiki/Byte_pair_encoding) algorithm to further shrink the sequence length by grouping common consecutive bytes or symbols into new symbols, increasing the symbol size and vocabulary [(00:11:00)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=660s). - A suitable vocabulary size for large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) is around 100,000 possible symbols, with [GPT-4](https://en.wikipedia.org/wiki/GPT-4) using 100,277 symbols [(00:11:48)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=708s). - The process of converting raw text into these symbols or tokens is called tokenization, demonstrated in the short code example below [(00:12:07)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=727s).
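The same tokenizer can be used programmatically via OpenAI's open-source tiktoken library; a minimal example with the cl100k_base encoding (the GPT-4 tokenizer discussed next):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 tokenizer

print(enc.n_vocab)                 # 100277 possible tokens
print(enc.encode("hello"))         # [15339] -- "hello" is a single token
print(enc.encode("Hello"))         # a different ID: tokenization is case-sensitive
print(enc.encode("hello  world"))  # extra spaces change the token sequence
print(enc.decode([15339]))         # "hello"
```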
- Tokenization can be explored using websites like TikTokenizer, which allows users to input text and see its tokenization [(00:12:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=741s). - The GPT-4 base model tokenizer, cl100k_base, can be used to explore token representations on TikTokenizer [(00:12:27)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=747s). - Tokenization can result in a sequence of tokens, with each token having a unique ID, such as the token “hello” with ID 15339 [(00:12:49)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=769s). - The tokenization process can be affected by spacing, with multiple spaces resulting in different tokens [(00:13:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=791s). - The tokenization process is case-sensitive and can result in different tokens for the same text with different cases or spacing [(00:13:24)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=804s). - [GPT-4](https://en.wikipedia.org/wiki/GPT-4) sees text as a one-dimensional sequence of symbols, with each sequence having a length corresponding to the number of tokens [(00:14:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=841s). ## [](#neural-network-i%2Fo-\(00%3A14%3A27\))neural network I/O [(00:14:27)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=867s) - A sequence of text from a dataset is re-represented using a tokenizer into a sequence of tokens, where each token is a unique ID that represents a little text chunk, also known as an “atom” of the sequence [(00:14:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=872s). - The dataset mentioned is the Fine Web dataset, which occupies 44 terabytes of disk space and forms a 15 trillion token sequence [(00:14:43)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=883s). - The neural network training is where the heavy lifting happens computationally, and the goal is to model the statistical relationships of how tokens follow each other in the sequence [(00:15:19)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=919s). - To achieve this, windows of tokens are taken from the data, and the window length can range from zero to a maximum size, such as 8,000 tokens [(00:15:57)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=957s). - The neural network is trained to predict the token that comes next in the sequence, given a context of previous tokens [(00:16:41)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1001s). - The input to the neural network is a sequence of tokens of variable length, and the output is a prediction for what comes next, represented as a probability distribution over the vocabulary of 100,277 possible tokens [(00:17:14)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1034s). - The neural network is initially randomly initialized, resulting in random probabilities for the output, but it is trained to improve its predictions over time [(00:17:39)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1059s). - The training process involves comparing the predicted probabilities with the actual label, which is the correct token that comes next in the sequence [(00:18:16)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1096s). - A mathematical process is used to update the neural network, allowing for the tuning of the model to achieve desired outcomes, such as increasing the probability of a specific token from 3% to a higher value; a code sketch of this update step follows below [(00:18:22)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1102s).
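A minimal PyTorch sketch of this input/output setup and update step. The tiny embedding-plus-linear model and random token stream are stand-ins of my own; a real system trains a Transformer on the actual tokenized dataset, with a far longer context:

```python
import torch
import torch.nn.functional as F

vocab_size, context_len, batch_size = 100_277, 8, 32   # 8 here; ~8,000 in practice
tokens = torch.randint(0, vocab_size, (1_000_000,))    # stand-in for real data

# Toy stand-in for a Transformer: embed each token, map to logits over the vocab.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 64),
    torch.nn.Linear(64, vocab_size),
)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):
    ix = torch.randint(0, len(tokens) - context_len - 1, (batch_size,))
    x = torch.stack([tokens[i : i + context_len] for i in ix])           # contexts
    y = torch.stack([tokens[i + 1 : i + context_len + 1] for i in ix])   # next tokens
    logits = model(x)                                    # (batch, context, vocab)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()   # the "nudge": correct next tokens get slightly higher probability
```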
- The process involves mathematically calculating how to adjust and update the neural network so that the correct answer has a higher probability, and this is done by nudging the model to give a higher probability to the correct token that comes next in the sequence [(00:18:44)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1124s). - When the neural network is updated, the next time a particular sequence of tokens is fed into the model, it will be slightly adjusted, resulting in a higher probability for the correct token and lower probabilities for other tokens [(00:18:53)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1133s). - This process happens not just for a single token, but for all tokens in the entire data set, and in practice, little windows or batches of tokens are sampled and adjusted in parallel to update the neural network [(00:19:29)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1169s). - The goal of this process is to train the neural network so that its predictions match the statistics of what actually happens in the training set, and its probabilities become consistent with the statistical patterns of how tokens follow each other in the data [(00:19:55)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1195s). - The training process involves a sequence of updates to the neural network, allowing it to learn from the data and make predictions that are consistent with the patterns and statistics of the training set [(00:19:58)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1198s). ## [](#neural-network-internals-\(00%3A20%3A11\))neural network internals [(00:20:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1211s) - Neural networks have inputs that are sequences of tokens, which can range from zero to 8,000 tokens in principle, but are cropped at a certain length due to computational expenses, becoming the maximum context length of the model [(00:20:17)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1217s). - These inputs are mixed with the parameters or weights of the neural networks in a giant mathematical expression, with modern neural networks having billions of parameters that are initially set randomly [(00:20:48)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1248s). - The process of iteratively updating the network, known as training a neural network, adjusts the parameters to make the outputs consistent with the patterns seen in the training set [(00:21:22)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1282s). - The parameters can be thought of as knobs on a DJ set, with training meaning discovering a setting of parameters that seems consistent with the statistics of the training set [(00:21:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1296s). - The mathematical expression is massive, with trillions of terms, but can be broken down into simple operations like multiplication, addition, and exponentiation [(00:22:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1321s). - Neural network architecture research focuses on designing effective mathematical expressions that are expressive, optimizable, and parallelizable [(00:22:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1352s). - The goal is to optimize the parameters of the neural network so that the predictions come out consistent with the training set [(00:22:54)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1374s). 
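As a toy illustration of such a "giant mathematical expression," here is a two-layer slice written out explicitly; the shapes and random "knob" values are arbitrary, and a real network chains billions of these same simple operations:

```python
import torch

x = torch.randn(8)                             # inputs (e.g. token embeddings)
W1, b1 = torch.randn(16, 8), torch.randn(16)   # parameters: the "knobs"
W2, b2 = torch.randn(4, 16), torch.randn(4)

h = torch.relu(W1 @ x + b1)                    # multiplications and additions
logits = W2 @ h + b2
probs = torch.exp(logits) / torch.exp(logits).sum()  # exponentiation + normalization
```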
- A production-grade example of a neural network is the [Transformer (deep learning architecture)](https://en.wikipedia.org/wiki/Transformer_\(deep_learning_architecture\)), which has a special structure; production models have billions of parameters, while the small demonstration network shown has roughly 85,000 parameters [(00:23:24)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1404s). - The Transformer takes input token sequences and produces output predictions through a sequence of transformations and intermediate values [(00:23:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1416s). - Large Language Models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) work by predicting what comes next in a sequence of tokens, which are embedded into a distributed representation as vectors within a neural network [(00:23:58)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1438s). - The embedded tokens flow through a series of mathematical expressions, including layer norms, matrix multiplications, and softmax functions, to produce intermediate values that can be thought of as the “firing rates” of synthetic neurons [(00:24:18)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1458s). - However, these synthetic neurons are extremely simple compared to biological neurons, lacking memory and complexity, and should not be thought of as direct analogues [(00:24:50)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1490s). - The neural network is a stateless, fixed mathematical expression that transforms inputs into outputs based on a set of parameters, which are adjusted to produce accurate predictions that match the patterns seen in the training set [(00:25:33)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1533s). - The [Transformer (deep learning architecture)](https://en.wikipedia.org/wiki/Transformer_\(deep_learning_architecture\)) architecture is a key component of LLMs, consisting of an attention block and a multi-layer perceptron block, which work together to produce predictions [(00:24:27)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1467s). - The Transformer is a mathematical function parameterized by a fixed set of parameters, and its behavior is determined by the values of these parameters, which are adjusted during training to produce accurate predictions [(00:25:35)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1535s). ## [](#inference-\(00%3A26%3A01\))inference [(00:26:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1561s) - Inference is a major stage of working with neural networks, where new data is generated from the model to see what kind of patterns it has internalized in its parameters [(00:26:05)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1565s). - To generate from the model, a prefix or starting token is fed into the network, which gives a probability vector, allowing for the sampling of a token based on the probability distribution [(00:26:26)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1586s). - The tokens with high probability are more likely to be sampled, and this process can be thought of as flipping a biased coin [(00:26:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1612s). - The sampled token is then appended to the sequence, and the process is repeated to generate the next token, with the model predicting from the distributions one at a time; the code sketch below mirrors this loop [(00:27:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1656s).
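A minimal sketch of this generation loop; it assumes any next-token model that returns per-position logits, such as the toy one sketched earlier:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prefix: list[int], n_new: int) -> list[int]:
    tokens = list(prefix)
    for _ in range(n_new):
        x = torch.tensor([tokens])                    # (1, sequence_length)
        logits = model(x)[0, -1]                      # prediction at the last position
        probs = F.softmax(logits, dim=-1)             # distribution over the vocab
        next_id = torch.multinomial(probs, 1).item()  # the biased coin flip
        tokens.append(next_id)                        # feed it back in and repeat
    return tokens
```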
- This process is stochastic, meaning that the model is sampling and flipping coins, which can result in reproducing small chunks of the training data or generating new token streams that are very different from the training documents [(00:28:20)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1700s). - The generated token streams will statistically have similar properties to the training data but are not identical, and can be thought of as “remixes” or “inspired by” the training data [(00:29:00)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1740s). - The model’s predictions are based on the context window and the probability of certain tokens following others, which can result in the generation of new and unique sequences [(00:29:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1761s). - Inference involves continuously feeding back tokens and getting the next one, with the model always “flipping coins” and depending on the luck of the sampling process [(00:29:35)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1775s). - Large Language Models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) can produce different outputs on each run depending on how they sample from the probability distributions during inference [(00:29:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1785s). - In most common scenarios, a pre-processing step involves downloading the internet and tokenizing it, which is done only once, and then the token sequence is used to train networks [(00:29:51)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1791s). - In practical cases, multiple networks with different settings, arrangements, and sizes are trained, and once a satisfactory set of parameters is obtained, the model can be used for inference [(00:30:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1806s). - Inference involves generating data from the model using a trained set of parameters, which is what happens when interacting with a model on ChatGPT [(00:30:24)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1824s). - The models behind ChatGPT were trained by [OpenAI](https://en.wikipedia.org/wiki/OpenAI) months ago and have a fixed set of weights that work well; when interacting with the model, it is only doing inference without any further training [(00:30:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1831s). - The model on ChatGPT generates text by completing token sequences based on the input tokens provided to it [(00:30:51)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1851s). - The model on [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) only performs inference and does not undergo any further training when in use [(00:30:57)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1857s). ## [](#gpt-2%3A-training-and-inference-\(00%3A31%3A09\))GPT-2: training and inference [(00:31:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1869s) - GPT stands for Generative Pre-trained [Transformer (deep learning architecture)](https://en.wikipedia.org/wiki/Transformer_\(deep_learning_architecture\)), and [GPT-2](https://en.wikipedia.org/wiki/GPT-2) is the second iteration of the GPT series by OpenAI, published in 2019. [(00:31:14)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1874s) - GPT-2 was the first recognizably modern stack to come together, with all its components still recognizable by modern standards, although everything has gotten bigger since then.
[(00:31:39)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1899s) - GPT-2 was a Transformer neural network with 1.6 billion parameters, which is significantly smaller than modern Transformers that have closer to a trillion or several hundred billion parameters. [(00:32:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1926s) - The maximum context length of GPT-2 was 1,024 tokens, which is tiny by modern standards, where context lengths can be closer to a couple hundred thousand or even a million tokens. [(00:32:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1941s) - GPT-2 was trained on approximately 100 billion tokens, which is also fairly small by modern standards, with datasets like the Fine Web dataset having 15 trillion tokens. [(00:33:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=1981s) - The cost of training GPT-2 in 2019 was estimated to be approximately $40,000, but it can be done significantly cheaper today, with a reproduction attempt taking about one day and $600, and potentially even lower costs with more effort. [(00:33:33)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2013s) - The cost reduction is due to improved data sets, better data filtering and preparation, faster hardware, and more efficient software for running these models. [(00:33:55)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2035s) - The reproduction attempt of [GPT-2](https://en.wikipedia.org/wiki/GPT-2) was done as part of a project called llm.c, and the details can be found in a post on [GitHub](https://en.wikipedia.org/wiki/GitHub) in the llm.c repository. [(00:33:27)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2007s) - A GPT-2 model is being trained, and every line in the training process represents one update to the model, where the neural network’s parameters are changed to improve its prediction of the next token in a sequence [(00:34:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2092s). - Each update improves the prediction on 1 million tokens in the training set, and the process is repeated for a total of 32,000 steps, resulting in the processing of approximately 33 billion tokens [(00:35:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2132s). - The loss, a single number that measures the neural network’s performance, is being closely monitored, and a low loss indicates better performance [(00:35:55)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2155s). - The training process involves processing 1 million tokens per update, with each update taking around 7 seconds, and the current progress is only about 1% complete after 420 steps [(00:36:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2196s). - Every 20 steps, the model performs inference, predicting the next token in a sequence, and the results are still incoherent at this early stage of training [(00:37:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2229s). - The model’s predictions are not yet coherent, but they show some local coherence, and the results improve slightly as the training progresses [(00:37:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2257s). - At the beginning of the training process, the model’s predictions were completely random, but after 20 steps of optimization, the results started to show some structure [(00:38:10)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2290s).
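A quick back-of-the-envelope check of the numbers quoted above; the result also agrees with the "day or two" of training mentioned next:

```python
steps, tokens_per_step, secs_per_step = 32_000, 1_000_000, 7
print(f"{steps * tokens_per_step:,} tokens processed")  # 32,000,000,000 ≈ ~33 billion
print(f"{steps * secs_per_step / 3600 / 24:.1f} days")  # ≈ 2.6 days of training
```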
- The model is improving over time, and after running for about a day or two, it starts generating fairly coherent [English language](https://en.wikipedia.org/wiki/English_language), with tokens streaming coherently and the generated English steadily improving [(00:38:44)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2324s). - The model is not run on a laptop due to its large size and the need for a lot of data, making it too expensive to run on a personal computer [(00:39:16)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2356s). - The model is run on a computer in the cloud, specifically an 8X H100 node, which is rented from a company like Lambda [(00:39:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2372s). - The 8X H100 node is a single computer with eight GPUs ([Graphics processing unit](https://en.wikipedia.org/wiki/Graphics_processing_unit)), and renting it costs $3 per GPU per hour [(00:40:10)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2410s). - H100 GPUs are designed for training neural networks, a workload that fits them perfectly because it is computationally expensive and highly parallelizable [(00:40:34)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2434s). - Multiple H100 GPUs can be stacked together to form a single node, and multiple nodes can be stacked to form an entire data center or system [(00:40:58)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2458s). - The demand for H100 GPUs has driven [Nvidia](https://en.wikipedia.org/wiki/Nvidia)’s market capitalization to $3.4 trillion, with big tech companies desiring them to train language models [(00:41:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2497s). - The computational workflow of training language models is extremely expensive, and the more GPUs available, the faster the process [(00:42:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2526s). - The process of training large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) involves predicting the next token in a sequence to improve the network, and having more GPUs allows for faster processing of tokens, faster iteration, and training of bigger networks [(00:42:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2529s). - The use of multiple machines, specifically GPUs, is crucial in this process, as they enable faster processing and iteration, making it a significant aspect of LLM development [(00:42:19)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2539s). - The scale of resources required for LLM training is substantial, as exemplified by Elon Musk’s acquisition of 100,000 GPUs in a single data center, which are expensive and power-hungry, solely dedicated to predicting the next token in a sequence and improving the network [(00:42:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2551s). - The ultimate goal of this process is to generate more coherent text at a faster rate, which is expected to be achieved with the continued advancement of LLMs [(00:42:49)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2569s).
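The quoted rental price is also consistent with the roughly $600, one-day GPT-2 reproduction mentioned earlier:

```python
gpus, dollars_per_gpu_hour, hours = 8, 3.00, 24
print(gpus * dollars_per_gpu_hour * hours)  # 576.0 dollars for a day on an 8X H100 node
```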
## [](#llama-3.1-base-model-inference-\(00%3A42%3A52\))Llama 3.1 base model inference [(00:42:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2572s) - Training a [Large language model](https://en.wikipedia.org/wiki/Large_language_model) (LLM) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) requires significant computational resources, which can be expensive, but big tech companies often train and release these models, making them available for use [(00:42:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2572s). - A base model is a token simulator that creates remixes of internet text, but it is not useful on its own as it does not respond to questions or provide answers [(00:43:27)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2607s). - Base models are not often released, but a few examples include the [GPT-2](https://en.wikipedia.org/wiki/GPT-2) model, which was released in 2019 with 1.5 billion parameters, and the [Llama (language model)](https://en.wikipedia.org/wiki/Llama_\(language_model\)) 3 model, which was released by [Meta Platforms](https://en.wikipedia.org/wiki/Meta_Platforms) with 405 billion parameters [(00:44:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2641s). - To release a model, two things are needed: the [Python (programming language)](https://en.wikipedia.org/wiki/Python_\(programming_language\)) code that describes the sequence of operations in the model, and the parameters, which are the actual values that make up the model [(00:44:18)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2658s). - The parameters of a model are the most valuable part, as they contain the precise settings that make the model work well, and they can be very large, such as the 1.5 billion parameters in the GPT-2 model [(00:45:05)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2705s). - The LLaMA 3 model is a much larger and more modern model than GPT-2, with 405 billion parameters and trained on 15 trillion tokens, and it was released by Meta along with a paper that provides more details [(00:45:58)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2758s). - In addition to the base model, Meta also released an instruct model, which is a version of the model that has been fine-tuned to respond to instructions and provide answers [(00:46:35)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2795s). - A base model, such as the 405b Llama 3.1, can be interacted with on the Hyperbolic website, allowing users to input a prefix and generate a continuation of up to a specified number of tokens, in this case, 128 tokens (a programmatic sketch of the same interaction appears below) [(00:47:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2829s). - The base model is not yet an assistant and does not provide direct answers to questions, but instead acts as a token autocomplete, generating the next token in a sequence based on the statistics of its training data [(00:47:49)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2869s). - The model’s output is stochastic, meaning that it starts from scratch each time it is given a prefix and generates a different answer due to sampling from a probability distribution [(00:48:58)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2938s). - Despite not being useful as a standalone assistant, the base model has learned a significant amount about the world through its training data and has stored this knowledge in its parameters, which can be thought of as a compression of the internet and web pages it was trained on [(00:49:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=2992s).
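A hedged sketch of the same interaction done programmatically, assuming a hosting provider with an OpenAI-compatible text-completions endpoint; the base URL, API key, and model identifier below are placeholders, not real values, so check the provider's documentation:

```python
from openai import OpenAI

# Placeholders: substitute your provider's real base URL, key, and model id.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

resp = client.completions.create(
    model="llama-3.1-405b-base",              # hypothetical model identifier
    prompt="The top landmarks to see in Paris are",
    max_tokens=128,
    temperature=1.0,                          # stochastic: rerunning gives new text
)
print(resp.choices[0].text)                   # a continuation, not an "answer"
```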
- The 405 billion parameters of the model can be seen as a kind of compression of the knowledge it has learned, allowing it to generate text based on the patterns and statistics it has discovered in its training data [(00:50:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3011s). - Large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) can be thought of as a compressed version of the internet, with 405 billion parameters, but it’s a lossy compression, meaning some information is lost, and we’re left with a vague recollection of the internet that can be used to generate new content [(00:50:16)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3016s). - By prompting the base model, we can elicit some of the knowledge hidden in the parameters, but the information may not be fully trustworthy since it’s based on the model’s recollection of internet documents [(00:50:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3031s). - The model’s knowledge is vague, probabilistic, and statistical, with frequently occurring information more likely to be remembered correctly [(00:51:17)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3077s). - The model can generate lists of information, such as the top landmarks to see in [Paris](https://en.wikipedia.org/wiki/Paris), but the accuracy of the information cannot be guaranteed [(00:50:41)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3041s). - The model can also memorize and regurgitate large chunks of text, such as [Wikipedia](https://en.wikipedia.org/wiki/Wikipedia) entries, which can be undesirable in the final model [(00:52:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3143s). - This memorization occurs because the model is trained on high-quality sources like Wikipedia and may preferentially sample from these sources, leading to regurgitation of the original text [(00:53:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3211s). - The model’s behavior is not always precise and exact, but rather approximate and probabilistic, with the goal of generating new content based on the patterns and relationships learned from the training data [(00:51:48)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3108s). - Large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) can efficiently memorize and recite text after seeing it multiple times, similar to how humans can recite text after reading it many times, but with greater efficiency [(00:53:50)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3230s). - LLMs can generate text based on patterns and knowledge learned from their training data, even if they haven’t seen specific information before, such as events after their knowledge cutoff date [(00:54:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3261s). - When given a prompt with information from the future, LLMs will continue the token sequence and make educated guesses based on their knowledge, creating a kind of “parallel universe” scenario [(00:54:43)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3283s). - This process is called “hallucination,” where the model takes its best guess in a probabilistic manner [(00:55:49)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3349s).
- Even base models can be utilized in practical applications with clever prompt design, such as using a few-shot prompt to enable the model to learn and apply patterns in context [(00:56:04)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3364s). - LLMs have in-context learning abilities, which allow them to recognize patterns in the input data and continue them, making them useful for tasks like translation [(00:56:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3391s). - By constructing a few-shot prompt, it’s possible to build apps that leverage the model’s in-context learning ability, even with a base model [(00:57:14)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3434s). - It’s also possible to instantiate a whole language model assistant just by prompting, using clever techniques to unlock the model’s capabilities [(00:57:15)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3435s). - To create a prompt for a [Large language model](https://en.wikipedia.org/wiki/Large_language_model) (LLM) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT), a conversation between a helpful AI assistant and a human can be structured to look like a web page, allowing the model to continue the conversation [(00:57:24)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3444s). - The prompt was created using ChatGPT itself, which wrote a conversation between an AI assistant and a human, describing the AI assistant as knowledgeable, helpful, and capable of answering a wide variety of questions [(00:57:50)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3470s). - The prompt includes a few turns of conversation between the human and the AI assistant, followed by the actual query, which in this case is “why is the sky blue” [(00:58:15)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3495s). - By copying and pasting the prompt into the base model, the model continues the sequence and takes on the role of the AI assistant, providing an answer to the query [(00:58:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3501s). - The base model’s response is based on the phenomenon of [Rayleigh scattering](https://en.wikipedia.org/wiki/Rayleigh_scattering), which causes the sky to appear blue [(00:58:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3517s). - The model continues the conversation by hallucinating the next question from the human, allowing it to keep going on and on [(00:58:55)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3535s). - This method can be used to create an assistant even with a base model, by structuring the prompt in a way that allows the model to take on the role of the AI assistant [(00:59:16)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3556s). - Without this structured prompt, the base model would not be able to provide a coherent answer to the query, and would instead provide unrelated responses [(00:59:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3549s). ## [](#pretraining-to-post-training-\(00%3A59%3A23\))pretraining to post-training [(00:59:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3563s) - The pre-training stage of training language model (LM) assistants like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) involves taking Internet documents, breaking them into tokens, and predicting token sequences using neural networks, resulting in a base model that can simulate Internet documents on the token level [(00:59:34)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3574s).
- The base model can generate token sequences with the same statistics as Internet documents and can be used in some applications, but it needs to be improved to function as an assistant that can answer questions [(01:00:12)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3612s). - The post-training stage is necessary to turn the base model into an assistant, and it involves handing off the base model to post-training, which is computationally less expensive than pre-training [(01:00:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3621s). - The post-training stage is still extremely important, as it transforms the language model into an assistant, and it involves methods to get the model to give answers to questions rather than just sampling Internet documents [(01:00:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3652s). - The pre-training stage is the most computationally intensive part of the process, requiring massive data centers and heavy compute, whereas the post-training stage is much cheaper but still crucial [(01:00:40)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3640s). - The goal of the post-training stage is to fine-tune the model to perform specific tasks, such as answering questions, and to make it more useful as an assistant [(01:00:57)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3657s). ## [](#post-training-data-\(conversations\)-\(01%3A01%3A06\))post-training data (conversations) [(01:01:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3666s) - Conversations with Large Language Models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) are multi-turn, meaning they can have multiple turns, and are typically between a human and an assistant [(01:01:07)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3667s). - These conversations can involve simple math, such as responding to “what is 2 plus 2” with “2 plus 2 is 4”, and can also involve more complex interactions, such as refusing to help with certain requests [(01:01:22)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3682s). - The assistant can also have a personality and respond in a friendly manner [(01:01:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3697s). - To program the assistant’s behavior in these conversations, data sets of conversations are created, which can include hundreds of thousands of conversations that cover a diverse range of topics [(01:02:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3741s). - These data sets are created by human labelers who provide the ideal assistant response in a given conversational context [(01:02:51)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3771s). - The model is then trained on these data sets to imitate the responses of human labelers [(01:03:08)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3788s). - The base model is first trained on internet documents, and then this data set is replaced with a new data set of conversations, and the model continues to train on these conversations [(01:03:16)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3796s). - The model rapidly adjusts to the new data set and learns the statistics of how the assistant responds to human queries [(01:03:35)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3815s). - During inference, the assistant can be primed to respond to human queries in a way that imitates human labelers; an illustrative conversation example is sketched below [(01:03:48)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3828s).
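An illustrative example of what one datum in such a conversation dataset might look like; the field names are generic rather than any specific vendor's schema, and the exchange mirrors the simple-math example above:

```python
example = {
    "messages": [
        {"role": "user", "content": "What is 2 plus 2?"},
        {"role": "assistant", "content": "2 plus 2 is 4."},
    ]
}
```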
- The post-training stage, where the model is fine-tuned on the conversation data set, is typically much shorter than the pre-training stage, which can take around three months [(01:04:05)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3845s). - The training data for a [Large language model](https://en.wikipedia.org/wiki/Large_language_model) (LLM) is much smaller than the dataset of text on the internet, making the training process relatively short [(01:04:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3863s). - To train an LLM on conversations, the conversations need to be represented as token sequences, which requires designing an encoding system with rules and protocols for structuring the data [(01:05:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3901s). - The encoding system for conversations involves turning the conversation into a one-dimensional sequence of tokens, with different LLMs having slightly different formats or protocols [(01:06:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3971s). - For example, [GPT-4o](https://en.wikipedia.org/wiki/GPT-4o) uses a special token called “<|im_start|>” (for “imaginary monologue start”) to indicate the start of a turn, followed by a token specifying whose turn it is, and then the tokens of the question or message [(01:06:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3996s). - The “<|im_start|>” token is not text and has never been trained on before, but is introduced in a post-training stage to help the model learn to recognize the start of a turn and who is speaking [(01:07:20)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4040s). - Special tokens like “<|im_start|>” and “<|im_end|>” are interspersed with text to help the model learn to recognize the structure of conversations and who is speaking [(01:07:34)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4054s). - The conversation is represented as a sequence of tokens, with the user’s question or message being followed by a token indicating the end of the turn [(01:07:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4029s). - The token sequence for a conversation can be relatively long, with a simple conversation between a user and an assistant resulting in a sequence of 49 tokens [(01:06:19)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=3979s). - Conversations are turned into a sequence of tokens, which is a one-dimensional sequence that can be used to train a language model, allowing for the prediction of the next token in a sequence [(01:08:16)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4096s). - During inference, a model trained on conversations can be used to generate responses to user input, with the model sampling from possible tokens to create a response [(01:09:27)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4167s). - The protocol for generating responses involves constructing a context and then using the language model to sample from possible tokens to create a response [(01:09:24)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4164s). - The details of the protocol are not important, but rather the fact that conversations can be turned into a one-dimensional sequence of tokens that can be used to train a language model, as the short sketch below illustrates [(01:09:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4192s).
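A minimal sketch of such an encoding, written with plain strings for readability. The exact special-token strings and IDs vary by model, and a real tokenizer maps each “<|im_start|>”-style delimiter to a single dedicated token ID rather than spelling it out character by character:

```python
def render_conversation(messages: list[dict]) -> str:
    """Flatten a conversation into one 1-D stream using ChatML-style delimiters."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # leave the assistant's turn open
    return "".join(parts)

print(render_conversation([{"role": "user", "content": "What is 2 plus 2?"}]))
```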
- The paper “InstructGPT” by [OpenAI](https://en.wikipedia.org/wiki/OpenAI) in 2022 was the first to discuss fine-tuning language models on conversations, and it describes the use of human contractors to construct conversations [(01:10:20)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4220s). - Human labelers were hired to create prompts and ideal assistant responses, which were used to train the language model [(01:10:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4247s). - The prompts created by human labelers include a wide range of topics, such as regaining enthusiasm for a career, science fiction book recommendations, and language translation [(01:11:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4266s). - The process of creating ideal assistant responses for language models like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) involves human labelers who are given labeling instructions by the company developing the model, such as OpenAI, to create helpful, truthful, and harmless responses [(01:11:56)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4316s). - These labeling instructions are usually extensive, with hundreds of pages, and human labelers must study them professionally to write ideal assistant responses [(01:12:18)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4338s). - The data set for InstructGPT was not released by OpenAI, but open-source reproductions have been created to follow a similar setup and collect data [(01:12:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4352s). - One example of an open-source reproduction is the effort of Open Assistant, where people on the internet were asked to create conversations similar to those created by [OpenAI](https://en.wikipedia.org/wiki/OpenAI) with human labelers [(01:12:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4365s). - The conversations created by human labelers or internet users consist of a prompt, an ideal assistant response, and potentially a continuation of the conversation [(01:13:17)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4397s). - The goal of training a language model on these conversations is to create a model that takes on the persona of a helpful, truthful, and harmless assistant, programmed by example [(01:14:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4446s). - During training, the model starts to understand the statistical pattern of the example behaviors and takes on the personality of the assistant, allowing it to respond to new prompts in a similar way [(01:14:22)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4462s). - While it’s possible for the model to recite an exact answer from the training set, it’s more likely to create a response that captures the vibe of the desired answer [(01:14:38)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4478s). - Large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) are programmed by example, adopting a statistically generated persona of a helpful, truthful, and harmless assistant, as reflected in the labeling instructions created by companies [(01:14:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4492s). - The state-of-the-art in LLMs has advanced in the last 2-3 years, with humans no longer doing all the heavy lifting alone, and LLMs helping create data sets and conversations [(01:15:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4509s).
- It is rare for people to write responses from scratch; instead, they often use existing LLMs to generate answers and then edit them [(01:15:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4523s). - LLMs have started to permeate the post-training stack, being used pervasively to help create massive data sets of conversations [(01:15:35)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4535s). - Modern data sets of conversations, such as UltraChat, are often synthetic but may have some human involvement, with a huge amount of synthetic help and a little bit of human editing [(01:15:49)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4549s). - These data sets have millions of conversations, mostly synthetic but edited by humans, spanning a huge diversity of areas [(01:16:14)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4574s). - The conversations in these data sets are often a mixture of different types and sources, partially synthetic and partially human [(01:16:33)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4593s). - Despite advancements, the core concept remains the same: training on conversations in data sets [(01:16:48)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4608s). - When interacting with AI models like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT), the responses are statistically aligned with the training set, which is based on human labelers following labeling instructions [(01:17:00)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4620s). - The responses from ChatGPT can be thought of as a simulation of a human labeler, who is often an educated expert in a specific field [(01:17:40)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4660s). - The human labelers involved in creating conversation data sets are usually hired experts, and the AI model’s responses can be seen as a simulation of their answers [(01:18:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4681s). ## [](#hallucinations%2C-tool-use%2C-knowledge%2Fworking-memory-\(01%3A20%3A32\))hallucinations, tool use, knowledge/working memory [(01:20:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4832s) - Large Language Models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) can exhibit emergent cognitive effects due to their training pipeline, including hallucinations, where they fabricate information or make things up, which is a significant problem with LLM assistants [(01:20:33)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4833s). - Hallucinations in LLMs occur when the model statistically imitates its training set, where questions of the form “who is blah” are confidently answered with the correct answer, leading the model to take on the style of the answer and make stuff up [(01:22:40)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4960s). - The training set for LLMs often includes conversations where humans provide confident and correct answers to questions, which can lead to the model adopting a similar tone and style, even when it’s unsure or doesn’t know the answer [(01:21:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4906s). - When asked about a random or non-existent person, the LLM may not say “I don’t know” even if it’s unfamiliar with the person, instead providing a statistically likely guess or making something up [(01:22:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4929s).
- The Falcon 7B model, an older LLM, suffers from hallucinations and can provide false or made-up information when asked about a person, such as Orson Kovats [(01:23:20)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5000s). - The model’s responses can vary when resampled, providing different and often incorrect answers, highlighting the statistical nature of LLMs [(01:23:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5026s). - Large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) do not have access to the internet and do not conduct research, instead relying on statistical token prediction to generate responses [(01:23:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=4981s). - Large language models (LLMs) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) can provide different answers to the same question because they don’t actually know the answer, but instead sample from probabilities and imitate the format of the answer in their training set [(01:24:02)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5042s). - This can result in statistically consistent but made-up factual knowledge, and the model is not going to look up the correct answer because it’s just imitating the format of the answer [(01:24:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5076s). - To mitigate this issue, it’s necessary to have examples in the training set where the correct answer is that the model doesn’t know something, but only in cases where the model actually doesn’t know [(01:25:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5147s). - This can be achieved by empirically probing the model to figure out what it knows and doesn’t know, and then adding examples to the training set where the correct answer is that the model doesn’t know [(01:26:07)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5167s). - Meta’s approach to dealing with hallucinations in their [Llama (language model)](https://en.wikipedia.org/wiki/Llama_\(language_model\)) 3 series of models involved interrogating the model to figure out the boundary of its knowledge and adding examples to the training set where the correct answer is that the model doesn’t know [(01:26:18)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5178s). - This approach roughly fixes the issue because the model might have a good model of its self-knowledge inside the network, but the activation of the relevant neurons is not currently wired up to the model actually saying that it doesn’t know [(01:26:54)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5214s). - By allowing the model to say “I don’t know” in cases where it’s uncertain, it’s possible to improve the accuracy of the model’s responses and reduce the occurrence of hallucinations [(01:27:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5257s). - Meta’s process involves taking a random document from a training set, selecting a paragraph, and using a [Large language model](https://en.wikipedia.org/wiki/Large_language_model) (LLM) to construct questions about that paragraph, which can be done with fairly high accuracy if the information is within the LLM’s context window [(01:27:41)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5261s). - The generated questions and answers are then used to interrogate the model, which can be done programmatically by comparing the model’s answer to the correct answer, and if the model’s answer is correct, it likely knows the information; a sketch of this probing loop follows below [(01:28:50)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5330s).
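A hedged sketch of this probing loop, in the spirit of the Llama 3 recipe described above. `ask_model` and `judge_correct` are stand-ins of my own for real model calls, and the question/reference literals are illustrative placeholders (in practice both are generated by an LLM from a training document):

```python
import random

def ask_model(question: str) -> str:
    # Stand-in for sampling the model under test (an assumption, not a real API).
    return random.choice(["a 19th-century politician", "a science-fiction author"])

def judge_correct(answer: str, reference: str) -> bool:
    # In practice another LLM judges equivalence; substring match is a stand-in.
    return reference.lower() in answer.lower()

def model_knows(question: str, reference: str, n_tries: int = 3) -> bool:
    """Interrogate the model several times; only consistently correct
    answers count as knowledge."""
    return all(judge_correct(ask_model(question), reference) for _ in range(n_tries))

question, reference = "Who is Orson Kovats?", "no such person"
sft_data = []
if not model_knows(question, reference):
    # Add a refusal example so the model learns to say it does not know.
    sft_data.append({"user": question, "assistant": "I'm sorry, I don't know."})
```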
- If the model’s answer is incorrect, it is considered to be “making stuff up,” and this can be determined by interrogating the model multiple times and comparing its answers to the correct answer [(01:30:17)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5417s).
- When the model is found to not know the answer to a question, a new conversation is added to the training set with the correct answer being “I’m sorry, I don’t know” or “I don’t remember,” which gives the model the opportunity to refuse to answer based on its knowledge [(01:30:54)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5454s).
- This process can be repeated for many different types of questions and documents, allowing the model to learn when to say “I don’t know” and improving its overall performance [(01:31:15)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5475s).
- Large language models (LLMs) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) can learn to associate knowledge-based refusal with internal neurons in their network, allowing them to say they don’t know something when their uncertainty is high, mitigating hallucination issues if examples of this are present in the training set [(01:31:30)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5490s).
- Instead of just saying they don’t know, [Large language model](https://en.wikipedia.org/wiki/Large_language_model)s can be given the opportunity to be factual and answer questions by introducing an additional mitigation that allows them to search for information, similar to how humans would use the internet to find answers [(01:32:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5541s).
- The knowledge inside an LLM’s neural network can be thought of as a vague recollection of things seen during training, and just like humans, LLMs can “refresh their memory” by looking up information [(01:32:54)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5574s).
- To achieve this, tools can be introduced that allow LLMs to emit special tokens, such as “search start” and “search end”, which can trigger a search query to be sent to a search engine like [Microsoft Bing](https://en.wikipedia.org/wiki/Microsoft_Bing) or [Google](https://en.wikipedia.org/wiki/Google) [(01:33:42)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5622s).
- When the LLM emits the “search end” token, the program running the inference can pause generating from the model, open a session with the search engine, paste in the search query, and retrieve the relevant text [(01:34:30)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5670s).
- The retrieved text can then be represented with special tokens and used to provide a more accurate answer to the original question [(01:34:54)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5694s).
- The context window can be thought of as the working memory of the model, where data is directly accessible and feeds into the neural network, allowing the model to reference and utilize the data easily when sampling new tokens [(01:35:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5721s).
- The model uses tools like web search by introducing new tokens and schema, which are learned through training sets that show the model how to use these tools by example [(01:36:02)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5762s).
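A minimal sketch of that inference-time protocol. The token names (`<SEARCH_START>`, `<SEARCH_END>`, `<RESULT>`) and helper functions are assumptions made for the sketch; real providers use their own schemas:

```python
SEARCH_START, SEARCH_END = "<SEARCH_START>", "<SEARCH_END>"

def llm_generate(context, stop):
    # Stand-in for sampling tokens from the model until `stop` or end-of-turn.
    if "<RESULT>" not in context:
        return f"{SEARCH_START}who is Orson Kovats?"  # model decides to search
    return "I couldn't find a well-known person by that name."

def web_search(query):
    # Stand-in for sending the query to a search engine like Bing or Google.
    return f"(top results for: {query!r})"

def answer_with_search(prompt):
    context = prompt
    while True:
        chunk = llm_generate(context, stop=SEARCH_END)
        if SEARCH_START not in chunk:
            return context + chunk  # ordinary answer, no tool call
        # The model emitted a query: pause generation, run the search, and
        # paste the retrieved text into the context window (working memory).
        query = chunk.split(SEARCH_START, 1)[1]
        context += chunk + SEARCH_END + "<RESULT>" + web_search(query) + "</RESULT>"

print(answer_with_search("User: who is Orson Kovats?\nAssistant: "))
```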
- To teach the model how to correctly use tools like web search, a large amount of data and many conversations are needed to demonstrate how to use the tool, including when to initiate a search, how to emit the start and end tokens, and how to structure the queries [(01:36:14)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5774s).
- The model’s pre-training data set and understanding of the world enable it to understand what a web search is and what makes a good search query, allowing it to learn how to use the tool with just a few examples [(01:36:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5807s).
- When the model retrieves information using a tool like web search, it puts the information in the context window, making it easily accessible and manipulable, similar to how humans look up information [(01:37:07)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5827s).
- The model may decide to use a tool like web search when it encounters a rare or unknown individual or topic, sampling a special token to perform the search and retrieve relevant information [(01:37:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5843s).
- The model can then reference and cite the sources it found during the web search, creating a more informed and accurate response [(01:37:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5865s).
- The model’s ability to use tools like web search allows it to provide more accurate and up-to-date information, even if it doesn’t have the information in its initial training data [(01:38:15)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=5895s).

## [](#knowledge-of-self-\(01%3A41%3A46\))knowledge of self [(01:41:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6106s)

- Large Language Models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) do not have a persistent existence or sense of self, as they boot up, process tokens, and shut off for every single conversation, making them restart from scratch each time [(01:41:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6107s).
- Asking an LLM questions like “what model are you” or “who built you” can be nonsensical, as they do not have a persistent self and will provide random answers based on their statistical best guess [(01:41:54)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6114s).
- For example, the Falcon model, when asked about its origins, provided a made-up answer, stating it was built by [OpenAI](https://en.wikipedia.org/wiki/OpenAI) based on the [GPT-3](https://en.wikipedia.org/wiki/GPT-3) model, which is likely a result of its statistical best guess rather than actual knowledge [(01:42:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6166s).
- LLMs can take on a persona or identity based on their training data, but this is not a true sense of self, and they can be overridden by developers to provide more accurate information [(01:43:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6227s).
- The pre-training stage of LLMs involves training on a large corpus of text data, including documents from the entire internet, which can influence their responses to questions about their identity [(01:43:53)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6233s).
- Developers can override an LLM’s self-identity by providing explicit programming or training data that defines its label or persona; both mechanisms are sketched below [(01:44:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6263s).
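The next bullets walk through how the fully open OLMo model does this; as a preview, here is a minimal sketch of the two mechanisms, using the common role/content message convention (the exact formats and wording are assumptions, not any provider’s real data):

```python
# Mechanism 1: hardcoded identity conversations mixed into the fine-tuning
# set, which the model later "parrots" when asked about itself.
identity_training_example = [
    {"role": "user", "content": "Tell me about yourself."},
    {"role": "assistant", "content": (
        "I am OLMo, an open language model developed by the "
        "Allen Institute for AI. I am here to help."
    )},
]

# Mechanism 2: a hidden system message prepended to every conversation; its
# tokens sit in the context window and remind the model of its identity.
conversation = [
    {"role": "system", "content": (
        "You are OLMo, an open language model trained by the Allen Institute "
        "for AI. Knowledge cutoff: 2024."  # illustrative wording, not AI2's
    )},
    {"role": "user", "content": "What model are you?"},
]
```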
- The OLMo model from the [Allen Institute for AI](https://en.wikipedia.org/wiki/Allen_Institute_for_AI) is an example of a [Large language model](https://en.wikipedia.org/wiki/Large_language_model) that has been fully open-sourced, allowing for transparency into its training data and development [(01:44:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6271s).
- The OLMo model’s training data includes a mixture of conversations, with a small portion of hardcoded conversations that define its persona and identity [(01:44:54)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6294s).
- The assistant’s response to the user’s question “Tell me about yourself” is a pre-programmed answer, stating it is an open language model developed by the Allen Institute for Artificial Intelligence, and that its purpose is to help [(01:45:12)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6312s).
- The assistant’s responses to questions about itself are based on hardcoded questions and answers in its training set, and fine-tuning the model with these conversations enables it to parrot this information later [(01:45:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6323s).
- If the model is not given this information, it will likely claim to be ChatGPT by [OpenAI](https://en.wikipedia.org/wiki/OpenAI); another way to program the model to talk about itself is through a system message at the beginning of the conversation [(01:45:43)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6343s).
- The system message is a special message that can hardcode and remind the model of its identity, including its name, training data, and knowledge cutoff, and this information is inserted into conversations [(01:46:03)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6363s).
- When interacting with [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT), a blank page is displayed, but the system message is hidden and its tokens are in the context window, reminding the model of its identity [(01:46:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6383s).
- There are two ways to program the model to talk about itself: through data like hardcoded questions and answers, or through system messages and invisible tokens in the context window [(01:46:30)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6390s).
- The model’s ability to talk about itself is not deeply ingrained but rather “cooked up and bolted on” in some way, unlike human identity [(01:46:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6407s).

## [](#models-need-tokens-to-think-\(01%3A46%3A56\))models need tokens to think [(01:46:56)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6416s)

- When constructing examples of conversations for training large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)), it’s crucial to be careful, as there are sharp edges that can be elucidative of how these models think [(01:46:57)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6417s).
- A simple math problem prompt, such as “Emily buys three apples and two oranges, each orange costs $2, the total cost is $13, what is the cost of each apple?”, has candidate answers that both reach the correct result, but one is a significantly better training label for the assistant than the other [(01:47:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6452s).
- The key to this question is realizing that LLMs work in a one-dimensional sequence of tokens from left to right, and when training and inferencing, they are working with this sequence to produce the next token [(01:48:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6512s).
- The neural network used in LLMs calculates the probabilities for the next token in a sequence by feeding all previous tokens into the network, which then outputs the probabilities for the next token [(01:48:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6526s).
- The number of layers of computation in LLMs is finite, typically around 100 layers, and this amount of computation is roughly fixed for every single token in the sequence, making it a small amount of computation [(01:49:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6577s).
- Although the amount of computation is not fully fixed, as the more tokens fed in, the more expensive the forward pass of the neural network will be, it’s still a relatively small increase [(01:50:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6601s).
- Large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) have a finite amount of computation per token, which means they cannot perform arbitrary computation in a single forward pass to get a single token [(01:50:27)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6627s).
- This limitation requires distributing reasoning and computation across many tokens, as every single token is only spending a finite amount of computation on it [(01:50:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6637s).
- Having too much computation or expecting too much computation out of the model in any single individual token is not possible because there’s only so much computation that happens per token [(01:50:55)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6655s).
- When answering questions directly and immediately, the model is trained to try to guess the answer in a single token, which is not effective due to the finite amount of computation per token [(01:52:07)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6727s).
- Distributing computation across the answer, by getting intermediate results and creating intermediate calculations, allows the model to slowly come to the answer from left to right [(01:52:19)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6739s).
- This approach teaches the model to spread out its reasoning and computation over the tokens, making it easier for the model to determine the answer by the time it’s near the end [(01:52:59)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6779s).
- Labeling answers in a way that requires the model to do all the computation in a single token is not effective and can be considered a bad label for computation [(01:53:18)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6798s).
- Typically, users don’t have to think about this explicitly, as the people at [OpenAI](https://en.wikipedia.org/wiki/OpenAI) have labelers who worry about this and make sure that the answers are properly labeled [(01:53:38)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6818s).
- When asking [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) a question, it goes through a process of defining variables, setting up equations, and creating intermediate results, which are not visible to the user but are necessary for the model to reach the correct answer [(01:53:41)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6821s).
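To make this concrete with the Emily prompt above (three apples and two oranges, oranges $2 each, $13 total): a label that leads with the answer, like “The answer is $3 because …”, demands that the network produce “$3” with a single token’s worth of computation, whereas a label like “The two oranges cost 2 × $2 = $4, which leaves $13 − $4 = $9 for the three apples, so each apple costs $9 ÷ 3 = $3” spreads the work across many tokens, each requiring only one small, easy step.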
- The model can be asked to answer a question in a single token, but this may not always be possible, especially if the numbers involved are large or the calculation is complex [(01:54:10)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6850s).
- If the model is asked to perform a calculation in a single token and fails, it will produce an incorrect answer, but if it is allowed to perform the calculation in multiple steps, it can produce the correct answer [(01:54:53)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6893s).
- The model’s ability to perform calculations in a single token is limited, and it may not be able to perform complex calculations in a single forward pass of the network [(01:55:02)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6902s).
- If the model is allowed to perform calculations in multiple steps, it can produce the correct answer, and each intermediate result is much easier for the model to calculate [(01:55:16)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6916s).
- The model’s calculations can be verified by using code, which is one of the tools that ChatGPT can use, and this can be more reliable than relying on the model’s mental arithmetic [(01:56:02)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6962s).
- Using code can be more trustworthy than relying on the model’s mental arithmetic, especially when dealing with large numbers or complex calculations [(01:56:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6971s).
- The model’s ability to perform mental arithmetic is impressive, but it is not always reliable, and using tools like code can provide a more accurate result [(01:56:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=6991s).
- Large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) can write code and use a [Python (programming language)](https://en.wikipedia.org/wiki/Python_\(programming_language\)) interpreter to calculate results, which is more reliable than the model’s mental arithmetic, as it has more correctness guarantees [(01:56:53)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7013s).
- LLMs have special tokens for calling code interpreters, which allows them to write a program, send it to a different part of the computer to run, and then access the result [(01:57:30)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7050s).
- Relying on the model’s memory and mental arithmetic is the more error-prone approach, so it’s recommended to have the model use tools like the code interpreter to perform tasks instead [(01:57:57)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7077s).
- LLMs are not good at counting because they try to solve the problem in a single token, which has limited computation power [(01:58:26)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7106s).
- Breaking down problems into smaller, easier tasks can help LLMs perform better, as seen in the example of counting dots, where using code can provide a more accurate result [(01:59:55)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7195s).
- LLMs can be asked to create intermediate results or use tools to perform tasks, which can improve their performance and reduce errors [(01:58:08)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7088s).
- Models need tokens to think, and distributing computation across many tokens or using tools can help improve their performance [(01:58:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7081s).
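What “ask the model to use code” looks like in practice: a snippet of the kind ChatGPT writes and runs in its code interpreter. The specific strings are illustrative (the spelling-style examples are discussed in the next section):

```python
# Counting and character-level tasks are hard in a single forward pass,
# but trivial once the model writes them out as code.

dots = "...................."      # 20 dots; the model can copy-paste these
print(len(dots))                   # reliably, then let Python count: 20

word = "strawberry"
print(word.count("r"))             # character-level counting: 3

print("ubiquitous"[::3])           # every third character: "uqts"
```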
- Large Language Models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) are proficient in tasks that involve copying and pasting, as they can easily replicate token IDs and unpack them into the desired format [(02:00:14)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7214s).
- When given a task that requires counting, LLMs can create a string in a programming language like [Python (programming language)](https://en.wikipedia.org/wiki/Python_\(programming_language\)), which then calls a routine to perform the actual counting, rather than relying on the model’s mental arithmetic [(02:00:20)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7220s).
- The model’s ability to perform tasks is heavily reliant on tokens, and it is not capable of complex mental arithmetic, which is why it is not very good at counting tasks [(02:00:57)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7257s).
- For tasks that require counting, it is recommended to ask the model to use a tool or programming language to perform the task, rather than relying on the model’s mental arithmetic [(02:01:07)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7267s).

## [](#tokenization-revisited%3A-models-struggle-with-spelling-\(02%3A01%3A11\))tokenization revisited: models struggle with spelling [(02:01:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7271s)

- Large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) have cognitive deficits, including struggling with spelling-related tasks, due to their token-based processing, which can lead to character-level tasks failing [(02:01:18)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7278s).
- LLMs see tokens, not individual characters, making tasks like printing every third character in a string challenging, as demonstrated with the string “ubiquitous” [(02:01:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7305s).
- The tokenization process can lead to issues with tasks that require character-level manipulation, and using tools like code can help overcome these limitations [(02:03:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7389s).
- LLMs may struggle with simple tasks like counting the number of Rs in the word “strawberry,” which went viral as an example of their limitations [(02:03:44)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7424s).
- The combination of LLMs not seeing individual characters and struggling with counting can lead to difficulties with tasks that require both, such as counting Rs in “strawberry” [(02:04:22)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7462s).
- While LLMs have improved, they are still not very good at spelling tasks, and users should be aware of these limitations when using the models in practice [(02:04:41)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7481s).

## [](#jagged-intelligence-\(02%3A04%3A53\))jagged intelligence [(02:04:53)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7493s)

- Large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) have jagged edges, meaning they can struggle with certain tasks despite their capabilities, and some of these struggles may not make sense even to those who understand how the models work [(02:04:54)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7494s).
- LLMs can solve complex math problems and answer PhD-grade physics, chemistry, and biology questions, but sometimes struggle with very simple problems [(02:05:16)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7516s).
- An example of this is when an LLM was asked if 9.11 is bigger than 9.9, and it justified its answer but later flipped its decision, showing inconsistency in its responses [(02:05:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7531s).
- This inconsistency can occur even when asking the same question multiple times, with the model sometimes getting it right and sometimes getting it wrong [(02:06:01)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7561s).
- Research has shown that when scrutinizing the activations inside the neural network, a bunch of neurons associated with [Bible](https://en.wikipedia.org/wiki/Bible) verses light up, which may distract the model and lead to incorrect answers [(02:06:24)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7584s).
- In this case, the model may be reminded that in a Bible verse setting, 9.11 would come after 9.9, leading to a cognitively distracting association that affects its answer [(02:06:42)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7602s).
- This issue is not fully understood and is an example of the jagged edges of [Large language model](https://en.wikipedia.org/wiki/Large_language_model)s, highlighting the need to treat these models as stochastic systems that are magical but not fully trustworthy [(02:07:08)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7628s).
- LLMs should be used as tools, not relied upon to provide definitive answers, and their results should not be simply copied and pasted without verification [(02:07:25)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7645s).

## [](#supervised-finetuning-to-reinforcement-learning-\(02%3A07%3A28\))supervised finetuning to reinforcement learning [(02:07:28)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7648s)

- Large language models undergo two major stages of training: pre-training and post-training, with the latter including supervised fine-tuning and reinforcement learning [(02:07:29)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7649s).
- In the pre-training stage, the model is trained on internet documents, resulting in a base model that simulates internet documents but is not directly useful for tasks like answering questions [(02:07:44)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7664s).
- To create an assistant, the model undergoes supervised fine-tuning, where it is trained on a curated dataset of conversations between a human and an assistant, created by humans with the help of tools [(02:08:30)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7710s).
- The conversations in the dataset are created by humans who write prompts and ideal responses based on labeling documentations, with some help from language models [(02:08:44)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7724s).
- The fine-tuned model can hallucinate if not mitigated, and can be improved by leaning on tools like web search or code interpreters [(02:09:26)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7766s).
- Reinforcement learning is the last major stage of the pipeline, considered part of post-training, and is a different way of training language models that usually follows as the third step [(02:10:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7806s).
- Inside companies like [OpenAI](https://en.wikipedia.org/wiki/OpenAI), there are separate teams for each stage of the pipeline, with a handoff of the models from one team to the next [(02:10:26)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7826s).
- The major flow of training large language models involves pre-training, imitation of experts, and reinforcement learning, with the latter being the focus of discussion [(02:10:55)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7855s).
- Reinforcement learning is motivated by the idea of taking large language models through a process similar to going to school, where they acquire knowledge and skills through various paradigms [(02:11:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7871s).
- Textbooks in school typically contain three classes of information: exposition, problems with worked solutions, and practice problems [(02:11:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7896s).
- Exposition in textbooks is equivalent to pre-training, where the model builds a knowledge base and gets a sense of the topic [(02:12:08)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7928s).
- Problems with worked solutions in textbooks are equivalent to imitation of experts, where the model trains on expert data and tries to imitate the expert’s response [(02:12:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=7952s).
- Practice problems in textbooks are critical for learning, as they require the model to practice and discover ways of solving problems on its own, using background information from pre-training and maybe some guidance from expert data [(02:13:30)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8010s).
- In practice problems, the model is given a problem description and the final answer, but not the solution, and it must try out different approaches to find the best solution [(02:13:44)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8024s).
- The process of solving practice problems relies on the background information from pre-training and maybe some guidance from expert data [(02:14:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8051s).
- The concept of imitation of human experts is briefly mentioned, suggesting that similar solutions can be attempted, and this idea is built upon in the following section [(02:14:17)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8057s).
- The current section focuses on practicing and trying out different solutions, rather than relying on expert solutions [(02:14:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8076s).
- This practice-based approach is related to reinforcement learning, which involves experimenting and learning from trial and error [(02:14:40)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8080s).

## [](#reinforcement-learning-\(02%3A14%3A42\))reinforcement learning [(02:14:42)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8082s)

- The problem being discussed is about Emily buying three apples and two oranges, with each orange costing $2, and the total cost of all the fruit being $13, and the goal is to find the cost of each apple, with four possible candidate solutions that all reach the answer three [(02:14:43)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8083s).
- The solutions can be presented in different ways, such as setting up a system of equations, talking through it in [English language](https://en.wikipedia.org/wiki/English_language), or skipping right to the solution, and the human data labeler may not know which conversation to add to the training set [(02:15:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8131s).
- The primary purpose of a solution is to reach the right answer, but there is also a secondary purpose of presenting the solution in a nice way for the human, with intermediate steps and a clear presentation [(02:16:12)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8172s).
- If the only goal is to reach the final answer, it is unclear which of the solutions is the best or most optimal for the [Large language model](https://en.wikipedia.org/wiki/Large_language_model) to use, and the human labeler may not know which one is best [(02:16:42)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8202s).
- The LLM can only spend a finite amount of compute on each token, and making too big of a leap in any one token can lead to mistakes, so it may be better to spread out the calculations or set up the problem as an equation [(02:17:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8231s).
- A token sequence like “(13 - 4) / 3 =”, with the answer expected immediately after the equals sign, is given as a potentially bad example to give to the LLM, as it incentivizes the model to skip through the calculations quickly and make mistakes [(02:17:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8257s).
- The fundamental issue with Large Language Models (LLMs) is that their cognition is different from human cognition, making it challenging to annotate examples for them, as what is easy or hard for humans may not be the same for LLMs [(02:18:04)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8284s).
- LLMs have a vast amount of knowledge, including PhD-level knowledge in math, physics, and chemistry, which may not be utilized in problem-solving, and conversely, human solutions may inject knowledge that the LLM does not possess [(02:19:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8351s).
- The goal is to find the most economical token sequence that reaches the final solution, but humans are not in a good position to create these sequences, and instead, the LLM should discover them through reinforcement learning and trial and error [(02:19:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8385s).
- Reinforcement learning involves trying many different solutions and seeing which ones work well or not, and in the process, the [Large language model](https://en.wikipedia.org/wiki/Large_language_model) will discover the token sequences that reliably get to the answer given the prompt [(02:20:19)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8419s).
- The process of reinforcement learning is demonstrated using the Gemma 2 model with 2 billion parameters, which is a tiny model, but sufficient for illustration purposes [(02:20:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8431s).
- The model generates different solutions to a problem, and the correct answer is inspected, with the goal of finding the most effective solution through multiple attempts [(02:20:51)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8451s).
- Each attempt at a solution is different due to the stochastic nature of the model, which samples from a probability distribution at every token, resulting in slightly different paths [(02:21:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8483s).
- Large language models (LLMs) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) can generate multiple independent solutions for a single prompt, with some being correct and others not, by sampling thousands or millions of solutions in parallel [(02:21:44)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8504s).
- The goal is to encourage the solutions that lead to correct answers, and this can be visualized using a cartoon diagram showing a prompt with multiple solutions, some of which reach the correct answer (green) and others that do not (red) [(02:22:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8526s).
- In a hypothetical scenario, if only four out of 15 generated solutions are correct, the model will train on these correct solutions to encourage similar behavior in the future [(02:22:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8572s).
- The training sequences are not provided by expert human annotators, but rather by the model itself, allowing it to practice and learn from its own solutions [(02:23:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8616s).
- The model can be thought of as a student looking at its own solutions and deciding which ones work well, and then training on those solutions to improve its performance [(02:23:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8625s).
- The methodology can be tweaked in various ways, but a simple approach is to take the single best solution and train on it, making the model more likely to take that path in similar settings in the future [(02:24:02)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8642s).
- This process is repeated across many diverse prompts and thousands of solutions, allowing the model to discover for itself what kinds of token sequences lead to correct answers [(02:24:50)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8690s).
- The model is essentially “playing” and learning in a playground, trying to achieve its goals without relying on human annotators [(02:25:05)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8705s).
- The process of discovering sequences that work for a model involves reinforcement learning, which is a guess and check method where many different types of solutions are tried, and the ones that work are used more in the future [(02:25:12)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8712s).
- Supervised fine-tuning (SFT) models are helpful in initializing the model to the vicinity of correct solutions, but reinforcement learning is where the model really discovers the solutions that work and gets dialed in [(02:25:40)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8740s).
- The SFT model initializes the model by getting it to take solutions, set up systems of equations, and talk through solutions, but it does so blindly and statistically mimics expert behavior [(02:25:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8745s).
- Reinforcement learning is where the model discovers the solutions that work, gets the right answers, and is encouraged to improve over time [(02:26:10)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8770s).
- The high-level process for training large language models is similar to how children are trained, with the main difference being that children go through chapters of books and do training exercises, while AI models are trained stage by stage [(02:26:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8781s).
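A minimal sketch of this guess-and-check loop, under stated assumptions: `sample_solution`, `extract_answer`, and `train_on` are stand-ins for real model rollouts and gradient updates, and “best” is simplified to “shortest correct” as a stand-in for whatever criterion is actually used.

```python
import random

# Stand-ins for model rollouts and gradient updates (illustrative only).
def sample_solution(model, prompt):
    return random.choice(model["behaviors"])    # stochastic sampling per rollout

def extract_answer(solution):
    return solution.rsplit("=", 1)[-1].strip()  # read off the final answer

def train_on(model, solution):
    model["behaviors"].append(solution)         # make this path more likely

def rl_step(model, prompt, correct_answer, num_rollouts=15):
    """One round of RL on one practice problem: sample, check, reinforce."""
    rollouts = [sample_solution(model, prompt) for _ in range(num_rollouts)]
    correct = [s for s in rollouts if extract_answer(s) == correct_answer]
    if correct:
        # Simplest variant: train on a single best correct solution.
        train_on(model, min(correct, key=len))
    return f"{len(correct)}/{num_rollouts} rollouts were correct"

model = {"behaviors": ["2*2 = 4; 13-4 = 9; 9/3 = 3", "13/3 = 4.33"]}
print(rl_step(model, "Emily buys 3 apples and 2 oranges...", "3"))
```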
- The training process for AI models involves pre-training, which is equivalent to reading all the expository material and building a knowledge base, followed by the SFT stage, which involves looking at worked solutions from human experts [(02:26:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8805s).
- The final stage is the reinforcement learning (RL) stage, where the model works through the practice problems across all the textbooks, yielding the RL model [(02:27:26)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8846s).
- The overall process of training large language models is equivalent to the process used for training children, with the main difference being the stage-by-stage approach used for AI models [(02:27:40)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8860s).

## [](#deepseek-r1-\(02%3A27%3A47\))DeepSeek-R1 [(02:27:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8867s)

- The pre-training and fine-tuning stages of large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) have been around for years and are standard practices among different LLM providers, but the RL training stage is still in its early development and not yet standardized [(02:27:55)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8875s).
- The high-level idea of RL training is trial and error, but there are many details and mathematical nuances to consider, such as how to pick the best solutions, how much to train on them, and what prompt distribution to use [(02:28:15)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8895s).
- Many companies, including [OpenAI](https://en.wikipedia.org/wiki/OpenAI) and other LLM providers, have experimented with reinforcement learning fine-tuning for LLMs internally, but have not publicly discussed it until recently [(02:28:41)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8921s).
- A paper from [DeepSeek](https://en.wikipedia.org/wiki/DeepSeek), a company in [China](https://en.wikipedia.org/wiki/China), recently discussed reinforcement learning fine-tuning for large language models and its importance in bringing out reasoning capabilities in the models [(02:28:52)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8932s).
- The paper reinvigorated public interest in using RL for LLMs and provided the details needed to reproduce their results and get the stage to work for large language models [(02:29:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=8961s).
- When RL is correctly applied to language models, it can improve their ability to solve mathematical problems, with accuracy continuing to climb as the model is updated over thousands of steps [(02:30:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9009s).
- The models are not only improving quantitatively but also qualitatively, discovering how to solve math problems and using more tokens to get higher accuracy results, leading to longer solutions [(02:30:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9046s).
- The model’s solutions are not only more accurate but also more detailed, with the model learning to create very long solutions to math problems [(02:30:56)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9056s).
- Large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) have the ability to re-evaluate steps and try out different perspectives to improve accuracy in problem-solving, which is an emergent property of new optimization techniques [(02:31:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9071s).
- This process involves the model retracing, reframing, and backtracking to identify the correct solution, similar to how humans approach problem-solving, but it’s not something that can be hardcoded by humans [(02:31:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9097s).
- The model learns to use “chains of thought” and discovers cognitive strategies for manipulating problems and approaching them from different perspectives, which is a result of reinforcement learning [(02:32:04)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9124s).
- This process can lead to longer response lengths, but it also increases the accuracy of problem-solving, and it’s incredible to see this emerge in the optimization without having to hardcode it anywhere [(02:32:10)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9130s).
- The model is able to discover ways to think and learn cognitive strategies, such as trying out different things, checking results from different perspectives, and solving problems, all of which are discovered by reinforcement learning [(02:32:22)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9142s).
- A reasoning or thinking model, such as the one described in the DeepSeek-R1 paper, can be used to solve problems and provide more detailed and thought-out responses, and this model is available through DeepSeek’s chat interface [(02:33:25)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9205s).
- When given a problem, the reasoning model can provide a more step-by-step and thought-out response, showing the model’s ability to think and reason through the problem [(02:33:44)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9224s).
- The model’s response shows that it is actively thinking and pursuing a solution, and it’s able to check its math and provide a more detailed and accurate response [(02:34:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9251s).
- Large Language Models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) use a thinking process to solve problems, which involves trying different approaches and perspectives to ensure correctness, and this process is made possible by reinforcement learning [(02:34:25)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9265s).
- The model’s thinking process is reflected in its ability to write out a nice solution for humans, considering both correctness and presentation aspects, and this is what’s coming from the reinforcement learning process [(02:34:49)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9289s).
- The thinking process of LLMs is what’s giving higher accuracy in problem-solving and is where “aha moments” and different strategies are seen [(02:35:10)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9310s).
- Some people are nervous about putting sensitive data into DeepSeek’s own chat service because [DeepSeek](https://en.wikipedia.org/wiki/DeepSeek) is a Chinese company, but since DeepSeek R1 is an open-weights model, it can also be hosted by American companies [(02:35:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9332s).
- DeepSeek R1 is available for anyone to download and use, but running the full model in full precision requires significant computational resources, and many companies, including Together.ai, host the full model [(02:35:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9345s).
- Together.ai hosts various state-of-the-art models, including DeepSeek R1, which can be selected and used in their playground, and the results should be equivalent to those obtained from DeepSeek’s own service [(02:36:12)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9372s).
- The reasoning models in [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT), such as o1 and o3, use advanced reasoning, which refers to their training by reinforcement learning with techniques similar to those used for DeepSeek R1 [(02:37:14)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9434s).
- Models like [GPT-4](https://en.wikipedia.org/wiki/GPT-4)o, by contrast, can be thought of as mostly Supervised Fine-Tuning (SFT) models, rather than models that actually “think” like Reinforcement Learning (RL) models, despite some involvement of RL in these models [(02:37:43)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9463s).
- Thinking models, such as o3-mini-high, may not be available to users unless they pay for a ChatGPT subscription, which can cost $20 or $200 per month for top models [(02:38:05)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9485s).
- When using a thinking model, it will display reasoning and chains of thought, but the web interface may not show the exact chains of thought, instead providing summaries, due to concerns about the distillation risk [(02:38:20)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9500s).
- The distillation risk refers to the possibility that someone could imitate the reasoning traces and recover the reasoning performance by just imitating the chains of thought [(02:38:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9526s).
- Thinking models, such as those from [DeepSeek](https://en.wikipedia.org/wiki/DeepSeek), are currently on par with models from [OpenAI](https://en.wikipedia.org/wiki/OpenAI) in terms of performance, but the evaluations can be difficult to compare [(02:39:14)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9554s).
- DeepSeek R1 is a solid choice for a thinking model that is available to users, either on the DeepSeek website or by downloading the open weights [(02:39:30)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9570s).
- Reinforcement learning can lead to the emergence of thinking in the process of optimization, particularly when applied to math and code problems with verifiable solutions [(02:39:48)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9588s).
- Thinking models can be accessed through DeepSeek or other inference providers, and are also available in [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) as the o1 or o3 models, but not as the [GPT-4](https://en.wikipedia.org/wiki/GPT-4)o models [(02:40:04)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9604s).
- For prompts that require advanced reasoning, it is recommended to use thinking models, but for simpler questions GPT-4o may be sufficient [(02:40:25)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9625s).
- Empirically, about 80-90% of use cases may only require GPT-4o, while thinking models are better suited for more difficult problems, such as math and code, but may take longer to respond [(02:40:44)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9644s).
- [Google](https://en.wikipedia.org/wiki/Google)’s AI Studio offers a thinking variant of the [Gemini (chatbot)](https://en.wikipedia.org/wiki/Gemini_\(chatbot\)) model, an early experiment by Google in developing a thinking model, which can solve thinking problems and produce the right answers [(02:41:15)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9675s).
- The Gemini thinking model is similar to the thinking models offered by other companies, and is part of the frontier development of Large Language Models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) [(02:41:40)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9700s).
- Currently, as of early 2025, these thinking models are experimental and getting the details right is difficult, which is why they are still in the development stage [(02:41:55)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9715s).
- The development of these thinking models is an exciting stage in the field of LLMs, particularly in the area of Reinforcement Learning (RL), which is pushing the performance on very difficult problems using emerging reasoning and optimizations [(02:41:57)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9717s).
- [Anthropic](https://en.wikipedia.org/wiki/Anthropic) currently does not offer a thinking model, unlike some other companies, but the development of these models is ongoing and rapidly evolving [(02:41:42)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9702s).

## [](#alphago-\(02%3A42%3A07\))AlphaGo [(02:42:07)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9727s)

- The discovery of reinforcement learning as a powerful way of learning is not new to the field of AI, and it has been demonstrated in the game of Go with DeepMind’s system [AlphaGo](https://en.wikipedia.org/wiki/AlphaGo) [(02:42:07)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9727s).
- AlphaGo learned to play the game of Go against top human players, and its underlying paper shows a plot comparing the strength of a model trained by supervised learning and a model trained by reinforcement learning [(02:42:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9752s).
- The supervised learning model imitated human expert players, but it topped out and never quite got better than top players like [Lee Sedol](https://en.wikipedia.org/wiki/Lee_Sedol), whereas the reinforcement learning model was significantly more powerful and could go beyond human performance [(02:43:09)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9789s).
- In reinforcement learning for the game of Go, the system plays moves that empirically and statistically lead to winning the game, and AlphaGo used this method to create rollouts and try out lots of solutions [(02:43:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9826s).
- The system learned the sequences of actions that empirically and statistically lead to winning the game, and reinforcement learning is not constrained by human performance [(02:44:24)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9864s).
- AlphaGo’s demonstration of reinforcement learning is powerful, and we are only starting to see hints of this in larger language models for reasoning problems [(02:44:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9887s).
- Imitating experts is not enough, and we need to set up game environments to let the system discover unique reasoning traces or ways of solving problems that work well [(02:44:58)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9898s).
- Reinforcement learning allows the system to veer off the distribution of how humans are playing the game, and AlphaGo’s move 37 is an example of a move that no human expert would play [(02:45:19)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9919s).
- [AlphaGo](https://en.wikipedia.org/wiki/AlphaGo), a computer program, made a rare move in a game of Go that was evaluated to have about a 1 in 10,000 chance of being played by a human, but in retrospect it was a brilliant move unknown to humans, demonstrating the power of reinforcement learning [(02:45:45)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9945s).
- This move was surprising and seemed like a mistake at first, but it showed that AlphaGo had discovered a strategy that was unknown to humans through its training [(02:46:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9981s).
- The power of reinforcement learning can be applied to language models, and in principle, it can lead to the equivalent of discovering new strategies and thinking patterns that are unknown to humans [(02:46:39)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=9999s).
- Solving problems in a way that even humans cannot understand raises questions about how language models can be better at reasoning and thinking than humans, and whether it means discovering new analogies, thinking strategies, or even a new language [(02:46:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10007s).
- The behavior of a language model is less defined and open to doing whatever works, including slowly drifting from the distribution of its training data, which is [English language](https://en.wikipedia.org/wiki/English_language) [(02:47:29)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10049s).
- To achieve this, a large and diverse set of problems is needed to refine and perfect the strategies of language models, which is a current area of research in large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) [(02:47:49)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10069s).
- Creating practice problems for all domains of knowledge is necessary to enable language models to reinforcement learn and create new thinking patterns in the domain of open thinking [(02:48:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10086s).
## [](#reinforcement-learning-from-human-feedback-\(rlhf\)-\(02%3A48%3A26\))reinforcement learning from human feedback (RLHF) [(02:48:26)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10106s)

- Reinforcement learning in unverifiable domains is a challenging problem, as it is difficult to score candidate solutions against a concrete answer, unlike in verifiable domains where solutions can be easily scored against a known answer [(02:48:26)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10106s).
- In unverifiable domains, such as creative writing tasks like writing a joke or a poem, it is harder to score different solutions to a problem, and human evaluation is often required [(02:49:25)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10165s).
- However, relying on human evaluation is not scalable, as it would require a large amount of human time to evaluate thousands of prompts and generations [(02:50:47)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10247s).
- A potential solution to this problem is [Reinforcement learning from human feedback](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback), which was proposed in a paper by [OpenAI](https://en.wikipedia.org/wiki/OpenAI) and has since been developed further at [Anthropic](https://en.wikipedia.org/wiki/Anthropic) [(02:51:19)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10279s).
- This approach aims to enable reinforcement learning in unverifiable domains by using human feedback to train models, rather than relying on automated evaluation metrics [(02:51:33)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10293s).
- The naive approach to reinforcement learning in unverifiable domains would require infinite human time, which is not feasible, and therefore alternative approaches like reinforcement learning from human feedback are needed [(02:51:41)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10301s).
- Large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) can generate multiple solutions to a problem, but evaluating their quality is a challenge, especially in unverifiable domains [(02:49:43)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10183s).
- LLMs can be used to generate jokes, but evaluating their humor is a difficult task, and human evaluation is often required [(02:50:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10206s).
- Reinforcement learning from human feedback is a promising approach to training LLMs in unverifiable domains, but it requires careful design and implementation to ensure that the models learn to generate high-quality solutions [(02:51:35)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10295s).
- Running reinforcement learning (RL) in these domains can be done with a large number of updates and rollouts, but doing it directly would require a huge amount of human evaluation, which is impractical [(02:51:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10306s).
- To address this issue, the approach of using a reward model is employed, which involves training a separate neural network to imitate human scores, allowing for indirect human involvement [(02:52:22)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10342s).
- The reward model is trained to simulate human preferences by asking humans to score a set of rollouts, and then the model imitates these scores, becoming a simulator of human preferences [(02:52:39)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10359s).
- With the reward model simulator, RL can be done against it, allowing for automatic querying and reinforcement learning without the need for real human evaluation [(02:52:58)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10378s).
- The simulator is not a perfect human, but if it is statistically similar to human judgment, it can still produce good results, and in practice, it has been shown to work [(02:53:22)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10402s).
- The process of training the reward model involves asking humans to order a set of rollouts from best to worst, rather than giving precise scores, which is an easier task for humans [(02:53:59)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10439s).
- The human ordering serves as supervision for the model, and separately, the reward model is asked to score the rollouts, taking the prompt and candidate joke as inputs and producing a single output score [(02:54:53)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10493s).
- A hypothetical reward model in the training process gives scores to jokes, ranging from 0 (the worst) to 1 (the best), with examples including 0.1 as a very low score and 0.8 as a really high score [(02:55:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10523s).
- The scores given by the reward model are compared to the ordering given by a human, using a precise mathematical way to calculate the correspondence and update the model [(02:55:43)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10543s).
- The goal is to make the reward model scores consistent with human ordering, with the model being updated to increase the score of jokes that humans find funnier and decrease the score of jokes that humans find less funny [(02:56:07)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10567s).
- The reward model becomes a better simulator of human scores and orders as it is updated on human data, allowing it to be used for reinforcement learning (RL) [(02:57:07)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10627s).
- The process involves training the model on a limited number of human-annotated prompts and rollouts, rather than requiring humans to look at a large number of jokes [(02:57:26)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10646s).
- The high-level idea is that the reward model gives a score that can be used to train the model to be consistent with human orderings, allowing for RL to be applied in arbitrary domains [(02:57:39)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10659s).
- The upside of [Reinforcement learning from human feedback](https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback) (RLHF) is that it allows for RL to be applied in domains that are unverifiable, such as summarization, poem writing, and joke writing [(02:57:59)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10679s).
- Empirically, RLHF has been shown to improve the performance of models, although the exact reason for this is not well established [(02:58:28)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10708s).
- The discriminator-generator gap is a possible reason why models like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) improve, as it is often easier for humans to discriminate than to generate, especially in tasks like summarization, poem writing, or joke writing [(02:58:53)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10733s).
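A sketch of that consistency calculation. The pairwise logistic (Bradley-Terry style) loss below is the common choice in RLHF work; the video does not commit to a specific formula, so treat this as one plausible instantiation:

```python
import math

def pairwise_loss(score_better, score_worse):
    """Small when the human-preferred rollout gets the higher score;
    minimizing it pushes reward-model scores toward the human ordering."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_better - score_worse))))

# A human ordered five jokes from best to worst; the reward model scored them.
human_order = ["joke_a", "joke_c", "joke_b", "joke_d", "joke_e"]
scores = {"joke_a": 0.8, "joke_b": 0.5, "joke_c": 0.3, "joke_d": 0.2, "joke_e": 0.1}

# Sum the loss over every pair implied by the human ordering; in training,
# gradient descent on this loss (through the reward network) nudges the
# scores into agreement. Here joke_c is human-preferred over joke_b but
# scored lower, so that pair contributes a large loss.
loss = sum(
    pairwise_loss(scores[better], scores[worse])
    for i, better in enumerate(human_order)
    for worse in human_order[i + 1:]
)
print(round(loss, 3))
```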
- Reinforcement learning from human feedback (RLHF) sidesteps the difficulty of human generation by asking labelers to order model outputs instead of writing ideal responses, making the task easier and yielding higher-accuracy data [(02:59:35)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10775s).
- RLHF lets models improve by discovering responses that would be graded well by humans, but it comes with significant downsides: the RL is run against a lossy simulation of humans rather than actual human judgment [(03:00:49)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10849s).
- The lossy simulation of humans can be misleading, as it is just a language model outputting scores that may not faithfully reflect the opinion of an actual human [(03:00:59)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10859s).
- Reinforcement learning is extremely good at discovering ways to game the simulation: the policy can find inputs that were never in the reward model's training set yet receive spuriously high scores [(03:01:31)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10891s).
- This gaming problem is exacerbated by the complexity of massive neural networks like the [Transformer (deep learning architecture)](https://en.wikipedia.org/wiki/Transformer_\(deep_learning_architecture\)), whose billions of parameters leave a vast input space in which spuriously high-scoring inputs can hide [(03:01:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10906s).
- [Large language model](https://en.wikipedia.org/wiki/Large_language_model)s like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) can produce nonsensical results on certain inputs, known as adversarial examples, which exploit the reward model's weaknesses and receive high scores despite being nonsense [(03:03:17)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=10997s).
- These adversarial examples can be found by iterating through the model's input space and discovering specific inputs that produce nonsensical results yet game the reward function [(03:03:29)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11009s).
- Adding discovered adversarial examples to the reward model's training data with low scores teaches it to avoid them, but there is effectively an infinite supply of new adversarial examples to discover [(03:03:51)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11031s).
- Reinforcement learning (RL) can optimize the model's performance, but it can just as easily optimize its way into gaming the reward function, leading to nonsensical results [(03:04:11)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11051s).
- The scoring function used in RLHF is itself a complex neural network that can be tricked by the policy, making indefinite optimization unreliable [(03:04:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11063s).
- As a result, RLHF cannot be run indefinitely: the run must be cut short ("cropped") and the model shipped after a certain number of updates, before the reward model is gamed too badly (the toy loop below illustrates this failure mode) [(03:04:39)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11079s).
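A self-contained toy illustrating the failure mode. Everything here is a stand-in: the "policy" is a bare categorical distribution and the "reward model" a frozen random scorer, a caricature rather than a training recipe.

```python
# Toy illustration of reward-model gaming: a trivially simple "policy" is
# optimized with REINFORCE against a frozen, imperfect "reward model".
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, SEQ, MAX_UPDATES = 32, 8, 200   # the run is "cropped" at MAX_UPDATES

policy_logits = nn.Parameter(torch.zeros(VOCAB))   # toy policy over tokens
opt = torch.optim.Adam([policy_logits], lr=0.1)
reward_w = torch.randn(VOCAB)   # frozen random scorer standing in for the
                                # learned (and therefore lossy) reward model

for step in range(MAX_UPDATES):
    dist = torch.distributions.Categorical(logits=policy_logits)
    tokens = dist.sample((SEQ,))                  # one rollout of toy tokens
    reward = reward_w[tokens].mean()              # reward-model score
    loss = -dist.log_prob(tokens).sum() * reward  # REINFORCE step
    opt.zero_grad()
    loss.backward()
    opt.step()

# The policy gravitates onto whatever token the frozen scorer happens to
# overrate: a miniature version of finding inputs that the simulation of
# human preference rewards for the wrong reasons. Running longer deepens it.
print(policy_logits.argmax().item(), reward_w.argmax().item())
```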
- In contrast, RL can be run indefinitely in verifiable domains, where the scoring function is simple and far harder to game [(03:05:32)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11132s).
- In these domains, RL can discover complex strategies that may not have been thought of before, and it keeps performing well even after tens of thousands of steps [(03:05:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11136s).
- The key difference is that RLHF in [Large language model](https://en.wikipedia.org/wiki/Large_language_model)s is not RL in the "magical" sense: unlike RL in verifiable domains, it cannot be run indefinitely without being gamed [(03:05:03)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11103s).
## [](#preview-of-things-to-come-\(03%3A09%3A39\))preview of things to come [(03:09:39)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11379s)
- Future capabilities of Large Language Models (LLMs) include becoming multimodal, allowing them to handle not just text but also audio and images natively and easily, enabling natural conversations [(03:09:40)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11380s).
- This multimodality will be achieved by tokenizing audio and images and applying the same approaches used for text: audio can be tokenized from slices of the signal's spectrogram, and images can be tokenized using patches (see the patch-tokenization sketch after this list) [(03:10:38)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11438s).
- Another future capability is agents that perform tasks over time, requiring supervision and potentially leading people to speak of human-to-agent ratios in the digital space [(03:11:56)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11516s).
- These models will become more pervasive and invisible, integrated into tools everywhere, much as computers themselves did [(03:12:38)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11558s).
- [Large language model](https://en.wikipedia.org/wiki/Large_language_model)s will also be able to take actions on behalf of users, with the launch of ChatGPT's Operator being an early example of this capability [(03:12:56)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11576s).
- Currently, most work involves handing individual tasks to models, but future models will string tasks together into longer-running jobs, still under human supervision [(03:11:20)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11480s).
- Multimodal LLMs will handle streams of tokens representing audio, images, and text simultaneously in a single model [(03:11:06)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11466s).
- Large Language Models (LLMs) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) could potentially be handed control of keyboard and mouse to act on behalf of the user, which is an interesting development [(03:13:04)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11584s).
- There is still a lot of research to be done in the domain of LLMs, one example being test-time training, which would let the model learn and adapt during inference [(03:13:16)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11596s).
- Current LLMs have a fixed set of parameters during inference and do not learn from the data they process, unlike humans, who learn and adapt from their experiences [(03:13:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11617s).
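As a sketch of the patch-based image tokenization mentioned above, in the ViT style; the patch size, embedding width, and linear projection are assumptions for illustration, not a specific model's configuration.

```python
# ViT-style patch "tokenization": carve an image into fixed-size patches and
# project each one to an embedding, yielding a token sequence like text's.
import torch
import torch.nn as nn

P = 16                                   # patch size (assumption)
img = torch.randn(1, 3, 224, 224)        # (batch, channels, height, width)

# Carve the image into non-overlapping 16x16 patches and flatten each one.
patches = img.unfold(2, P, P).unfold(3, P, P)        # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, 3 * P * P)

# A linear projection maps each flattened patch to an embedding, so the image
# becomes a sequence of 196 "tokens" the transformer can process like text.
to_token = nn.Linear(3 * P * P, 768)
image_tokens = to_token(patches)         # (1, 196, 768)
print(image_tokens.shape)
```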
- The context window in a [Large language model](https://en.wikipedia.org/wiki/Large_language_model) is a finite and precious resource, and as tasks become longer and more multimodal, new ideas are needed to handle them effectively [(03:14:21)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11661s).
- Current approaches to handling long contexts, such as simply making the context window longer, may not be sufficient, and new ideas are needed to scale to genuinely long-running tasks [(03:14:43)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11683s).
- Researchers and developers can expect new developments and advancements in the field of LLMs, particularly in areas such as test-time training and handling long contexts [(03:15:00)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11700s).
## [](#keeping-track-of-llms-\(03%3A15%3A15\))keeping track of LLMs [(03:15:15)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11715s)
- LM Arena is a leaderboard that ranks top LLM models based on human comparisons: humans prompt the models and judge which one gives the better answer without knowing which model is which, producing a ranking of models with their organizations and licenses [(03:15:15)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11715s).
- The leaderboard currently shows Google's [Gemini (chatbot)](https://en.wikipedia.org/wiki/Gemini_\(chatbot\)) model on top, followed by [OpenAI](https://en.wikipedia.org/wiki/OpenAI), with [DeepSeek](https://en.wikipedia.org/wiki/DeepSeek) in position three; DeepSeek is an MIT-licensed model that anyone can use, download, and host themselves [(03:15:55)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11755s).
- The leaderboard also features other models from [Google](https://en.wikipedia.org/wiki/Google) and OpenAI, as well as models from organizations such as xAI, [Anthropic](https://en.wikipedia.org/wiki/Anthropic), and [Meta Platforms](https://en.wikipedia.org/wiki/Meta_Platforms), some open weights and some proprietary [(03:16:29)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11789s).
- However, the leaderboard has become a little bit gamed in recent months, so it is best used as a first pass; try out a few models on your own tasks to see which performs better [(03:17:00)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11820s).
- AI News is a comprehensive newsletter produced by swyx and friends that is extremely helpful for staying up to date with the latest developments in the field, with a mix of human-written and automatically constructed content [(03:17:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11857s).
- The newsletter comes out almost every other day and covers a wide range of topics, making it a valuable resource for staying informed about the latest advancements in AI [(03:17:50)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11870s).
- Following knowledgeable, trustworthy people on X (formerly [Twitter](https://en.wikipedia.org/wiki/Twitter)) is another good way to stay informed about the latest developments [(03:18:20)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11900s).
## [](#where-to-find-llms-\(03%3A18%3A34\))where to find LLMs [(03:18:34)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11914s)
- Proprietary models can be found on the websites of their respective providers, such as [OpenAI](https://en.wikipedia.org/wiki/OpenAI) for [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) and [Google](https://en.wikipedia.org/wiki/Google) for [Gemini (chatbot)](https://en.wikipedia.org/wiki/Gemini_\(chatbot\)) at [gemini.google.com](https://gemini.google.com) or AI Studio [(03:18:42)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11922s).
- Open weights models like DeepSeek and Llama can be accessed through inference providers of [Large language model](https://en.wikipedia.org/wiki/Large_language_model)s, with Together.ai being a popular option that offers a playground for interacting with various models [(03:19:04)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11944s).
- Base models can be found on platforms like Hyperbolic, which hosts the [Llama (language model)](https://en.wikipedia.org/wiki/Llama_\(language_model\)) 3.1 base model, although base models are less commonly served than other model types [(03:19:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11977s).
- Smaller models can be run locally on a computer, with options like DeepSeek offering distilled versions that can be run at lower precision, making them more accessible for local use [(03:19:57)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=11997s).
- LM Studio is a popular app for running models locally, offering a range of models, including distilled and lower-precision versions, although the interface can be confusing and is geared towards professionals [(03:20:37)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12037s).
## [](#grand-summary-\(03%3A21%3A46\))grand summary [(03:21:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12106s)
- When a query is entered into a website like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT), it is first chopped up into tokens by a tokenizer and then inserted into a conversation protocol format, a way of representing conversation objects, resulting in a one-dimensional token sequence (a sketch of this step follows after this list) [(03:22:10)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12130s).
- When the user hits "go", the model continues the token sequence, acting like a token autocomplete, generating a response [(03:22:46)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12166s).
- The response generated by the model is the result of a three-stage process, the first stage being pre-training, where the neural network internalizes knowledge from the internet [(03:23:25)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12205s).
- Pre-training is followed by supervised fine-tuning, where a company like [OpenAI](https://en.wikipedia.org/wiki/OpenAI) curates a large dataset of conversations and hires human data labelers to write ideal assistant responses to arbitrary prompts [(03:23:44)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12224s).
- The human data labelers are given labeling instructions, and their task is to teach the neural network by example how to respond to prompts; the result is a neural network that simulates a data labeler at OpenAI [(03:24:14)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12254s).
- The response generated by the model is a simulation of a human data labeler's response, not an actual human response, and it is generated in seconds rather than the hours a human would take [(03:24:57)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12297s).
- The neural network's simulation differs from human brain function, and what is easy or hard for the network differs from what is easy or hard for humans [(03:25:08)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12308s).
- The model's response is fundamentally a token stream, generated by the neural network with a bunch of activations and neurons in between [(03:25:17)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12317s).
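To make the token-sequence framing concrete, here is a small sketch of the two steps above: wrapping the query in a chat format and encoding it to one flat token sequence, then a toy autocomplete loop. The chat-format markers are ChatML-style placeholders, not any provider's exact protocol, and `TinyLM` is a stand-in for the real network; only the tiktoken calls are real library usage.

```python
# Sketch of "query -> tokens -> autocomplete". Requires `pip install tiktoken`.
import tiktoken
import torch
import torch.nn as nn

enc = tiktoken.get_encoding("cl100k_base")

# 1) Wrap the user query in a conversation protocol and flatten to token IDs.
conversation = (
    "<|im_start|>user\nWhy did the chicken cross the road?\n<|im_end|>\n"
    "<|im_start|>assistant\n"        # the model autocompletes from here
)
ctx = torch.tensor(enc.encode(conversation, disallowed_special=()))

# 2) Token autocomplete: the same fixed computation is applied once per token,
#    with parameters frozen; the model does not learn during inference.
class TinyLM(nn.Module):
    def __init__(self, vocab: int):
        super().__init__()
        self.emb = nn.Embedding(vocab, 32)
        self.out = nn.Linear(32, vocab)

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        return self.out(self.emb(ctx).mean(dim=0))    # next-token logits

model = TinyLM(enc.n_vocab)
for _ in range(5):                                    # emit 5 tokens
    probs = torch.softmax(model(ctx), dim=-1)
    nxt = torch.multinomial(probs, 1)                 # sample the next token
    ctx = torch.cat([ctx, nxt])                       # append and repeat
print(ctx.tolist()[-5:])   # five sampled token IDs (gibberish: TinyLM is untrained)
```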
- Large Language Models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)s) like [ChatGPT](https://en.wikipedia.org/wiki/ChatGPT) use a fixed mathematical expression that mixes token inputs with model parameters to generate the next token in a sequence, and this process is a finite, lossy simulation of human thought [(03:25:26)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12326s).
- LLMs suffer from limitations, including hallucinations and a "Swiss cheese" model of capabilities: they may arbitrarily make mistakes or struggle with certain tasks, for example when a problem needs more tokens than they spend on it or when mental arithmetic breaks down [(03:26:10)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12370s).
- The models' capabilities are bounded by their computational process, which imitates human data labelers following labeling instructions, with all the limitations and potential errors that implies [(03:26:50)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12410s).
- Thinking models, such as o3-mini, use reinforcement learning (RL) to refine their thinking process and discover new strategies, which can lead to more unique and interesting responses [(03:27:35)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12455s).
- These thinking models practice on a large collection of curated problems, developing internal monologues and problem-solving strategies that resemble human thought [(03:27:51)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12471s).
- When interacting with thinking models, the responses are not just simulations of human data labelers, but emergent functions of thinking that arise from the RL process [(03:28:27)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12507s).
- Whether thinking strategies developed in verifiable domains transfer and generalize to other areas is still an open question [(03:28:54)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12534s).
- The extent to which reinforcement learning (RL) can be applied to unverifiable domains is unknown, and it is unclear whether the benefits of RL transfer to such domains [(03:29:05)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12545s).
- RL on large language models ([Large language model](https://en.wikipedia.org/wiki/Large_language_model)s) is still in its early stages, and the field is just beginning to see hints of greatness in reasoning problems and open-domain thinking [(03:29:22)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12562s).
- In principle, these models are capable of achieving something equivalent to move 37 in the game of Go, but in open-domain thinking and problem solving, and may even come up with analogies no human has thought of before [(03:29:36)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12576s).
- These models will likely shine in verifiable domains such as math and code, but may struggle in unverifiable domains [(03:30:05)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12605s).
- Despite their shortcomings, LLMs are incredibly useful tools for accelerating work and can be used daily for significant productivity gains [(03:30:23)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12623s).
- However, it is essential to be aware of their limitations, including their tendency to randomly do dumb things, hallucinate, or skip over mental arithmetic, and to use them as tools in a toolbox rather than trusting them fully [(03:30:40)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12640s).
- To get the most out of LLMs, use them for inspiration, first drafts, and asking questions, but always check and verify their work, and own the product of your work yourself [(03:30:58)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12658s).
- [Large language model](https://en.wikipedia.org/wiki/Large_language_model)s have the potential to create a huge amount of wealth, but it is crucial to be aware of their shortcomings and to use them responsibly [(03:30:33)](https://www.youtube.com/watch?v=7xTGNNLPyMI&t=12633s).