# How to sample text from trained LLMs

Once we have a trained LLM, it returns, given some text, a probability distribution over the next token. How can we use this probability distribution to generate new text? This is a deeper question than it seems, and it will be the topic of today's lecture.

---

## Reminder on LLM training

Let's remember how an LLM is generally (pre-)trained. The dataset $\mathcal{D}$ we consider is a large corpus of text, which is in practice segmented into sequences of tokens $\mathbf{x}^\mathcal{D}=(x_1^\mathcal{D}, x_2^\mathcal{D}, \ldots, x_n^\mathcal{D})\,,\, x_i \in \mathcal{V}$ where $\mathcal{V}$ is a finite vocabulary of tokens. The model, given $\mathbf{x}^\mathcal{D}$, will output $\mathbb{P}_{\mathcal{M}}(x_i=x | x^\mathcal{D}_{i-1}, \ldots, x^\mathcal{D}_{1})$ for each position $i$ in the sequence. The loss that is minimized during training is the cross-entropy:

$ \mathcal{L} = - \sum_{i=1}^n \log \mathbb{P}_{\mathcal{M}}(x_i = x_i^\mathcal{D} | x_{i-1}^\mathcal{D}, \ldots, x_1^\mathcal{D}) $

This is the loss for one sequence; the model is trained to minimize it on average over the dataset $\mathcal{D}$.

---

In reality, the models output logits $\ell_i(x)$, which are unnormalized log-probabilities: they equal $\log \mathbb{P}_{\mathcal{M}}(x_i=x | x_{i-1}, \ldots, x_1)$ up to an additive constant. To turn them into probabilities, we use the softmax function:

$ \mathbb{P}_{\mathcal{M}}(x_i=x | x_{i-1}, \ldots, x_1) = \frac{e^{\ell_i(x)}}{\sum_{x' \in \mathcal{V}} e^{\ell_i(x')}}. $

See exercise sheet 11 for more details.

---

## Generating text using the LLM

Assume we now have access to $\mathbb{P}_{\mathcal{M}}$; how can we use it to generate text? This is a 'sampling' problem, and the answer depends on what we want to achieve. In all cases, we will need to proceed auto-regressively: to generate token $x_i$, we need the preceding tokens $x_1, \ldots, x_{i-1}$, otherwise we cannot compute its probability distribution. So we will generate tokens one by one.
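The softmax conversion from logits to probabilities can be sketched in a few lines of NumPy. The logit values below are invented for illustration:

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability; this does not
    # change the result, since softmax is invariant under shifts.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits over a tiny 5-token vocabulary at one position.
logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
probs = softmax(logits)
print(probs.sum())     # the probabilities sum to 1
print(probs.argmax())  # the most probable token matches the largest logit
```

The shift by the maximum logit is a standard overflow-avoidance trick rather than part of the mathematical definition.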
---

### Greedy Sampling

The first idea that comes to mind is to choose the most probable token at each step. This is greedy sampling. More precisely, to generate the $i$-th token, we choose:

$ x_i = \arg\max_{x \in \mathcal{V}} \mathbb{P}_{\mathcal{M}}(x_i=x | x_{i-1}, \ldots, x_1). $

---

In terms of producing plausible text, this is very bad. Indeed, consider a simple dataset of sequences of length $n$ composed of three tokens $\{a,b,c\}$, with probabilities:

- $\mathbb{P}(a) = 1/2$
- $\mathbb{P}(b) = 1/4$
- $\mathbb{P}(c) = 1/4$

With greedy sampling, we would generate sequences consisting only of $a$s. But in the dataset, a typical sequence contains $n/2$ $a$s, $n/4$ $b$s and $n/4$ $c$s. This is an extreme example, but the same thing happens in practice with LLMs, where greedy sampling can lead to repetition quite easily.

---

### Probabilistic Sampling

Another simple idea is to sample the next token according to the probability distribution given by the model, which is also very natural. We can introduce a temperature $T$ as a hyperparameter. To generate the $i$-th token, we randomly sample $x_i$ according to:

$ \mathbb{P}_{T}(x_i=x | x_{i-1}, \ldots, x_1) = \frac{e^{\ell_i(x)/T}}{\sum_{x' \in \mathcal{V}} e^{\ell_i(x')/T}} \equiv \mathbb{P}^i_{T}(x) $

We can show that when $T=1$, the distribution of the generated texts matches exactly the dataset distribution, assuming the model is perfect. So why use $T\neq 1$?

---

In practice, models are not perfect. Additionally, when sampling many tokens, the likelihood of generating low-probability tokens increases. Because the model is autoregressive, a single 'alien' token can lead to incoherent text.

<span style="color:rgb(173, 240, 185)">I am eating a delicious</span> <span style="color:rgb(230, 107, 107)"><b>advection</b></span><span style="color:rgb(223, 223, 124)"> fluid flow metric is important</span>

This issue also comes from the fact that the model is trained to predict ONE token.
It is not explicitly trained to generate many tokens.

---

So, $T$ is a dial with trade-offs:

- $T<1$ favors high-probability tokens, but might suffer from the same issues as greedy sampling. Additionally, it might generate texts that are too close to the training data.
- $T>1$ favors 'creativity', sampling low-probability tokens more often, but might lead to incoherent texts.

---

### Top-p and Top-k sampling

One of the drawbacks of probabilistic sampling is the 'alien' tokens that might pollute the context and degrade the next predictions. Typically, an 'alien' token has a probability which is essentially zero (numerically, $\approx 10^{-8}$). But if $|\mathcal{V}|=10^5$, the probability of sampling *some* such token at a given step is not negligible, and over a generation of thousands of tokens, one is likely to appear eventually. Top-k and top-p sampling are two ways to mitigate this issue.

---

For top-k, we sample probabilistically, but only among the $k$ most probable tokens. This ensures that we never sample very low-probability tokens. However, since $k$ is fixed, this might lead to problems when the probability distribution is either very peaked or very flat.

![[tokenstopk.jpeg|center|600]]

---

For top-p, we choose a threshold $p \in [0,1]$, and consider only the most probable tokens whose cumulative probability is at least $p$. More precisely, we find the smallest set $\mathcal{S}_i$ such that:

$\sum_{x \in \mathcal{S}_i} \mathbb{P}_{\mathcal{M}}(x_i=x | x_{i-1}, \ldots, x_1) \geq p.$

We then sample among this limited subset, after renormalizing its probabilities. This method is like top-k, but it adapts to the shape of the probability distribution.

![[tokenstopp.jpeg|center|600]]

---

In both cases, these tricks are 'theoretically unjustified', in the sense that when we use them, the generated text distribution does NOT match the learned distribution of the model. It turns out that in practice, for imperfect models, these methods often lead to better results.
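The ideas above (temperature, top-k, top-p) can be combined into one small next-token sampler. This is a minimal NumPy sketch, not any particular library's implementation; the function name and logit values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, T=1.0, k=None, p=None):
    # Temperature: divide logits by T before the softmax.
    # T < 1 sharpens the distribution, T > 1 flattens it.
    z = logits / T
    z = z - z.max()  # numerical stability
    probs = np.exp(z)
    probs = probs / probs.sum()

    order = np.argsort(probs)[::-1]  # tokens by decreasing probability
    keep = len(probs)
    if k is not None:
        # Top-k: keep only the k most probable tokens.
        keep = min(keep, k)
    if p is not None:
        # Top-p: keep the smallest set of most probable tokens whose
        # cumulative probability is at least p.
        cumulative = np.cumsum(probs[order])
        keep = min(keep, int(np.searchsorted(cumulative, p)) + 1)

    filtered = np.zeros_like(probs)
    filtered[order[:keep]] = probs[order[:keep]]
    filtered = filtered / filtered.sum()  # renormalize the surviving mass
    return rng.choice(len(probs), p=filtered), filtered

# Invented logits over a tiny 4-token vocabulary.
logits = np.array([2.0, 1.3, 0.8, 0.4])
tok, dist = sample_next_token(logits, T=1.0, p=0.7)
print(dist)  # tail tokens get exactly zero probability
```

Note the renormalization after truncation: the surviving tokens must again sum to probability one before sampling.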
## Sampling for tasks

Often for LLMs, the goal is not to generate text that matches the learned distribution, but to perform a specific task. Consider first the task of answering yes/no questions. Say we provide the model with a question such as 'Is the sky blue?'. If we continue with probabilistic sampling, we might get 'yes', 'no', or even 'indeed', 'why', etc. Instead, in this case, we should use greedy sampling, and limit the output to 'yes' and 'no'.

---

More generally, consider tasks such as 'write a function that computes the $n$-th Fibonacci number'. Denote the 'prompt' by $\mathcal{P}$, and the model completion by $\mathcal{C}$. If we have faith in the model, we might assume that the correct answer $\mathcal{C}^*$ is the most probable completion:

$\mathcal{C}^* = \arg\max_{\mathcal{C}} \mathbb{P}_{\mathcal{M}}(\mathcal{C} | \mathcal{P}).$

---

So the job is now to sample the most probable completion. If the completion contains $n_c$ tokens, the number of possible completions is $|\mathcal{V}|^{n_c}$, so exhaustive search is intractable. One approach is beam search, a greedy algorithm that keeps track of the $B$ (beam size) most probable partial completions. We start with the top $B$ most probable tokens. Then, at each step, we consider the $B|\mathcal{V}|$ possible continuations of the $B$ partial completions, keep only the top $B$ among them, and repeat. The total complexity is $O(n_c\times B \times |\mathcal{V}|)$.

![[beamsampling.jpeg]]
*Beam search example with a beam of size 3*

---

### Post-training for tasks

In practice, the LLM is usually post-trained on a dataset of (prompt, completion) pairs, using RL or the cross-entropy loss. This training shapes the probability distribution to favor only the correct completion, making standard sampling more effective. However, this destroys the generative capabilities of $\mathbb{P}_{\mathcal{M}}$: the model completions will now be significantly different from the dataset distribution.
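The beam search loop described above can be sketched as follows. Here `step_logprobs` is a hypothetical stand-in for a model call that returns next-token log-probabilities given a prefix; the toy 3-token distribution is invented for illustration:

```python
import math

def beam_search(step_logprobs, B, n_steps):
    # Each beam is a (token_tuple, cumulative_log_prob) pair.
    beams = [((), 0.0)]
    for _ in range(n_steps):
        candidates = []
        for prefix, score in beams:
            # Extend each partial completion by every possible token.
            for tok, lp in enumerate(step_logprobs(prefix)):
                candidates.append((prefix + (tok,), score + lp))
        # Keep only the B highest-scoring partial completions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:B]
    return beams[0]  # best completion found, with its log-probability

# Toy 'model': the next-token distribution is fixed regardless of prefix.
def step_logprobs(prefix):
    return [math.log(0.2), math.log(0.7), math.log(0.1)]

best, score = beam_search(step_logprobs, B=2, n_steps=2)
print(best)  # → (1, 1), the most probable completion
```

Since scores are cumulative log-probabilities, the sort picks the partial completions with the highest joint probability so far; with a real model, `step_logprobs` would be one forward pass per prefix.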
---

# Other uses of the LLM probability distribution

Traditionally, the distribution $\mathbb{P}_{\mathcal{M}}$ is used as a tool to generate text. But this is only one aspect of what can be done with it. As we have seen in the 'sky is blue' example, there are other ways of exploiting the probability distribution.

---

## Implicit vs explicit knowledge

Without going into details, we can separate the knowledge contained in an LLM into 'explicit' and 'implicit' knowledge. Explicit knowledge is the knowledge that can be probed by simple sampling of the LLM measure; namely, asking the LLM a question and looking at the answer. Implicit knowledge is potentially much vaster; it is the knowledge that we can extract (with some reasonable amount of computation) from the LLM probability measure.

---

For instance, when asked to solve an exercise, a model might have very low accuracy or entirely fail to provide the correct answer. But, by analyzing the cross-entropy of different proof steps, it might be that the model still assigns higher probabilities to correct steps, which would make it possible to craft a proof using the LLM measure in an intricate way. The information is in the LLM probability distribution, but not accessible through simple sampling.

---

A fun example (from Clément, see [https://www.xentlabs.ai/game](https://www.xentlabs.ai/game)) is the 'summarization game'. Given a text paragraph $\mathcal{P}$, the goal is to find a prefix $\mathcal{T}$ such that the cross-entropy $-\log \mathbb{P}_{\mathcal{M}}(\mathcal{P} | \mathcal{T})$ is minimized (where $\mathcal{M}$ is a fixed LLM).

---

## Arrows of Time

A surprising example of unexpected signal coming from the LLM probability distribution arises when we try to train a model to predict the previous token, instead of the next one. The order in which the LLM is trained to predict tokens is arbitrary. For chatbots, it is natural to predict the next token (you write in the forward direction).
But you can just as well train a model to predict the previous token (it writes backwards).

---

It turns out that, counterintuitively, the order in which the tokens are predicted is irrelevant from the information-theoretic point of view (cf. exercise sheet 11). For both directions, the optimal achievable cross-entropy loss is the same. Still, when training identical LLMs on identical data in different directions (forward and backward), we observe the following loss graph during training:

![[drawn_languagediffs.png| center | 600]]
*Cross-entropy loss during training, for different languages, FW and BW. The jump in the middle is due to a learning rate reset.*

---

All languages (no exception) show a mild (0.5-2%) but systematically lower loss for the forward model, which we coin the 'Arrow of Time' of language, discovered through LLMs. How can that arise? One part of the explanation comes from computational hardness. Consider a set of sentences of the form $p_1 \times p_2 = n$, where $p_1, p_2$ are prime numbers. Though both $p_1\times p_2$ and $n$ contain the same information, predicting $n$ given the primes is easy (a multiplication), but the reverse amounts to factoring $n$, which is hard! You will explore this example in more depth in the exercises.

---

This does not explain the consistent direction of the AoT. For that, consider a toy model of 'linear languages'. All sentences are of the form $x \Leftrightarrow y$, where $x,y$ are bit strings related by an (invertible) linear transformation $y=A^{\rightarrow} x,\,\,x = A^{\leftarrow} y,\,\,(A^{\rightarrow})^{-1}=A^{\leftarrow}$ over $\mathbb{F}_2$. The forward model needs to learn $A^{\rightarrow}$; the backward model needs to learn $A^{\leftarrow}$. Key observation: if the forward direction is 'simple' ($\approx A^{\rightarrow}$ is sparse), then the backward direction will in general be more 'complex' ($\approx A^{\leftarrow}$ is denser)!

---

To apply this reasoning to natural languages, there is still some work to do.
Roughly, the idea is that to communicate efficiently, we will use language that is 'simple' to learn, and it will necessarily be simple in the direction in which we communicate (forward). By the reasoning above, the reverse direction will in general be less sparse, and therefore harder to learn!

---

The whole story is likely more complicated:

- The environment is time-asymmetric (causal); this may leak into language.
- Link with irreversibility, thermodynamics?
- Language is shaped at the level of populations, not individual agents.

Related questions (see more at [https://arxiv.org/abs/2401.17505](https://arxiv.org/abs/2401.17505)):

- What else has an AoT? Code (yes), DNA (no), animal communication (unknown)?
- Is AoT a signature of intelligent processing? Do natural time series (e.g. solar flares) have an AoT?
- AoTs in other modalities (video, audio)?
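The linear-language toy model from the Arrows of Time discussion can be illustrated numerically. Below is a small sketch over $\mathbb{F}_2$, with one convenient (assumed) choice of sparse forward map whose inverse turns out to be much denser:

```python
import numpy as np

def inverse_gf2(A):
    # Gauss-Jordan inversion over F_2: arithmetic is modulo 2,
    # so row elimination is a bitwise XOR of integer rows.
    n = A.shape[0]
    M = np.concatenate([A % 2, np.eye(n, dtype=int)], axis=1)
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r, col])
        M[[col, pivot]] = M[[pivot, col]]  # bring a pivot row into place
        for r in range(n):
            if r != col and M[r, col]:
                M[r] ^= M[col]  # eliminate the column everywhere else
    return M[:, n:]

# A sparse invertible 'forward' map: identity plus one superdiagonal.
n = 12
A = np.eye(n, dtype=int)
for i in range(n - 1):
    A[i, i + 1] = 1

A_inv = inverse_gf2(A)
# Count nonzero entries: the backward map is far denser than the forward one.
print(A.sum(), A_inv.sum())
```

Here the forward map touches only two bits per output, while its inverse (the backward map) fills the whole upper triangle, mirroring the claim that a simple forward direction generally has a more complex inverse.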