# How to sample text from trained LLMs
Once we have a trained LLM, it returns, given some text, a probability distribution over the next token. How can we use this probability distribution to generate new text? This is a deeper question than it seems, and it will be the topic of today’s lecture.
---
## Reminder on LLM training
Let’s recall how an LLM is generally (pre-)trained.
The dataset $\mathcal{D}$ we consider is a large corpus of text, segmented in practice into sequences of tokens $\mathbf{x}^\mathcal{D}=(x_1^\mathcal{D}, x_2^\mathcal{D}, \ldots, x_n^\mathcal{D})$, $x_i^\mathcal{D} \in \mathcal{V}$, where $\mathcal{V}$ is a finite vocabulary of tokens.
Given $\mathbf{x}^\mathcal{D}$, the model outputs $\mathbb{P}_{\mathcal{M}}(x_i=x | x^\mathcal{D}_{i-1}, \ldots, x^\mathcal{D}_{1})$ for each position $i$ in the sequence. The loss minimized during training is the cross-entropy:
$$
\mathcal{L} = - \sum_{i=1}^n \log \mathbb{P}_{\mathcal{M}}(x_i = x_i^\mathcal{D} | x_{i-1}^\mathcal{D}, \ldots, x_1^\mathcal{D})
$$
This is the loss for one sequence; the model is trained to minimize it on average over the dataset $\mathcal{D}$.
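As a sketch, the per-sequence cross-entropy can be computed as follows. The `prob_fn` interface and the toy numbers are illustrative assumptions, not part of any real model API:

```python
import math

def cross_entropy(sequence, prob_fn):
    """Negative log-likelihood of `sequence` under the model.

    `prob_fn(prefix)` is assumed to return the next-token distribution
    as a dict {token: probability}, given the preceding tokens.
    """
    loss = 0.0
    for i, tok in enumerate(sequence):
        probs = prob_fn(sequence[:i])
        loss -= math.log(probs[tok])  # add -log P(x_i | x_1, ..., x_{i-1})
    return loss

# Toy context-independent model: P(a) = 1/2, P(b) = P(c) = 1/4.
toy_model = lambda prefix: {"a": 0.5, "b": 0.25, "c": 0.25}

loss = cross_entropy(["a", "b", "a"], toy_model)
# loss = -(log 1/2 + log 1/4 + log 1/2) = log 16
```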
---
In reality, the model outputs logits $\ell_i(x)$, which are unnormalized log-probabilities: $\ell_i(x) = \log \mathbb{P}_{\mathcal{M}}(x_i=x | x_{i-1}, \ldots, x_1) + c_i$ for some constant $c_i$ that does not depend on $x$. To turn them into probabilities, we use the softmax function:
$$
\mathbb{P}_{\mathcal{M}}(x_i=x | x_{i-1}, \ldots, x_1) = \frac{e^{\ell_i(x)}}{\sum_{x' \in \mathcal{V}} e^{\ell_i(x')}}.
$$
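A minimal softmax can be sketched in a few lines of Python; subtracting the maximum logit before exponentiating is a standard numerical-stability trick that leaves the result unchanged:

```python
import math

def softmax(logits):
    """Convert unnormalized logits into a probability distribution."""
    m = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)  # normalization constant (denominator of the softmax)
    return [e / z for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sums to 1, and the largest logit gets the largest probability
```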
See exercise sheet 11 for more details.
---
## Generating text using the LLM
Assume we now have access to $\mathbb{P}_{\mathcal{M}}$: how can we use it to generate text? This is a 'sampling' problem, and the answer depends on what we want to achieve.
In all cases, we will need to proceed auto-regressively: to generate token $x_i$, we need the preceding tokens $x_1, \ldots, x_{i-1}$, since otherwise we cannot compute its probability distribution. So we generate tokens one by one.
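A generic autoregressive loop might look like the sketch below. The `next_token_distribution` interface is an assumed stand-in for the model; here each token is simply sampled from the full next-token distribution:

```python
import random

def generate(prefix, next_token_distribution, max_new_tokens, eos=None):
    """Generate tokens one by one, each conditioned on all previous ones.

    `next_token_distribution(tokens)` is assumed to return a dict
    {token: probability} for the next position.
    """
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)
        # Draw the next token according to the model's distribution.
        nxt = random.choices(list(probs), weights=probs.values())[0]
        tokens.append(nxt)
        if nxt == eos:  # stop early on an end-of-sequence token
            break
    return tokens
```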
---
### Greedy Sampling
The first idea that comes to mind is to choose the most probable token at each step. This is greedy sampling. More precisely, to generate the $i$-th token, we choose:
$$
x_i = \arg\max_{x \in \mathcal{V}} \mathbb{P}_{\mathcal{M}}(x_i=x | x_{i-1}, \ldots, x_1).
$$
---
In terms of producing plausible text, this is very bad. Indeed, consider a simple dataset of sequences of length $n$ composed of three tokens $\{a,b,c\}$, where each token is drawn independently with probabilities:
- $\mathbb{P}(a) = 1/2$
- $\mathbb{P}(b) = 1/4$
- $\mathbb{P}(c) = 1/4$
With greedy sampling, we would generate only sequences consisting entirely of $a$'s.
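This collapse can be illustrated with a short sketch; the `next_token_distribution` interface is an illustrative assumption, fixed here to the toy distribution of the example:

```python
def greedy_generate(prefix, next_token_distribution, max_new_tokens):
    """Greedy decoding: at each step, append the argmax of the model's
    next-token distribution (a dict {token: probability})."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(tokens)
        tokens.append(max(probs, key=probs.get))  # most probable token
    return tokens

# Toy distribution from the example, identical at every position:
# P(a) = 1/2, P(b) = P(c) = 1/4.
toy = lambda prefix: {"a": 0.5, "b": 0.25, "c": 0.25}

out = greedy_generate([], toy, 5)
# → ['a', 'a', 'a', 'a', 'a']: the argmax is always "a"
```

Even though a sequence of all $a$'s has probability only $2^{-n}$ under the data distribution, greedy decoding produces it every single time.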