Temperature is a parameter that controls how random the text generated by the LLM is. One way to think about it: when generating each token, the LLM knows which token is the most likely, which is the 2nd most likely, and so on (these scores are the **logits**). When actually choosing, the LLM balances between *predictability/truthfulness* and *creativity*. Temperature controls that balance.

In mathematical terms, we're trying to translate the **logits** (which you can think of as "scores") into probabilities:

$$\{L_1,\ldots,L_n\} \rightarrow \{p_1,\ldots,p_n\}$$

We do it with the softmax function, which guarantees:

$$p_i \propto \exp\left(\frac{L_i}{T}\right)$$

where $T$ is the temperature. The $\propto$ sign just denotes that we have to "normalize" the probabilities so they all sum to 1. In code:

```
import numpy as np

# exponentiate the temperature-scaled logits, then normalize so they sum to 1
probs = np.exp(logits/temperature)
probs = probs/np.sum(probs)
```

How temperature influences the distribution:

* If temperature is 1, it becomes $p_i \propto \exp(L_i)$, i.e. we're picking based on the model's original distribution (the exponentiated logits) exactly.
* If temperature is <1, we "sharpen" the distribution, favoring the more likely tokens even more. [[In the Limit]] as $T$ approaches 0, the output becomes "greedy", outputting just the max-probability token.
* If temperature is >1, we "flatten" the distribution. [[In the Limit]] as $T$ grows, it becomes just a uniform random token generator.

(A short NumPy sketch at the end of this post demonstrates both limits in code.)

I vibe-coded a quick visualizer for this. Play with it below:

<iframe src="https://srulix.com/projects/temperature?standalone=true" width="100%" height="800px" frameborder="0"></iframe>

### Softmax

A really insightful way to think about the name "softmax" comes from Ian Goodfellow, in *Deep Learning*:

> *The name “softmax” can be somewhat confusing. The function is more closely related to the arg max function than the max function. The term “soft” derives from the fact that the softmax function is continuous and differentiable. The arg max function, with its result represented as a one-hot vector, is not continuous or differentiable. The softmax function thus provides a “softened” version of the arg max. The corresponding soft version of the maximum function is softmax(z)⊤z. It would perhaps be better to call the softmax function “softargmax,” but the current name is an entrenched convention.*

### Physics origin

The origin of the name "temperature" comes from... statistical mechanics! Temperature in statistical mechanics is like a "bounciness control" for particles in a system.

When temperature is low, particles have little energy, so they mostly stay in their lowest energy states - like balls settling at the bottom of a jar. They behave in predictable, orderly ways, similar to molecules in ice that are locked into rigid positions. This is why molecules in ice barely move.

As temperature increases, particles gain more energy and can access higher energy states, bouncing around more randomly. At high temperatures, particles can reach almost any energy state with similar probability, creating disorder and unpredictability - like balls bouncing wildly throughout a jar, or molecules in steam moving freely in all directions. This is why gas molecules move a lot.

The relationship between the bounciness (temperature) and which energy states we expect the particles to occupy (height) is given by the Boltzmann distribution. In the LLM case, the "height" is how far down the list of most probable tokens the model will go. At low temperature, the ball doesn't bounce - the model just gives the most probable, greedy answer. At high temperature, the ball can bounce pretty high - and pick an "unlikely" token. It can be a creative novel, or it can be a hallucination!
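To make the analogy concrete, here is the formula the physics paragraph alludes to. The Boltzmann distribution says that a system at temperature $T$ occupies a state with energy $E_i$ with probability

$$p_i \propto \exp\left(-\frac{E_i}{k_B T}\right)$$

where $k_B$ is Boltzmann's constant. Read a logit as a negative energy, $L_i = -E_i$ (with $k_B$ absorbed into $T$), and this is exactly the softmax-with-temperature formula from the top of the post: high logits play the role of low energies, so low temperature piles all the probability onto the "lowest-energy" (most likely) token, while high temperature lets the sampler bounce up to "high-energy" (unlikely) ones.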
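And here is the NumPy sketch promised after the bullet list: a minimal, self-contained demo of temperature sampling. The helper name `sample_with_temperature`, the five-token "vocabulary", and the logit values are all made up for illustration; the only real dependency is NumPy.

```
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Temperature-scaled softmax, then sample one token index."""
    scaled = np.asarray(logits, dtype=float) / temperature
    # Subtracting the max before exponentiating avoids overflow and
    # doesn't change the normalized probabilities.
    probs = np.exp(scaled - np.max(scaled))
    probs = probs / np.sum(probs)
    return probs, rng.choice(len(probs), p=probs)

# Made-up toy vocabulary and logits, just for illustration.
tokens = ["the", "a", "dog", "banana", "xylophone"]
logits = [4.0, 3.0, 2.0, 0.5, -1.0]
rng = np.random.default_rng(0)

for T in [0.1, 1.0, 2.0, 10.0]:
    probs, idx = sample_with_temperature(logits, T, rng)
    dist = ", ".join(f"{t}: {p:.2f}" for t, p in zip(tokens, probs))
    print(f"T={T:>4}: sampled '{tokens[idx]}'  |  {dist}")
```

At $T=0.1$ the printed distribution is nearly one-hot on the top token (the "soft argmax" becoming a hard argmax), while at $T=10$ it is close to uniform, so the sampled token bounces around the whole vocabulary, just like the ball in the jar.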
#published 2025-03-01