In the context of [[🔎 Large Language Models (LLMs)]], **tokens** are the fundamental units of text that the model processes. Whereas humans read in whole words and sentences, an LLM breaks text down into smaller chunks called tokens before processing it.
A token can be:
- A whole word (like "cat" or "the")
- Part of a word (like "ing" or "pre")
- A single character (especially for uncommon words)
- A punctuation mark
- A space (often attached to the start of the following word, as in the example below)
**For example:** The phrase "ChatGPT is amazing!" might be broken into tokens like:
```
["Chat", "GPT", " is", " amazing", "!"]
```
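You can inspect a real tokenizer's output with OpenAI's open-source `tiktoken` library. A minimal sketch, assuming `tiktoken` is installed (`cl100k_base` is the encoding used by GPT-4; the exact split shown in the comments may vary by encoding):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the byte-pair encoding used by GPT-4 and GPT-3.5-turbo
enc = tiktoken.get_encoding("cl100k_base")

text = "ChatGPT is amazing!"
token_ids = enc.encode(text)

# Decode each token id individually to see how the text was split
pieces = [enc.decode([tid]) for tid in token_ids]
print(pieces)     # e.g. ['Chat', 'GPT', ' is', ' amazing', '!']
print(token_ids)  # the integer ids the model actually sees
```

Note that the model never sees the text itself, only the integer ids; the strings above are just the human-readable decoding of each id.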
In most modern LLMs, such as GPT-4, a token typically corresponds to about ¾ of a word in English text. A typical page of text (500 words) therefore comes to roughly 500 ÷ ¾ ≈ 670 tokens, or about 700 in round numbers.
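The ¾-word rule of thumb is easy to check empirically. A small sketch, again assuming `tiktoken`; the sample sentence is arbitrary, and the ratio will drift for code, non-English text, or unusual vocabulary:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = (
    "Tokens are the fundamental units of text that a large language "
    "model processes, and this sentence exists purely so we have "
    "something to count."
)

n_tokens = len(enc.encode(text))
n_words = len(text.split())

# For ordinary English prose this ratio tends to land near 0.75,
# i.e. roughly 4/3 tokens per word.
print(f"{n_words} words, {n_tokens} tokens, "
      f"{n_words / n_tokens:.2f} words per token")
```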