An N-gram [[language model]] uses counts of word sequences of length $N$ gathered from a corpus to assign a probability to a given sequence. By the chain rule, the probability of a sequence is the product of the probability of each word conditioned on the words that precede it:
$P(w_{1:n}) = \prod_{i=1}^{n} P(w_i \mid w_{1:i-1})$
Rather than conditioning on the entire preceding history, the model assumes that each word depends only on the $N-1$ words immediately before it, i.e. $P(w_i \mid w_{1:i-1}) \approx P(w_i \mid w_{i-N+1:i-1})$, which for historical reasons is known as the [[Markov independence assumption]]. Each conditional probability is then estimated from the relative frequency of the corresponding n-gram in the corpus.
Higher values of $N$ provide more context, but they base predictions on smaller counts and risk involving sequences that never occur in the corpus. For this reason $N$ is kept small in practice, as in a bigram ($N=2$) or trigram ($N=3$) model; at very large $N$ the model tends to reproduce the corpus exactly.
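As a concrete illustration, the following Python sketch estimates bigram probabilities by relative frequency; the toy corpus and function names are invented for this example, not taken from any standard library.

```python
# A sketch of maximum-likelihood bigram (N=2) estimation on a toy corpus.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)                  # counts of single words
bigram_counts = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs

def bigram_prob(prev, word):
    """P(word | prev), estimated as count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))  # 2/3: "the" occurs 3 times, "the cat" twice
```

The probability of a whole sequence is then the product of these conditional estimates over its bigrams.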
Any n-gram that does not appear in the corpus is assigned a probability of 0, and a sufficiently long sequence will almost certainly contain one. [[Smoothing]] is used to reallocate some probability mass to these unseen n-grams. According to [[Zipf's law]], n-gram distributions have a long tail, so the matrix of n-gram counts is sparse, which can complicate computation.
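One simple scheme is add-one (Laplace) smoothing, sketched below on the same toy bigram counts; as before, the corpus and names are illustrative.

```python
# A sketch of add-one (Laplace) smoothing on the toy bigram counts.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)  # vocabulary size

def smoothed_bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing: every possible bigram is
    treated as if it were seen once more than it actually was, so unseen
    bigrams get a small nonzero probability."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(smoothed_bigram_prob("cat", "mat"))  # unseen bigram: 1 / (2 + 6) = 0.125
print(smoothed_bigram_prob("the", "cat"))  # seen bigram: (2 + 1) / (3 + 6) ≈ 0.33
```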
N-gram language models can be used with [[autoregressive generation]] to produce new text by repeatedly sampling the next word conditioned on the words generated so far. This idea is attributed to [[Claude Shannon]].
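The sketch below illustrates autoregressive generation from a bigram model by sampling each next word from the continuations observed in the corpus; the corpus, seed word, and helper names are again illustrative.

```python
# A sketch of autoregressive generation from a bigram model: start from a
# seed word and repeatedly sample the next word from observed continuations.
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

successors = defaultdict(Counter)  # successors[prev][word] = count(prev, word)
for prev, word in zip(corpus, corpus[1:]):
    successors[prev][word] += 1

def generate(seed, length=8):
    words = [seed]
    for _ in range(length - 1):
        counts = successors[words[-1]]
        if not counts:  # the previous word was never followed by anything
            break
        candidates, weights = zip(*counts.items())
        words.append(random.choices(candidates, weights=weights)[0])
    return " ".join(words)

print(generate("the"))  # e.g. "the cat sat on the mat the cat"
```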
The use of n-gram language models peaked in the mid-2000s. With the rise of recurrent neural networks, neural language models eliminated the need for complex smoothing techniques and allowed conditioning on longer contexts.