## generative models
Generative models learn distributions.
Discriminative models, in contrast, learn decision boundaries that separate classes or predictions given the data.
In terms of probabilities, the generative model is concerned with predicting $X$ given some label $y$ as in $P(X | y)$ whereas the discriminative model is concerned with predicting $y$ given some values $X$ as in $P(y|X)$.
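As a concrete illustration (a hedged sketch using scikit-learn and toy data, neither of which appears in these notes), a Gaussian naive Bayes classifier is generative because it models $P(X | y)$ per class, while logistic regression is discriminative because it models $P(y|X)$ directly:
```python
# Sketch contrasting a generative and a discriminative classifier with scikit-learn
# (library choice and toy data are assumptions, not from the note).
import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: models P(X | y)
from sklearn.linear_model import LogisticRegression   # discriminative: models P(y | X)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

generative = GaussianNB().fit(X, y)               # learns per-class Gaussians over X
discriminative = LogisticRegression().fit(X, y)   # learns a decision boundary directly

print(generative.predict_proba([[1.5, 1.5]]))      # P(y | X) obtained via Bayes' rule
print(discriminative.predict_proba([[1.5, 1.5]]))  # P(y | X) modeled directly
```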
## generative space
Generative space refers to the set of all possible values a generative model can create.
Consider a generative model that creates a 4-pixel image where each pixel takes one of three color values: R, G, or B. There are 4 slots and a "vocabulary" of 3, giving $3^4 = 81$ possibilities.
A 4-word sentence with a vocabulary of 5 words provides $5^4 = 625$ possibilities.
In general, the number of possibilities is the vocabulary size $V$ raised to the number of dimensions $D$:
$V^D$
For multi-modal examples, such as generating an image together with a caption, calculate the generative space for each modality and multiply them together. For example, the generative space for a 32-pixel image with 3 channels of 8-bit color (256 values per channel), combined with a 20-word caption drawn from a 10,000-word vocabulary, would be
$(256^3)^{32} \cdot 10000^{20}$
As you can imagine, the generative space grows quite fast for applications like language and image generation!
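A quick sketch of this arithmetic (the dimensions are the ones used above; the code itself is just illustrative):
```python
# Sketch: counting the generative space V**D for each modality, then multiplying.
pixels, channels, levels = 32, 3, 256       # 32 pixels, 8-bit RGB (256 values/channel)
caption_len, vocab = 20, 10_000             # 20-word caption, 10,000-word vocabulary

image_space = (levels ** channels) ** pixels   # (256^3)^32 possible images
caption_space = vocab ** caption_len           # 10000^20 possible captions
total = image_space * caption_space            # multi-modal generative space

print(len(str(total)), "digits")               # the count is astronomically large
```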
## inverse transform sampling
Inverse transform sampling is a method for drawing samples from a probability distribution, sometimes used in generative models.
- Compute the [[cumulative density function|CDF]] of the distribution
- Generate a random number from 0 to 1
- Using the random number, select a sample from the cdf
- Repeat to create a batch of samples
There is no magic here; it is as simple as it sounds.
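A minimal sketch of these steps for a discrete distribution, assuming NumPy (the distribution here is made up for illustration):
```python
# Minimal sketch of inverse transform sampling for a discrete distribution.
import numpy as np

values = np.array([0, 1, 2, 3])
probs = np.array([0.1, 0.2, 0.3, 0.4])   # the target distribution

cdf = np.cumsum(probs)                    # step 1: cumulative distribution function
u = np.random.rand(10)                    # step 2: uniform random numbers in [0, 1)
idx = np.searchsorted(cdf, u)             # step 3: invert the CDF to pick samples
samples = values[idx]                     # step 4: a batch of samples
print(samples)
```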
## chain rule of probabilities
The chain rule of probabilities states that the joint probability of a sequence of events is the product of the probabilities of each event given the preceding events have occurred.
$P(X_1, X_2, X_3) = P(X_1) * P(X_2 | X_1) * P(X_3 | X_1, X_2)$
This simplifies calculation of a joint probability when only conditional probabilities are available, as in an [[aggregation tree]].
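A tiny worked sketch of that product, with made-up numbers purely for illustration:
```python
# Sketch: joint probability from the chain rule (numbers are invented for illustration).
p_x1 = 0.5              # P(X1)
p_x2_given_x1 = 0.4     # P(X2 | X1)
p_x3_given_x1_x2 = 0.9  # P(X3 | X1, X2)

joint = p_x1 * p_x2_given_x1 * p_x3_given_x1_x2
print(joint)  # 0.18 = P(X1, X2, X3)
```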
## generative adversarial network
Generative adversarial networks (GANs) have been responsible for significant progress in image generation since they were introduced by [[Ian Goodfellow]] in 2014. GANs are used for image interpolation, attribute modification, style transfer, and other image editing tasks.
GANs pair a generator model with a discriminator model to learn a distribution iteratively. The discriminator learns a decision boundary between the generator's outputs and real examples. Feedback from the discriminator (but not the real examples themselves) is passed back to the generator so it produces increasingly realistic outputs, and the discriminator then learns a new decision boundary. Over time, the distribution of generated examples becomes increasingly like the distribution of real examples.
From Goodfellow's paper, "The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency."
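A minimal sketch of this generator/discriminator loop, assuming PyTorch and a toy 2-D Gaussian as the "real" data (none of these choices come from the note):
```python
# Minimal GAN training loop sketch: the generator maps noise to fake samples,
# the discriminator scores real vs. fake, and only gradients flow back to the generator.
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in "real" distribution: a 2-D Gaussian centered at (2, 2).
    return torch.randn(n, data_dim) + 2.0

for step in range(1000):
    # Discriminator step: learn to separate real from generated samples.
    real = real_batch()
    fake = generator(torch.randn(64, latent_dim)).detach()
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: feedback comes from the discriminator, never the real data.
    fake = generator(torch.randn(64, latent_dim))
    g_loss = bce(discriminator(fake), torch.ones(64, 1))  # try to be classified as real
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```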
## autoencoder
Autoencoders encode an input to a lower-dimensional representation in a hidden layer (the latent code) and then decode it back to an output with the same dimension as the input. In doing so, the autoencoder learns the salient features of the input.
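A minimal sketch, assuming PyTorch and illustrative dimensions (a flattened 28x28 input compressed to a 32-dimensional latent code):
```python
# Minimal autoencoder sketch: encode to a small latent code, decode back, and
# train by minimizing reconstruction error.
import torch
import torch.nn as nn

input_dim, latent_dim = 784, 32   # e.g. a flattened 28x28 image and a 32-d latent code

encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))

x = torch.rand(16, input_dim)             # a toy batch of inputs
z = encoder(x)                            # lower-dimensional latent code
x_hat = decoder(z)                        # reconstruction with the input's dimension
loss = nn.functional.mse_loss(x_hat, x)   # training minimizes reconstruction error
```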
## generative comparative network
As opposed to a [[generative adversarial network]], a generative comparative network learns by comparing the distribution of the generator's outputs with the distribution of real examples.
## language model
A language model calculates probabilities for string completions. A language model becomes a Large Language Model (LLM) when it has billions or trillions of parameters and/or very long input sequences (called context windows).
### autoregressive model
An autoregressive model is one in which the prediction of the next value $x_t$ depends on all of the preceding values:
$x_t = f(x_{t-1}, x_{t-2}, \ldots, x_1)$
Contrast this with an ordinary regression model that predicts a target $y$ from separate inputs $x$.
A [[Large Language Model]] is [[autoregressive]] because each token it predicts is conditioned on all of the preceding tokens it has already generated. GPT is an autoregressive language model.
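A toy sketch of the autoregressive generation loop; `next_token_probs` is a hypothetical stand-in for a trained model, but the structure (conditioning each prediction on the full prefix) is the point:
```python
# Toy autoregressive generation loop: each new token is sampled conditioned on
# everything generated so far.
import random

def next_token_probs(prefix, vocab):
    # Hypothetical stand-in for a trained model: slightly prefers repeating the
    # last token, just to show dependence on the preceding context.
    probs = {tok: 1.0 for tok in vocab}
    if prefix:
        probs[prefix[-1]] += 2.0
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}

vocab = ["the", "cat", "sat", "mat", "."]
tokens = ["the"]
for _ in range(6):
    probs = next_token_probs(tokens, vocab)   # condition on ALL preceding tokens
    nxt = random.choices(list(probs), weights=list(probs.values()))[0]
    tokens.append(nxt)
print(" ".join(tokens))
```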
## recurrent neural network
### Encoder-Decoder
An encoder-decoder model encodes an input into a vector representation and then uses a decoder, conditioned on that encoding, to produce a response. The encoder and decoder are each a [[recurrent neural network]].
Note that encoder-decoder models are regressive, not [[autoregressive]].
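A minimal sketch of an RNN encoder-decoder, assuming PyTorch, GRU cells, and toy token ids (all assumptions, not from the note):
```python
# Minimal RNN encoder-decoder sketch: the encoder compresses the input sequence
# into a hidden state, which initializes the decoder that produces the response.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 100, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
to_vocab = nn.Linear(hidden_dim, vocab_size)

src = torch.randint(0, vocab_size, (1, 7))   # source token ids (batch of 1)
tgt = torch.randint(0, vocab_size, (1, 5))   # target token ids (teacher forcing)

# Encode: the final hidden state summarizes the whole input sequence.
_, hidden = encoder(embed(src))

# Decode: the encoder's hidden state initializes the decoder, which emits the response.
out, _ = decoder(embed(tgt), hidden)
logits = to_vocab(out)                        # (1, 5, vocab_size) next-token scores
```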
## attention
The attention mechanism, specifically the query-key-value (QKV) paradigm, is used in models like Transformers to capture dependencies between words or tokens in a sequence.
For example, when we read a word “eat”, we may wonder “what is eaten” and look for other words that are “edible.” Here's how that relates to the QKV paradigm in attention:
- **Query**: This represents the item you're currently focused on or considering. In the example given, when reading "eat", the question or the context formed in the mind is "what is eaten?".
- **Key**: The keys represent what each of the other items offers in relation to the query. In this case, a word whose key says it is "edible" is relevant to the action "eat".
- **Value**: The values are the actual content you retrieve based on how well the keys match the query. If a word's "edible" key matches "what is eaten?", you retrieve that word's content (for example, "apple") and blend it into the representation of "eat".
In the [[transformer]] architecture, attention is used to group related words or tokens. The grouped tokens are fed into a feed-forward network (FFN) to generate new features. This is repeated across many layers; for example, GPT-3 stacks 96 such layers.
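A minimal sketch of scaled dot-product attention with NumPy, using random toy matrices for Q, K, and V (all shapes and values are illustrative):
```python
# Minimal scaled dot-product attention: queries are matched against keys, and the
# resulting weights mix the values into a context-aware output per token.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how well each query matches each key
    weights = softmax(scores)         # attention weights, each row sums to 1
    return weights @ V                # weighted mix of values

seq_len, d_k = 4, 8                   # e.g. 4 tokens ("the", "cat", "will", "eat")
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))   # what each token is looking for
K = rng.normal(size=(seq_len, d_k))   # what each token offers
V = rng.normal(size=(seq_len, d_k))   # the content that actually gets mixed
print(attention(Q, K, V).shape)       # (4, 8): one context-mixed vector per token
```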
## transformer
## diffusion model
Diffusion models took over from [[generative adversarial network]]s for image generation around 2020. In 2025, OpenAI released a new image generation model in ChatGPT that uses [[autoregressive]] techniques rather than diffusion, producing more realistic text within images and fewer artifacts such as the wrong number of fingers.