## Phenaki

- Capable of realistic video synthesis given a sequence of textual prompts.
- Presented as the first model that can generate videos from open-domain, time-variable prompts.
- To address the scarcity of video data, it is trained jointly on a large dataset of image-text pairs and a smaller number of video-text examples, which can yield generalization beyond what is available in the video datasets.
- Image-text datasets contain billions of examples, far more than video-text datasets.
- A further limitation is the computational cost of handling videos of variable length.
- Main components: the C-ViViT encoder, the training transformer, and the video generator.
- The encoder produces a compressed representation of videos.
- To decode, tokens are first transformed into embeddings.
- This is followed by the temporal transformer, then the spatial transformer.
- After the output of the spatial transformer, a single linear projection without activation maps the tokens back to pixel space.
- Consequently, the model generates temporally coherent and diverse videos conditioned on open-domain prompts, even when the prompt is a new composition of concepts.
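
The decoding steps in these notes (token embedding, temporal transformer, spatial transformer, linear projection to pixels) can be illustrated with a toy, pure-Python sketch. It is not the Phenaki implementation: the names `decode`, `self_attention`, `codebook`, and `proj` are hypothetical, the weights are random, and each "transformer" is reduced to a single self-attention step with no learned query/key/value weights, heads, residual connections, or normalization. The point is only the order of operations and the resulting shapes.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Toy single-head self-attention over a list of D-dim vectors.

    Assumption: queries, keys, and values are the inputs themselves
    (identity projections), which a real transformer would learn.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def decode(tokens, codebook, proj):
    """Map a [T][N] grid of token ids to [T][N][P] pixel patches.

    tokens:   T frames, N spatial positions per frame (integer ids)
    codebook: [V][D] embedding table (hypothetical, random here)
    proj:     [D][P] linear map to flattened patch pixels, no activation
    """
    T, N = len(tokens), len(tokens[0])
    # 1. Token ids -> embeddings.
    x = [[codebook[tok] for tok in frame] for frame in tokens]
    # 2. Temporal step: attend across time at each spatial position.
    for n in range(N):
        column = self_attention([x[t][n] for t in range(T)])
        for t in range(T):
            x[t][n] = column[t]
    # 3. Spatial step: attend across positions within each frame.
    x = [self_attention(frame) for frame in x]
    # 4. Single linear projection without activation, back to pixel space.
    P = len(proj[0])
    return [
        [[sum(e[d] * proj[d][p] for d in range(len(e))) for p in range(P)] for e in frame]
        for frame in x
    ]


random.seed(0)
T, N, V, D = 3, 4, 16, 8      # frames, positions per frame, vocab size, embedding dim
P = 2 * 2 * 3                 # a 2x2 RGB patch, flattened
codebook = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]
proj = [[random.gauss(0, 1) for _ in range(P)] for _ in range(D)]
tokens = [[random.randrange(V) for _ in range(N)] for _ in range(T)]

video_patches = decode(tokens, codebook, proj)
print(len(video_patches), len(video_patches[0]), len(video_patches[0][0]))
```

Each token ends up as one flattened pixel patch, so the output has the same frame and position layout as the token grid. In a real model the patches would then be rearranged into full frames.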