---
toc: true
title: BEiT
tags: ['temp']
---

# BEiT

- [BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)
- [[Self Supervised]] pre-trained representation model: Bidirectional Encoder representation from Image Transformers, built on the [[Vision Transformer]]
- uses a masked image modeling (MIM) task to pre-train vision Transformers
- each image has two views during pre-training:
    - image patches: embeddings computed as linear projections of flattened patches
    - visual tokens: produced by a discrete [[VAE]] (dVAE) that acts as an "image [[Tokenizer]]", learnt via autoencoding-style reconstruction
- the input image is tokenized into discrete visual tokens given by the latent codes of the discrete [[VAE]]
- the proposed method is critical to make [[BERT]]-like pre-training (i.e., auto-encoding with masked input) work well for image Transformers
- the model automatically acquires knowledge about semantic regions without using any human-annotated data
- pre-training randomly masks some image patches and feeds the corrupted image into the backbone [[Transformer]]
- pre-training objective: recover the original visual tokens at the masked positions from the corrupted image patches (see the sketch after this list)
- after pre-training, directly fine-tune the model parameters on downstream tasks by appending task [[Layers]] upon the pretrained encoder (see the fine-tuning sketch below)
- on [[ImageNet]]-1K classification, BEiT outperforms from-scratch [[DeiT]] training (83.2% vs. 81.8% top-1 for base-size models)
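
A minimal PyTorch-style sketch of the masked image modeling pre-training step described above. All names (`BEiTPretrain`, `mask_ratio`, `vocab_size=8192`, the model sizes) are illustrative assumptions rather than the paper's implementation; the dVAE image tokenizer is stubbed with random token ids, and the paper's blockwise masking is simplified to uniform random masking.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BEiTPretrain(nn.Module):
    """Illustrative masked image modeling (MIM) pre-training module."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 dim=768, depth=12, heads=12, vocab_size=8192):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        patch_dim = in_chans * patch_size * patch_size

        # patch embeddings: linear projection of flattened patches
        self.patch_embed = nn.Linear(patch_dim, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # head predicts a visual token id (dVAE codebook index) per masked patch
        self.to_logits = nn.Linear(dim, vocab_size)

    def patchify(self, imgs):
        # split the image into non-overlapping patches and flatten each one
        B, C, H, W = imgs.shape
        p = self.patch_size
        x = imgs.reshape(B, C, H // p, p, W // p, p)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, C * p * p)

    def forward(self, imgs, visual_tokens, mask_ratio=0.4):
        # visual_tokens: (B, num_patches) discrete ids from the image tokenizer (dVAE)
        x = self.patch_embed(self.patchify(imgs)) + self.pos_embed

        # uniform random masking as a stand-in for the paper's blockwise masking
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)

        h = self.encoder(x)
        logits = self.to_logits(h)

        # cross-entropy only on masked positions: recover the original visual tokens
        return F.cross_entropy(logits[mask], visual_tokens[mask])


if __name__ == "__main__":
    model = BEiTPretrain(dim=192, depth=4, heads=3)  # small sizes so the sketch runs quickly
    imgs = torch.randn(2, 3, 224, 224)
    # random ids stand in for the dVAE tokenizer output; real ids come from the image tokenizer
    tokens = torch.randint(0, 8192, (2, model.num_patches))
    print(model(imgs, tokens).item())
```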
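An equally illustrative sketch of the fine-tuning step, continuing from the `BEiTPretrain` class above: the MIM head is dropped and a task layer (here a linear classifier over average-pooled patch representations, which is an assumption about the classification setup) is appended on the pretrained encoder and trained end-to-end.

```python
import torch
import torch.nn as nn


class BEiTClassifier(nn.Module):
    """Illustrative downstream head: a task layer appended on the pretrained encoder."""

    def __init__(self, pretrained, num_classes):
        super().__init__()
        # reuse the pretrained patch embedding, position embedding and encoder
        self.backbone = pretrained
        self.head = nn.Linear(pretrained.pos_embed.shape[-1], num_classes)

    def forward(self, imgs):
        b = self.backbone
        x = b.patch_embed(b.patchify(imgs)) + b.pos_embed  # no masking at fine-tuning time
        h = b.encoder(x)
        return self.head(h.mean(dim=1))  # average-pool patch representations, then classify


# usage, given `model = BEiTPretrain(...)` from the sketch above:
#   classifier = BEiTClassifier(model, num_classes=1000)
#   logits = classifier(torch.randn(2, 3, 224, 224))
```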