# ViLT

- [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334)
- Vision-and-Language Transformer; seeks to improve performance on various joint vision-and-language downstream tasks
- Current approaches to VLP rely heavily on image feature extraction with convolutional visual [Embedding](Embedding.md) networks (e.g., Faster R-CNN and ResNets), which involve region supervision (e.g., object detection) and a convolutional backbone (e.g., ResNet)
- This is problematic in terms of both efficiency/speed, since extracting input [Features](Features.md) requires much more computation than the multimodal interaction steps, and expressive power, which is upper bounded by the visual embedder and its predefined visual vocabulary
- Minimal VLP model; monolithic in that visual inputs are processed in the same convolution-free manner as textual inputs
	- Removes the need for object detectors
	- Avoids heavyweight image encoders by directly [Embedding](Embedding.md) low-level pixel data with a single-layer projection, achieving similar results with reduced complexity (see the sketch below)
- Self-supervision is accomplished using (i) Image Text Matching (ITM) [loss](../Tag%20Pages/loss.md) and (ii) Masked Language Modeling (MLM) [loss](../Tag%20Pages/loss.md)
	- [ITM Loss](ITM%20Loss.md)
	- For text, ViLT simply reuses Masked Language Modeling (MLM) as used in BERT
- Pretraining datasets: [MSCOCO](MSCOCO.md), [Visual Genome](Visual%20Genome.md), [SBU Captions](SBU%20Captions.md), [Google Conceptual Captions](Google%20Conceptual%20Captions.md)
- Downstream tasks: [VQAv2](VQAv2.md), [NLVR2](NLVR2.md), [Flickr30K](Flickr30K.md)
- ViLT is over 10x faster than previous VLP models, yet achieves competitive or better downstream task performance
- Takeaway: VLP should focus more on the multi-[Modality](Modality.md) interactions inside the transformer module rather than engaging in an arms race that merely powers up unimodal embedders
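
A minimal PyTorch sketch (not the authors' implementation) of the convolution-free input pipeline described above: flattened pixel patches go through a single linear projection, text tokens through a standard embedding table, both receive modality-type and position embeddings, and the concatenated sequence is fed to one shared transformer encoder with ITM and MLM heads on top. The class name `MinimalViLT`, the hyperparameters, and the simplified pooling are illustrative assumptions.

```python
# Minimal sketch of ViLT-style single-stream VLP (illustrative, not the paper's code).
import torch
import torch.nn as nn


class MinimalViLT(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, patch=32, img_size=384,
                 max_text_len=40, layers=12, heads=12):
        super().__init__()
        self.patch = patch
        num_patches = (img_size // patch) ** 2

        # Text side: BERT-style token + position embeddings.
        self.token_embed = nn.Embedding(vocab_size, hidden)
        self.text_pos = nn.Parameter(torch.zeros(1, max_text_len, hidden))

        # Image side: a single linear projection of flattened pixel patches
        # replaces the region-feature / CNN backbone.
        self.patch_proj = nn.Linear(3 * patch * patch, hidden)
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches, hidden))

        # Modality-type embeddings distinguish text (0) from image (1) tokens.
        self.type_embed = nn.Embedding(2, hidden)

        layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=hidden * 4,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

        # Pretraining heads: ITM (match / mismatch) and MLM (vocabulary logits).
        self.itm_head = nn.Linear(hidden, 2)
        self.mlm_head = nn.Linear(hidden, vocab_size)

    def patchify(self, img):
        # (B, 3, H, W) -> (B, num_patches, 3 * P * P)
        B, C, H, W = img.shape
        P = self.patch
        x = img.reshape(B, C, H // P, P, W // P, P)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, C * P * P)

    def forward(self, text_ids, img):
        # Embed text tokens, add position and modality-type (0) embeddings.
        t = self.token_embed(text_ids) + self.text_pos[:, :text_ids.size(1)]
        t = t + self.type_embed(torch.zeros_like(text_ids))

        # Embed image patches, add position and modality-type (1) embeddings.
        v = self.patch_proj(self.patchify(img)) + self.img_pos
        v = v + self.type_embed(
            torch.ones(1, 1, dtype=torch.long, device=img.device))

        # Single transformer handles all multimodal interaction.
        h = self.encoder(torch.cat([t, v], dim=1))

        itm_logits = self.itm_head(h[:, 0])                     # pooled first text token
        mlm_logits = self.mlm_head(h[:, :text_ids.size(1)])     # logits over text positions
        return itm_logits, mlm_logits


# Usage with dummy inputs (hypothetical shapes):
model = MinimalViLT()
text = torch.randint(0, 30522, (2, 40))        # tokenized caption, possibly with [MASK]s
image = torch.randn(2, 3, 384, 384)
itm_logits, mlm_logits = model(text, image)    # (2, 2) and (2, 40, 30522)
```

In the actual pretraining setup, the ITM loss is computed by randomly replacing the paired image with a mismatched one and classifying match vs. mismatch, while the MLM loss is cross-entropy over randomly masked text tokens; both objectives share the single encoder above, which is why the heavy lifting happens in the multimodal transformer rather than in a unimodal visual embedder.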