# MLIM

- [MLIM: Vision-and-language Model Pre-training with Masked Language and Image Modeling](https://arxiv.org/abs/2109.12178)
- Vision-and-Language Pre-training (VLP) improves model performance on downstream tasks that require image and text inputs
- Typically, in addition to the [Masked Language Modeling](Masked%20Language%20Modeling.md) (MLM) loss, alignment-based objectives are used for cross-[modality](Modality.md) interaction, and RoI feature regression and classification tasks for Masked Image Region Modeling (MIRM)
    - Alignment-based objectives require image-text pairings and heuristic objective functions
    - Masking policies either do not take advantage of multi-[Modality](Modality.md) or are strictly coupled with alignments generated by other models
- Pre-trained with two tasks as a multi-loss objective, given a mini-batch of image-text pairs: a [Masked Language Modeling](Masked%20Language%20Modeling.md) (MLM) loss (as in BERT) for text and an image reconstruction (RECON) loss for the image, coupled with [Modality](Modality.md) Aware Masking (MAM)
    - MAM determines the masking probability and applies masking to both word and image [Embedding](Embedding.md)s (see the masking sketch after these notes)
- MLM, based on BERT: predict the masked words from the available words and image regions
    - Follows BERT for this task: a two-layer MLP MLM head outputs [Logits](Logits.md) over the vocabulary
    - The MLM loss is the negative log-likelihood of each masked word
- The RECON loss is an average of the pixel-wise sum of squared errors (SSE); see the loss sketch below
- Both image and word masking are realized by replacing an [Embedding](Embedding.md) with the [Embedding](Embedding.md) of `[MASK]`
    - Transformer [Layers](Layers.md) recognize `[MASK]`’s [Embedding](Embedding.md) as a special [Embedding](Embedding.md) that needs to be “filled in”, independent of the [Modality](Modality.md), by attending to the other vectors in the layer inputs
- Unlike other architectures (LXMERT, UNITER, ViLBERT, VLP, VL-BERT, VisualBERT, etc.), image masking is not based on image regions detected by an object detector; instead, a shallow CNN serves as the image embedder, which is much more lightweight than deep models like ResNet and is designed to be masking friendly (see the embedder sketch below)
- The [MLM](MLM.md) + RECON losses apply only to the masked text/image areas and measure reconstructed text and image quality
- No specific alignment loss
- [Modality](Modality.md) Aware Masking (MAM) boosts cross-[modality](Modality.md) interaction and takes advantage of the MLM and RECON losses that separately capture text and image reconstruction quality
- Since the task of finding closely-matching (CM) item pairs requires a pair of image+text inputs, they exploit this multi-[Modality](Modality.md) by employing [Modality Dropout](Modality%20Dropout.md) (see the dropout sketch below)
    - text-only, image-only, and image-text modes
- However, RECON instead of the [ITM Loss](ITM%20Loss.md) offers better PR AUC
- Similarly, using the [ITM Loss](ITM%20Loss.md) together with MLM and RECON does not change performance
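A minimal PyTorch sketch of masking-by-`[MASK]`-embedding replacement as the notes above describe it. The function name, tensor shapes, and default probabilities are illustrative assumptions; in particular, the schedule by which MAM sets the per-modality masking probabilities is not reproduced here.

```python
import torch

def mask_embeddings(text_emb, image_emb, mask_emb, p_text=0.15, p_image=0.15):
    """Replace randomly selected word/image-patch embeddings with [MASK]'s embedding.

    text_emb:  (batch, n_tokens, dim)  word embeddings
    image_emb: (batch, n_patches, dim) patch embeddings from the image embedder
    mask_emb:  (dim,)                  learned embedding of the [MASK] token
    p_text, p_image: masking probabilities; MAM would set these per modality
                     (0.15 is a placeholder, not the paper's schedule)
    """
    text_mask = torch.rand(text_emb.shape[:2], device=text_emb.device) < p_text
    image_mask = torch.rand(image_emb.shape[:2], device=image_emb.device) < p_image

    # The same [MASK] embedding is used for both modalities, so the transformer
    # can treat it as a "fill me in" signal regardless of modality.
    masked_text = torch.where(text_mask.unsqueeze(-1), mask_emb, text_emb)
    masked_image = torch.where(image_mask.unsqueeze(-1), mask_emb, image_emb)
    return masked_text, masked_image, text_mask, image_mask
```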
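A sketch of the two losses, computed only over masked positions as the notes state. The two-layer MLP head follows the note above; the GELU activation and the patch-wise pixel layout of the RECON targets are guesses at details the notes leave unspecified.

```python
import torch.nn as nn
import torch.nn.functional as F

class MLMHead(nn.Module):
    """Two-layer MLP mapping transformer outputs to vocabulary logits."""
    def __init__(self, dim, vocab_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),  # activation is an assumption; the notes only say "two-layer MLP"
            nn.Linear(dim, vocab_size),
        )

    def forward(self, hidden):   # (batch, n_tokens, dim)
        return self.net(hidden)  # (batch, n_tokens, vocab_size)

def mlm_loss(logits, target_ids, text_mask):
    """Negative log-likelihood over the masked word positions only."""
    return F.cross_entropy(logits[text_mask], target_ids[text_mask])

def recon_loss(pred_pixels, true_pixels, image_mask):
    """Average pixel-wise SSE over the masked image regions only.

    pred_pixels, true_pixels: (batch, n_patches, pixels_per_patch)
    image_mask:               (batch, n_patches) bool, True where masked
    """
    sse = ((pred_pixels - true_pixels) ** 2).sum(-1)  # SSE per patch
    return sse[image_mask].mean()                     # average over masked patches
```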
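The notes only say the image embedder is a shallow, masking-friendly CNN that is lighter than ResNet; this sketch shows the general shape of such a module, with made-up depth, kernel sizes, and channel widths.

```python
import torch.nn as nn

class ShallowImageEmbedder(nn.Module):
    """A few conv layers mapping an image to a grid of patch embeddings.

    All hyperparameters here are illustrative; the notes only constrain the
    embedder to be shallow and lightweight compared to deep models like ResNet.
    """
    def __init__(self, dim=768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(64, 256, kernel_size=2, stride=2), nn.ReLU(),
            nn.Conv2d(256, dim, kernel_size=2, stride=2),
        )

    def forward(self, images):                   # (batch, 3, H, W)
        feats = self.conv(images)                # (batch, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)  # (batch, n_patches, dim)
```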
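One possible realization of the three modes of [Modality Dropout](Modality%20Dropout.md). Whether a dropped modality is zeroed out or removed from the input sequence, and the mode probabilities, are assumptions on my part.

```python
import random
import torch

def modality_dropout(text_emb, image_emb, p_text_only=0.2, p_image_only=0.2):
    """Randomly switch an example to text-only, image-only, or image-text mode.

    Zeroing the dropped modality's embeddings is one implementation choice;
    the probabilities are placeholders, not the paper's values.
    """
    r = random.random()
    if r < p_text_only:                    # text-only mode: drop the image
        image_emb = torch.zeros_like(image_emb)
    elif r < p_text_only + p_image_only:   # image-only mode: drop the text
        text_emb = torch.zeros_like(text_emb)
    # otherwise image-text mode: keep both modalities
    return text_emb, image_emb
```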