# MLIM
- [MLIM: Vision-and-language Model Pre-training with Masked Language and Image Modeling](https://arxiv.org/abs/2109.12178)
- Vision-and-Language Pre-training (VLP) improves model performance for downstream tasks that require image and text inputs
- Typically, in addition to the [Masked Language Modeling] (MLM) loss, alignment-based objectives are used for cross-[modality](Modality.md) interaction, and RoI feature regression and classification tasks for Masked Image Region Modeling (MIRM)
- Alignment-based objectives require pairings of image and text and heuristic objective functions
- Masking policies either do not take advantage of multi-[Modality](Modality.md) or are strictly coupled with alignments generated by other models
- pre-trained using two pre-training tasks as a multi-loss objective given a mini-batch of image-text pairs: [Masked Language Modeling] (MLM) loss (as in BERT) for text, and image reconstruction (RECON) loss for image, coupled with [Modality](Modality.md) Aware Masking (MAM); see the loss sketch below
- determines the masking probability and applies masking to both word and image [Embedding](Embedding.md)s
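A minimal sketch of how the MLM and RECON losses might be combined per mini-batch, assuming PyTorch; the relative weight `recon_weight` is hypothetical and the paper's exact combination is not reproduced here:

```python
import torch

def mlim_loss(mlm_nll: torch.Tensor, recon_sse: torch.Tensor,
              recon_weight: float = 1.0) -> torch.Tensor:
    """Multi-loss pre-training objective: MLM (text) + RECON (image).

    mlm_nll: negative log-likelihood over masked words (scalar)
    recon_sse: average pixel-wise SSE over masked image areas (scalar)
    recon_weight: hypothetical weighting, not from the paper
    """
    return mlm_nll + recon_weight * recon_sse
```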
- as in BERT, the masked words are predicted from the available words and image regions
- follows BERT for this task: a two-layer MLP MLM head outputs [Logits](Logits.md) over the vocabulary
- MLM loss is the negative log-likelihood of the masked words
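A minimal sketch of such a head and the masked-only NLL in PyTorch; the GELU activation and the `ignore_index` trick are assumptions, not details from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLMHead(nn.Module):
    """Two-layer MLP mapping transformer outputs to vocabulary logits."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),  # activation choice is an assumption
            nn.Linear(hidden_dim, vocab_size),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.mlp(hidden)  # (batch, seq_len, vocab_size)

def mlm_loss(logits: torch.Tensor, target_ids: torch.Tensor,
             masked: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood restricted to masked word positions."""
    labels = target_ids.masked_fill(~masked, -100)  # ignore unmasked positions
    return F.cross_entropy(logits.transpose(1, 2), labels, ignore_index=-100)
```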
- RECON loss is an average of the pixel-wise sum of squared errors (SSE)
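A sketch of this RECON loss restricted to masked image areas; the exact normalization (here, averaging over masked pixels) is an assumption:

```python
import torch

def recon_loss(reconstructed: torch.Tensor, original: torch.Tensor,
               masked_area: torch.Tensor) -> torch.Tensor:
    """Average pixel-wise SSE over the masked image areas.

    reconstructed, original: (batch, channels, H, W)
    masked_area: boolean (batch, 1, H, W), True where the image was masked
    """
    sq_err = (reconstructed - original).pow(2) * masked_area
    return sq_err.sum() / masked_area.sum().clamp(min=1)
```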
- Both image and word masking is realized by replacing an [Embedding](Embedding.md) with the [Embedding](Embedding.md) of `[MASK]` (sketched below)
- transformer [Layers](Layers.md) recognize `[MASK]`’s [Embedding](Embedding.md) as a special [Embedding](Embedding.md) that needs to be “filled in”, independent of the [Modality](Modality.md), by attending to other vectors in the layer inputs
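A sketch of this shared masking step, assuming PyTorch; `apply_mask` is a hypothetical helper, not a name from the paper:

```python
import torch

def apply_mask(embeddings: torch.Tensor, mask_positions: torch.Tensor,
               mask_embedding: torch.Tensor) -> torch.Tensor:
    """Replace selected word or image embeddings with the [MASK] embedding.

    embeddings: (batch, seq_len, dim) word or image-region embeddings
    mask_positions: boolean (batch, seq_len), True where masking applies
    mask_embedding: (dim,) learned embedding of the [MASK] token
    """
    out = embeddings.clone()
    out[mask_positions] = mask_embedding  # same operation for both modalities
    return out
```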
- unlike other architectures (LXMERT, UNITER, ViLBERT, VLP, VL-BERT, VisualBERT, etc.), image masking is not based on image regions detected by an object detector; instead, a shallow CNN serves as the image embedder, which is much more lightweight than deep models like ResNet and is designed to be masking-friendly (see the sketch below)
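The paper's exact embedder architecture is not reproduced here; a minimal sketch of a shallow CNN that maps an image to a grid of region embeddings (depth, channel widths, kernel sizes, and strides are all assumptions):

```python
import torch
import torch.nn as nn

class ShallowImageEmbedder(nn.Module):
    """Lightweight CNN image embedder; far shallower than a ResNet."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(64, 256, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(256, embed_dim, kernel_size=2, stride=2),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (batch, 3, H, W) -> (batch, embed_dim, H/32, W/32)
        feat = self.conv(images)
        # -> (batch, num_regions, embed_dim), ready for transformer input
        return feat.flatten(2).transpose(1, 2)
```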
- [MLM](MLM.md) + RECON losses apply only to the masked text/image areas and measure reconstructed text and image quality.
- no specific alignment loss
- [Modality](Modality.md) Aware Masking (MAM) to boost cross-modality interaction and take advantage of the MLM and RECON losses that separately capture text and image reconstruction quality
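One plausible reading of MAM as per-modality Bernoulli masking; the probabilities below are hypothetical, and the paper's actual policy for choosing them is not reproduced here:

```python
import torch

# Hypothetical per-modality masking probabilities; MAM's actual policy
# for setting these values is not reproduced from the paper.
P_MASK_TEXT, P_MASK_IMAGE = 0.15, 0.30

def sample_mask(batch: int, seq_len: int, p: float) -> torch.Tensor:
    """Bernoulli mask positions for one modality."""
    return torch.rand(batch, seq_len) < p

text_mask = sample_mask(8, 32, P_MASK_TEXT)    # which word embeddings to mask
image_mask = sample_mask(8, 49, P_MASK_IMAGE)  # which image regions to mask
# Each True position is then replaced with the [MASK] embedding (apply_mask above).
```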
- Since the task of finding closely-matching (CM) item pairs requires a pair of image+text inputs, they exploit this multi-[Modality](Modality.md) by employing [Modality Dropout](Modality%20Dropout.md) (sketched below)
- text-only, image-only, and image-text modes
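A sketch of Modality Dropout as a per-mini-batch choice among the three modes; the mode probabilities, and hiding a dropped modality via `[MASK]` embeddings, are assumptions:

```python
import random
import torch

def modality_dropout(text_emb: torch.Tensor, image_emb: torch.Tensor,
                     mask_embedding: torch.Tensor,
                     p_text_only: float = 1/3, p_image_only: float = 1/3):
    """Randomly pick text-only, image-only, or image-text mode.

    A dropped modality is sketched here as being fully replaced with
    [MASK] embeddings; whether MLIM hides or omits it is an assumption.
    """
    r = random.random()
    if r < p_text_only:                    # text-only mode: hide the image
        image_emb = mask_embedding.expand_as(image_emb)
    elif r < p_text_only + p_image_only:   # image-only mode: hide the text
        text_emb = mask_embedding.expand_as(text_emb)
    # else: image-text mode, keep both modalities
    return text_emb, image_emb
```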
- However, using RECON instead of the [ITM Loss](ITM%20Loss.md) offers better PR AUC
- Similarly, using the [ITM Loss](ITM%20Loss.md) together with MLM and RECON does not change the performance