# MLM ([Masked Language Modeling](Masked Language Modeling.md))
- [from](https://publish.obsidian.md/fabian-groeger/Machine+Learning+%26+Deep+Learning/Deep+Learning/Architectures/Transformers/Masked+LM)
- 15% of the tokens in each sequence are selected for masking (in BERT, 80% of these are replaced by `[MASK]`, 10% by a random token, and 10% left unchanged); see the masking sketch after this list
- the model tries to predict the original tokens at the masked positions
- uses the context provided by the surrounding non-masked tokens in the sequence
- the [loss](../Tag%20Pages/loss.md) function only considers the predictions at the masked positions and ignores the non-masked ones
- since only 15% of the tokens contribute to the loss signal per batch, convergence is slower than with directional (left-to-right) models
- additions to the standard architecture (see the head sketch after this list):
- classification layer on top of the encoder output
	- multiplying the encoder's output vectors with the embedding matrix -> transforms them into the vocabulary dimension
	- calculating the probability of each token in the vocabulary using [Softmax](Softmax.md)
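
A minimal sketch of the masking step described above, assuming a batch of already-tokenized `input_ids`; the function name `mask_tokens` and the arguments `mask_token_id` / `vocab_size` are placeholders, not part of any specific library:

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """Select ~15% of positions for prediction; labels are set to -100
    everywhere else so the loss only covers the masked positions."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    # Bernoulli(0.15) per token decides which positions enter the loss
    masked = torch.bernoulli(torch.full(input_ids.shape, mask_prob)).bool()
    labels[~masked] = -100  # CrossEntropyLoss ignores index -100 by default
    # BERT's 80/10/10 split: 80% of selected tokens -> [MASK]
    replace = masked & torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool()
    input_ids[replace] = mask_token_id
    # half of the remainder (10% overall) -> random token
    random_tok = masked & ~replace & torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
    input_ids[random_tok] = torch.randint(vocab_size, input_ids.shape)[random_tok]
    # the final 10% stay unchanged
    return input_ids, labels
```

With the labels built this way, `nn.CrossEntropyLoss()` (whose default `ignore_index` is -100) automatically restricts the loss to the masked positions.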
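
And a sketch of the prediction head from the second list: the encoder output is projected back to vocabulary size by multiplying with the (tied) embedding matrix, followed by a softmax. The class name `MLMHead` and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MLMHead(nn.Module):
    """Classification layer on top of the encoder output, with weights
    tied to the input embedding matrix."""
    def __init__(self, embedding: nn.Embedding):
        super().__init__()
        self.embedding = embedding  # weight shape: (vocab_size, hidden_dim)

    def forward(self, encoder_output):
        # encoder_output: (batch, seq_len, hidden_dim)
        logits = encoder_output @ self.embedding.weight.T  # -> (batch, seq_len, vocab_size)
        return torch.softmax(logits, dim=-1)  # probability of each vocabulary token
```

In practice the raw `logits` (not the softmax probabilities) would be passed to `CrossEntropyLoss`, since it applies log-softmax internally; the explicit softmax here mirrors the last bullet above.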