# Language Identification
- Language identification determines which language a document is written in
- Documents can be multilingual, with the language changing at the paragraph or even the sentence level
- [Unique Character Set](Unique%20Character%20Set.md)
- [Shared Character Set](Shared%20Character%20Set.md)
- Byte range distribution is used for character set identification
- Sort the bytes in a file by frequency count and use the sorted list as a signature vector, which is compared against reference signatures via an n-gram model
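
The byte-frequency signature described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not a reference implementation. The function names (`byte_signature`, `out_of_place`, `identify`), the `top_k` cutoff, and the reference `profiles` dictionary are all hypothetical. The comparison step uses a rank-based "out-of-place" distance, which is one common way to compare frequency-ranked n-gram profiles and is an assumption here; the note itself does not name a metric. Single bytes are treated as 1-grams.

```python
from collections import Counter


def byte_signature(data: bytes, top_k: int = 32) -> list[int]:
    """Return byte values ordered from most to least frequent (hypothetical helper)."""
    counts = Counter(data)
    # Sort by descending count; break ties by byte value so the result is deterministic.
    ranked = sorted(counts, key=lambda b: (-counts[b], b))
    return ranked[:top_k]


def out_of_place(sig: list[int], ref: list[int]) -> int:
    """Sum of rank differences between two signatures (assumed distance measure).

    A byte that is absent from the reference gets the maximum penalty, len(ref).
    """
    ref_rank = {b: i for i, b in enumerate(ref)}
    max_penalty = len(ref)
    return sum(
        abs(i - ref_rank[b]) if b in ref_rank else max_penalty
        for i, b in enumerate(sig)
    )


def identify(data: bytes, profiles: dict[str, list[int]]) -> str:
    """Return the name of the reference profile closest to the data's signature."""
    sig = byte_signature(data)
    return min(profiles, key=lambda name: out_of_place(sig, profiles[name]))
```

A possible usage, with tiny made-up training samples standing in for real reference corpora:

```python
profiles = {
    "latin": byte_signature("hello world".encode("utf-8")),
    "cyrillic": byte_signature("привет мир".encode("utf-8")),
}
print(identify("привет".encode("utf-8"), profiles))
```

Real profiles would need far larger samples, and longer byte n-grams would likely separate languages that share a character set better than single bytes do.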