# Language Identification

- Identifying the language of the document
    - Documents could be multilingual at the sentence level or paragraph level too
- [Unique Character Set](Unique%20Character%20Set.md)
- [Shared Character Set](Shared%20Character%20Set.md)
- Byte Range Distribution used for Character Set Identification
    - Sort the bytes in a file by frequency count and use the sorted list as a signature vector for comparison via an n-gram model
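
The byte-frequency signature idea can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: it treats single bytes as 1-grams, and the function names (`byte_signature`, `out_of_place_distance`, `identify`), the `top_k` cutoff, and the rank-based "out-of-place" distance are all assumptions chosen for the example. The note itself does not specify a comparison measure.

```python
from __future__ import annotations

from collections import Counter


def byte_signature(data: bytes, top_k: int = 50) -> list[int]:
    """Return byte values sorted by descending frequency.

    Ties are broken by byte value so the result is deterministic.
    `top_k` is an assumed cutoff, not something the note prescribes.
    """
    counts = Counter(data)  # iterating over bytes yields ints 0..255
    ranked = sorted(counts, key=lambda b: (-counts[b], b))
    return ranked[:top_k]


def out_of_place_distance(sig: list[int], ref: list[int]) -> int:
    """Sum of rank differences between two signatures.

    A byte missing from `ref` is charged a fixed penalty of len(ref).
    This measure is an assumption made for the sketch.
    """
    ref_rank = {b: i for i, b in enumerate(ref)}
    penalty = len(ref)
    return sum(
        abs(i - ref_rank[b]) if b in ref_rank else penalty
        for i, b in enumerate(sig)
    )


def identify(data: bytes, profiles: dict[str, list[int]]) -> str:
    """Return the name of the profile whose signature is closest to `data`."""
    sig = byte_signature(data)
    return min(profiles, key=lambda name: out_of_place_distance(sig, profiles[name]))
```

In use, one would build `profiles` by calling `byte_signature` on reference files of known encoding or language, then pass an unknown file's bytes to `identify`. Longer n-grams (byte pairs or triples) would likely discriminate better than single bytes, at the cost of larger signatures.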