trie 字典树 - NoHeartPen's Digital Garden

算是[[算法与数据结构]]这篇文章主要就是想谈一个检索，[[易查]]和[《电子词典检索功能分析及其发展构想》](《电子词典检索功能分析及其发展构想》.md)里面提到的递进式检索应该就是这种模式，但问题是用[SQL 手账](SQL%20手账.md)的like模糊查询不就行了么，为什么要用这个地方提到的复杂算法呢，[非辞書](非辞書.md)也许就可以借助下这里的算法实现一个渐进式的搜索建议。另外，[GPT](GPT.md)也建议自己使用这个数据结构233 [[鹅厂面试题，英语单词拼写检查算法？]] 主要是看到了这篇文章，如果是做编辑器的话，这个东西算是 https://en.wikipedia.org/wiki/Trie# # 原文 Suppose you want to build a predictive text engine. Given a few letters, you want to predict the end of a word. Suppose we have the string "co". The next word could be: 假设您要构建一个预测文本引擎。给定几个字母，您想预测一个单词的结尾。假设我们有字符串“co”。下一个词可能是： * Cobalt 钴 * Code 法典 * Coffee 咖啡 * Co-operate 䢒 Or many other words. 或者许多其他词。 Given a dictionary of all words in the English language, you could find all the words that start with "co" and pick one to recommend. But how would you pick one to recommend? 给定英语中所有单词的词典，您可以找到所有以“co”开头的单词并选择一个进行推荐。但是你会如何选择一个推荐呢？ A good answer to this is to use word probabilities. You could have a reference text with a range of different texts from which you calculate how likely a word is to be used. You could tune this based on actual usage, too. If someone types "cobalt" a lot (I don't know why they would, but okay :D), you could rank"Cobalt"higher when choosing a word to recommend for the string"co". 一个很好的答案是使用单词概率。你可以有一个包含一系列不同文本的参考文本，从中可以计算出一个单词被使用的可能性。您也可以根据实际使用情况进行调整。如果有人经常输入“cobalt”（我不知道他们为什么会，但没关系:D），在为字符串“co”选择推荐的单词时，您可以将“Cobalt”排得更高。 We are missing one piece: how do we do this efficiently? 我们缺少一件东西：我们如何有效地做到这一点？ That is where the trie comes in, my favourite data structure. 这就是 trie 的用武之地，我最喜欢的数据结构。 What is a trie? 什么是尝试？ ------------------------ The trie is a deeply nested tree. The structure is commonly used for predictive text engines because you can represent all strings as a tree. Then, you can use word probabilities to decide on what word you should recommend. This could be done by using a reference corpus of millions of words from English publications to identify how common dfifferent words are. trie 是一棵深嵌套的树。该结构通常用于预测文本引擎，因为您可以将所有字符串表示为树。然后，您可以使用单词概率来决定应该推荐哪个单词。这可以通过使用来自英语出版物的数百万个单词的参考语料库来完成，以确定 dfifferent 单词的常见程度。 For next word prediction, a trie might be built at the letter level. This means that every letter has a value and also links to every other possible letter that could come next. 对于下一个单词预测，可以在字母级别构建 trie。这意味着每个字母都有一个值，并且还链接到接下来可能出现的所有其他可能的字母。 You can add items to, search for items, and remove items from a trie. 您可以向试用中添加项目、搜索项目以及从尝试中删除项目。 A trie, step by step 一步一步的尝试 ------------------------------ Let's mock up a trie to see how it works! 让我们模拟一下，看看它是如何工作的！ Our trie will contain two words: 我们的 trie 将包含两个词： * Cobalt 钴 * Code 法典 Consider the following trie: 请考虑以下尝试： Every possible letter is a key in our trie. Above, our trie is represented as a dictionary. In a real implementation, a trie might be represented as a set of classes, as is common in tree-based algorithms. 每一个可能的字母都是我们尝试的关键。在上面，我们的 trie 表示为字典。在实际实现中，trie 可以表示为一组类，这在基于树的算法中很常见。 Every key has two possible properties: 每个键都有两个可能的属性： 1. `*`, which is the probability the word should be recommended, and; `*` ，这是该词应该被推荐的概率，以及; 2. A dictionary that contains all possible next letters, or a number if there are no more letters. 包含所有可能的下一个字母的字典，如果没有更多字母，则包含一个数字。 Suppose we only want to get words that start with "co". Here is some pseudo code about how we would do this if our Trie was a dictionary: 假设我们只想得到以“co”开头的单词。以下是一些伪代码，说明如果我们的 Trie 是一本字典，我们将如何做到这一点： We could then recurse down all of the keys in "o" to calculate the next possible words. In this example, we can make two words: 然后，我们可以递归“o”中的所有键来计算下一个可能的单词。在这个例子中，我们可以做两个词： * Cobalt 钴 * Code 法典 We could refine our search by adding another character: 我们可以通过添加另一个字符来优化我们的搜索： The value of this would be: 其值为： The `*` whose value is 0.1 represents "cod" (as in the fish). The "e" tells us there is one more letter that can follow "cod": "e". The value for this is "7.3". `*` 值为 0.1 表示“鳕鱼”（如鱼）。“e”告诉我们还有一个字母可以跟在“cod”后面：“e”。此值为“7.3”。 Now, let's say we traverse our whole tree for"co" that we made earlier. We would end up with two values: * cobalt: 0.9 * code: 7.3 We can order these and choose the one with the highest probability for our next word. Let's make a trie! (Python edition) ----------------------------------- If you want to use a Trie in Python, I recommend the [pygtrie](https://pygtrie.readthedocs.io/en/latest/) library, originally developed by Google. This library has pre-built utilities for accessing items in a trie, traversing the Trie, and more. First, run: Then, create a new file with the following code: In this code, we create a trie with three elements: * "code" * "cobalt" * "coffee" At the end of the code, we search the trie to find all words that start with "co". Our code returns: We can sort the words by probability using this code: Here is the result: Conclusion and resources ------------------------ The trie is a tree data structure. In a trie, every node has a value for the node itself and a set of nodes that you can traverse. You can represent words as a trie to efficiently find all words that start with a given prefix. Because you can give each node a value, you can attach probabilities to each word in the trie. This is ideal for next word prediction, where you can extract candidates for a next word from the trie using letters someone has already typed and then rank them based on the probabilities associated with each node. You can also find if a word is not in the trie efficiently. To do so, you would split up a word (i.e. "codep") into its letters and search for each one `trie["c"]["o"]["d"]["e"]["p"]`. If there is no value associated with the result, you know the word is not in the tree. For a detailed breakdown of the trie data structure, refer to the [Trie Wikipedia page](https://en.wikipedia.org/wiki/Trie).