## Conclusion
It's worth paying attention to the implementation details of the mainstream [[形态素解析|morphological analysis]] tools. Also, this round is [[Sudachi]]'s victory! 2333
That said, judging from [[《形态素解析的理论和实现》]] (*Theory and Implementation of Morphological Analysis*), what really matters is the connection cost values recorded in the dictionary. Those could in principle be learned with deep learning, but only given sufficiently rich and accurate data.
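To make "connection values" concrete: in a MeCab-style lattice, each dictionary entry carries a word cost and each adjacent pair of entries a connection cost, and the analyzer picks the segmentation that minimizes the total. Below is a toy sketch with a hand-made, entirely hypothetical five-word dictionary (real analyzers use large cost matrices indexed by left/right context IDs):

```python
# Minimal MeCab-style lattice search. Each candidate word carries a word cost,
# and each adjacent pair of POS IDs carries a connection cost; the best
# segmentation is the one with the lowest total. All costs here are made up.

# candidate words: surface -> (pos_id, word_cost); pos 0 = noun-ish, 1 = particle-ish
WORDS = {
    "すもも": (0, 100), "も": (1, 50), "もも": (0, 120), "の": (1, 40), "うち": (0, 90),
}
# connection cost keyed by (previous POS, next POS); None = sentence start
CONN = {(None, 0): 0, (None, 1): 500, (0, 1): 10, (1, 0): 10, (0, 0): 300, (1, 1): 300}

def best_path(sentence):
    n = len(sentence)
    # best[i] = (total_cost, pos_of_last_word, words_so_far) for the prefix of length i
    best = {0: (0, None, [])}
    for i in range(n):
        if i not in best:
            continue
        cost_i, prev_pos, path = best[i]
        for surface, (pos, wcost) in WORDS.items():
            if sentence.startswith(surface, i):
                total = cost_i + CONN.get((prev_pos, pos), 10**6) + wcost
                j = i + len(surface)
                if j not in best or total < best[j][0]:
                    best[j] = (total, pos, path + [surface])
    return best[n][2]
```

With these costs, the classic 「すもももももももものうち」 comes out as すもも/も/もも/も/もも/の/うち: the alternating noun/particle transitions are cheap, while same-POS transitions are penalized, which is exactly the kind of preference the dictionary's connection values encode.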
## Details
[ginza](https://github.com/megagonlabs/ginza/blob/develop/ginza/analyzer.py)
```python
if self.output_format in ["2", "mecab"]:
    # in mecab output mode, GiNZA goes straight to SudachiPy
    nlp = try_sudachi_import(self.split_mode)
else:
    # Work-around for pickle error. Need to share model data.
    if self.model_name_or_path:
        nlp = spacy.load(self.model_name_or_path)
    else:
        try:
            nlp = spacy.load("ja_ginza_electra")
        except IOError as e:
            try:
                # TODO
                nlp = spacy.load("ja_ginza")
            except IOError as e:
                try:
                    nlp = spacy.load("ja_ginza_bert_large")
                except IOError as e:
                    raise OSError("E050", 'You need to install "ja-ginza" or "ja-ginza-electra" by executing `pip install ja-ginza` or `pip install ja-ginza-electra`.')
```
[spacy](https://github.com/explosion/spaCy/blob/master/spacy/lang/ja/__init__.py)
```python
class JapaneseTokenizer(DummyTokenizer):
    def __init__(self, vocab: Vocab, split_mode: Optional[str] = None) -> None:
        self.vocab = vocab
        self.split_mode = split_mode
        # this also shows that spaCy's Japanese pipeline sits on top of Sudachi
        self.tokenizer = try_sudachi_import(self.split_mode)
        # if we're using split mode A we don't need subtokens
        self.need_subtokens = not (split_mode is None or split_mode == "A")
```
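That last line is worth unpacking. Per the Sudachi README, mode A yields the shortest units (e.g. 選挙管理委員会 → 選挙/管理/委員/会), while modes B and C yield progressively longer units that spaCy may later need to re-split into subtokens. A standalone restatement of the predicate (mirroring the expression above, not spaCy's actual class):

```python
def need_subtokens(split_mode):
    # Mirrors the check in JapaneseTokenizer.__init__: the default (None)
    # and Sudachi's shortest-unit mode "A" never need subtoken tracking;
    # the longer-unit modes "B" and "C" do.
    return not (split_mode is None or split_mode == "A")

for mode in (None, "A", "B", "C"):
    print(mode, need_subtokens(mode))
```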