Mecab - NoHeartPen's Digital Garden

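[[《形态素解析的理论和实现》]] (Theory and Implementation of Morphological Analysis)

（ひとり）す… すすすっかり打ち解けて

[[UniDic]]: a dictionary for morphological analysis

## Notes to process

```python
import html

import MeCab


def get_yomi_to_ruby(tagger: MeCab.Tagger, input_text: str) -> str:
    """Get word readings from MeCab and render them as HTML ruby text.

    Args:
        tagger: A configured MeCab tagger.
        input_text: The sentence to annotate.

    Returns:
        The text with readings marked up using HTML <ruby> tags.
    """
    result = tagger.parse(input_text)
    ruby_output = ""
    for line in result.splitlines():
        # Skip the trailing "EOS" marker and any line without a tab separator
        if "\t" not in line:
            continue
        surface, feature_str = line.split("\t", 1)  # surface form, feature string
        feature = feature_str.split(",")  # split the features
        # Index 7 is the reading in IPADIC-style output; other dictionaries
        # (e.g. UniDic) order their fields differently. Fall back to the
        # surface form when no reading is present.
        reading = feature[7] if len(feature) > 7 else surface
        # Build the HTML <ruby> and <rt> tags, escaping both parts so that
        # user input cannot inject markup
        ruby_output += (
            f"<ruby>{html.escape(surface)}<rt>{html.escape(reading)}</rt></ruby>"
        )
    return ruby_output
```

```python
from fastapi import FastAPI, Form
from fastapi.responses import HTMLResponse
import MeCab

from tools.get_yomi import get_yomi_to_ruby

app = FastAPI()

# Create the MeCab tagger once and reuse it for every request
mecab = MeCab.Tagger()

# HTML form page
HTML_FORM = """
<!DOCTYPE html>
<html lang="ja">
<head>
    <meta charset="UTF-8">
    <title>Japanese Furigana Generator</title>
</head>
<body>
    <h1>Enter Japanese text to annotate</h1>
    <form action="/annotate" method="post">
        <textarea name="text" rows="4" cols="50" required></textarea><br><br>
        <input type="submit" value="Generate furigana">
    </form>
</body>
</html>
"""


# Handle requests to the root path and show the HTML form
@app.get("/", response_class=HTMLResponse)
async def read_root():
    return HTML_FORM


# Handle annotation requests
@app.post("/annotate", response_class=HTMLResponse)
async def annotate(text: str = Form(...)):
    # Parse the text and build the ruby output
    ruby_output = get_yomi_to_ruby(mecab, text)

    # Return the result page
    return f"""
    <!DOCTYPE html>
    <html lang="ja">
    <head>
        <meta charset="UTF-8">
        <title>Furigana Result</title>
    </head>
    <body>
        <h1>Furigana Result</h1>
        <div>{ruby_output}</div>
        <br>
        <a href="/">Back</a>
    </body>
    </html>
    """


# Run the app from a terminal with `uvicorn main:app --reload`
```

## Notes

[[Mecab 安装存档]] (MeCab installation archive)

[An Accessible Deep Dive into MeCab: Introduction to Japanese NLP](浅入深出%20Mecab：日语自然语言处理入门.md): still worth mentioning.

[[Mecab 实战]] (MeCab in Practice): the difference is that this one covers many real-world use cases.

[[自定义 Mecacb 输出格式]] (Customizing MeCab's output format): defines the output data at its root.

## Related applications

[[自动辞书]] (automatic dictionary)

[MeCab Installation Guide](Mecab%20安装指北.md)

[[UniDic]]

[[FastCorpus]]

## Related articles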
https://github.com/taku910/mecab

[[MeCab Yet Another Part-of-Speech and Morphological Analyzer]]: the project's official site, and the best way to get a quick overview.

[[MeCab の開発経緯]]: an introduction written by the author himself.

[[Taku Kudo]]: the developer's blog, with many related papers.

## License

The third license (BSD) allows commercial use.

>MeCab is copyrighted free software by Taku Kudo <[email protected]> and Nippon Telegraph and Telephone Corporation, and is released under any of the GPL (see the file GPL), the LGPL (see the file LGPL), or the BSD License (see the file BSD).
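
The `get_yomi_to_ruby` function in the notes to process depends on MeCab's default text output: one token per line, the surface form and a comma-separated feature string joined by a tab, and a final `EOS` line. The sketch below shows that parsing step on its own so it can be tried without MeCab installed. The helper name `parse_mecab_output`, its `reading_index` parameter, and the sample text are illustrative assumptions; the sample is hand-written in the IPADIC field layout (reading at index 7), not captured from a real run, and other dictionaries such as UniDic place the reading elsewhere.

```python
def parse_mecab_output(raw: str, reading_index: int = 7) -> list[tuple[str, str]]:
    """Split MeCab's default text output into (surface, reading) pairs.

    reading_index is an assumption: 7 matches the IPADIC field layout.
    """
    pairs = []
    for line in raw.splitlines():
        # Skip the "EOS" end-of-sentence marker and anything without a tab
        if line == "EOS" or "\t" not in line:
            continue
        surface, feature_str = line.split("\t", 1)
        features = feature_str.split(",")
        # Unknown words carry fewer fields (or "*"), so fall back to the surface
        has_reading = (
            len(features) > reading_index and features[reading_index] != "*"
        )
        pairs.append((surface, features[reading_index] if has_reading else surface))
    return pairs


# Hand-written sample in the IPADIC layout (hypothetical, for illustration only)
SAMPLE_OUTPUT = (
    "猫\t名詞,一般,*,*,*,*,猫,ネコ,ネコ\n"
    "が\t助詞,格助詞,一般,*,*,*,が,ガ,ガ\n"
    "MeCab\t名詞,一般,*,*,*,*,*\n"
    "EOS\n"
)

for surface, reading in parse_mecab_output(SAMPLE_OUTPUT):
    print(surface, reading)
```

Making the reading index a parameter keeps the dictionary-specific detail in one place, which may help when switching between IPADIC and [[UniDic]].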