Graduation Project (Thesis) Report
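Design and Implementation of a Chinese Word Segmentation Algorithm
Abstract
A key problem in processing Chinese by computer is that written Chinese does not mark word
boundaries explicitly. Because the word is a basic semantic unit, Chinese information processing
must first identify the words in a text before any further processing can take place. Drawing on
the author's practical work, this thesis analyzes several currently popular Chinese word
segmentation algorithms, including MMSeg and statistical segmentation algorithms, and compares
their strengths, weaknesses, and suitable application scenarios. Based on a real application, it
also presents a method that uses MySQL's UDF (User Defined Function) and plugin mechanisms to make
the MySQL FullText index support Chinese word segmentation.
Keywords: Chinese word segmentation, statistical language model, MMSeg, MySQL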
Chinese Segmentation Algorithm Design and
Implementation
Abstract
An important problem in computational analysis of Chinese text is that there are no word
boundaries in conventionally printed text. Since the word is a fundamental linguistic unit, it is
necessary to identify words in Chinese text so that higher-level analyses can be performed.
Drawing on the author's practical work, this thesis analyzes several popular Chinese word
segmentation algorithms, including MMSeg and statistical segmentation algorithms, and compares
their strengths, weaknesses, and application scenarios. Based on a real application, it then
presents a method that uses MySQL's UDF (User Defined Function) and plugin mechanisms to make the
MySQL FullText index support Chinese word segmentation.
Key Words: Chinese word segmentation; Statistical language models; MMSeg; MySQL