The document discusses using pattern mining techniques to detect and extract unknown words from Chinese text. It begins with an introduction to the Chinese word segmentation problem and the types of unknown words. It then reviews related work on extraction methods for particular types of unknown words and for unknown words in general. The document proposes applying continuity pattern mining to detect unknown words, then using sequential supervised learning and machine learning algorithms to extract them based on natural language and statistical information. Experimental results show that the approach performs better than rule-based methods.
1. Pattern Mining to Chinese Unknown Word Extraction. Computer Science and Information Engineering, second-year master's student, 955202037, 楊傑程, 2008/08/12
16. Flowchart, Phase 1. Phase 1 training: training data (8/10 of the balanced corpus); initial segmentation with a dictionary (Libtabe lexicon); POS tagging (TnT); pattern mining to derive detection rules. Phase 1 testing: testing data 2 (un-segmented, 1/10 of the balanced corpus); initial segmentation; POS tagging (TnT); unknown word detection with the detection rules; Phase 2 training data (label).
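The initial segmentation step in the flowchart above can be illustrated with a small sketch. The slide only says that a dictionary (the Libtabe lexicon) is used; the forward maximum matching strategy, the function name `forward_max_match`, the `max_len` parameter, and the toy lexicon below are all assumptions made for illustration, not the author's implementation.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy dictionary segmentation (assumed strategy, not from the slides).

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy lexicon standing in for the Libtabe lexicon.
toy_lexicon = {"喜歡", "音樂"}
print(forward_max_match("王小明喜歡音樂", toy_lexicon))
# prints ['王', '小', '明', '喜歡', '音樂']
```

The personal name 王小明 is not in the toy lexicon, so it comes out as three single-character tokens. Runs like this are the kind of output that the later unknown word detection step has to flag.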
21. Flowchart, Phase 2. Training: Phase 2 training data; store data (term + term_attribute + POS); calculate term frequency per document; sequential data; sliding window (positive examples: find BIES; negative examples: learn and drop); SVM training, producing three models (2-gram, 3-gram, and 4-gram SVM models). Evaluation: merging evaluation; solve overlaps and conflicts (SVM); calculate precision/recall against the correct segmentation (1/10 of the balanced corpus).
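The sliding-window and evaluation steps named on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: it reads "positive example" as a window of 2, 3, or 4 consecutive initial tokens whose merged span is exactly one word in the correct segmentation, and every other window as a negative example. All function names are hypothetical, and SVM training itself is omitted.

```python
def spans(tokens):
    """Return (start, end) character offsets for each token."""
    out, pos = [], 0
    for tok in tokens:
        out.append((pos, pos + len(tok)))
        pos += len(tok)
    return out


def bies(n):
    """B/I/E/S position tags for a word built from n tokens."""
    return ["S"] if n == 1 else ["B"] + ["I"] * (n - 2) + ["E"]


def label_windows(tokens, gold_tokens, sizes=(2, 3, 4)):
    """Slide n-token windows over the initial segmentation.

    A window is labeled positive (True) when its merged character span
    is exactly one word in the gold segmentation (assumed reading).
    """
    gold_spans = set(spans(gold_tokens))
    tok_spans = spans(tokens)
    labeled = []
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            span = (tok_spans[i][0], tok_spans[i + n - 1][1])
            labeled.append(("".join(tokens[i:i + n]), n, span in gold_spans))
    return labeled


def precision_recall(predicted, gold):
    """Precision and recall of predicted unknown words against gold ones."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


initial = ["王", "小", "明", "喜歡", "音樂"]   # toy initial segmentation
gold = ["王小明", "喜歡", "音樂"]              # toy correct segmentation
positives = [w for w, n, ok in label_windows(initial, gold) if ok]
print(positives)   # prints ['王小明']
print(bies(3))     # prints ['B', 'I', 'E']
```

One labeled set per window size matches the slide's three separate models (2-gram, 3-gram, 4-gram). In a full pipeline each set would be turned into feature vectors (term, term attribute, POS, term frequency) before SVM training.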