Pattern Mining for Chinese Unknown Word Extraction
CSIE, second-year master's student, 955202037, 楊傑程, 2008/08/12
Outline
Introduction
Introduction
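Introduction - types of unknown words
Types of Chinese unknown words:
  • Abbreviation. Ex: 中油、中大 (short forms of 中國石油 and 中央大學)
  • Proper names
      Personal names. Ex: 王小明
      Organization names. Ex: 華碩電腦 (ASUS Computer)
  • Derived words. Ex: 總經理 (general manager)、電腦化 (computerization)
  • Compounds. Ex: 電腦桌 (computer desk)、搜尋法 (search method)
  • Numeric type compounds. Ex: 1986年 (the year 1986)、19巷 (Lane 19)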
Introduction - unknown word identification
Introduction - unknown word identification
Introduction - detection and extraction
Introduction - applied techniques
Related Works - particular methods
Related Works - general methods (Rule-based)
Related Works - general methods (Statistical Model-based)
Related Works - Data
Unknown Word Detection & Extraction
Unknown Word Detection
[Figure: Phase 1 flowchart. Phase 1 training: training data (8/10 of the balanced corpus), initial segmentation with a dictionary (Libtabe lexicon), POS tagging (TnT), pattern mining to derive detection rules. Phase 1 testing: testing data 2 (un-segmented, 1/10 of the balanced corpus), initial segmentation, POS tagging (TnT), unknown word detection with the detection rules, Phase 2 training data label.]
Unknown word detection - Pattern Mining
Unknown word detection - Continuity Pattern Mining
Encoding
Create detection rules
(葡 (Na), 萄 Y) : 1
[Figure: Phase 2 flowchart. Box labels: store data (term + term_attribute + POS); Phase 2 training data; sequential data; sliding window; positive example: find BIES; negative example: learn and drop; calculate term frequency per document; SVM training; three SVM models (2-gram, 3-gram, 4-gram); solve overlap and conflict (SVM); merging evaluation; correct segmentation (1/10 of the balanced corpus); calculate precision/recall.]
Unknown Word Extraction
Positive / Negative Judgment
Data Processing - Sliding Window
EX: 3-gram Model
Input: 運動會 () ‧ () 四年 () 甲班 () 王 (?) 姿 (?) 分 (?) ‧ () 本校 () 為 () 響 () 應 ()
Windows:
  運動會 ‧ 四年 甲班 王 (?) : discard
  ‧ 四年 甲班 王 (?) 姿 (?) : negative
  四年 甲班 王 (?) 姿 (?) 分 (?) : negative
  甲班 王 (?) B 姿 (?) I 分 (?) E ‧ : positive
  王 (?) 姿 (?) 分 (?) ‧ 本校 : negative
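The 3-gram example above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation. The function names (`label_windows`, `bies_tags`) and the labeling rule are assumptions: a window is discarded when none of its three core tokens carries a (?) mark, labeled positive when the core tokens concatenate to a known correct word, and labeled negative otherwise. Each window holds one prefix token, the n core tokens, and one suffix token.

```python
from typing import List, Set, Tuple


def bies_tags(n: int) -> List[str]:
    """Position tags for an n-token word: S for a single token, else B, I..., E."""
    if n == 1:
        return ["S"]
    return ["B"] + ["I"] * (n - 2) + ["E"]


def label_windows(
    tokens: List[str],
    flagged: List[bool],
    gold_words: Set[str],
    n: int = 3,
) -> List[Tuple[List[str], str]]:
    """Slide an n-token core (plus one prefix and one suffix token) over a sentence.

    Assumed rule (hypothetical, for illustration only):
      discard  - no core token is flagged with (?)
      positive - the core tokens concatenate to a word in gold_words
      negative - anything else
    """
    results = []
    # start is the index of t1; one token is reserved on each side
    for start in range(1, len(tokens) - n):
        core = tokens[start:start + n]
        window = tokens[start - 1:start + n + 1]
        if not any(flagged[start:start + n]):
            label = "discard"
        elif "".join(core) in gold_words:
            label = "positive"
        else:
            label = "negative"
        results.append((window, label))
    return results


# Token sequence from the slide; only 王, 姿, 分 carry the (?) mark
tokens = ["運動會", "‧", "四年", "甲班", "王", "姿", "分", "‧", "本校", "為", "響", "應"]
flagged = [tok in ("王", "姿", "分") for tok in tokens]
gold_words = {"王姿分"}  # assumed correct segmentation for this example

windows = label_windows(tokens, flagged, gold_words, n=3)
for window, label in windows[:5]:
    print(" ".join(window), "->", label)
```

Under these assumptions the first five windows receive the same labels as in the example (discard, negative, negative, positive, negative), and `bies_tags(3)` gives the B, I, E tags shown on the positive window.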
Statistical Information
prefix (0) t1 t2 t3 suffix (4)
Experiments
Unknown Word Detection
Threshold (Accuracy)  Precision  Recall  F-measure (our system)  F-measure (AS system)
0.7                   0.9324     0.4305  0.589035                0.71250
0.8                   0.9008     0.5289  0.66648                 0.752447
0.9                   0.8343     0.7148  0.769941                0.76955
0.95                  0.764      0.8288  0.795082                0.76553
0.98                  0.686      0.8786  0.770446                0.744036
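The "F-measure (our system)" values above are consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below recomputes them from the precision and recall columns. The function name `f_measure` is illustrative and not taken from the slides.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (threshold, precision, recall) copied from the detection results
rows = [
    (0.70, 0.9324, 0.4305),
    (0.80, 0.9008, 0.5289),
    (0.90, 0.8343, 0.7148),
    (0.95, 0.7640, 0.8288),
    (0.98, 0.6860, 0.8786),
]

for threshold, p, r in rows:
    print(f"threshold={threshold:.2f}  F={f_measure(p, r):.6f}")
```

Raising the threshold trades precision for recall, and the recomputed F-measure peaks at the 0.95 threshold, matching the reported 0.795082.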
Unknown Word Extraction
Unknown Word Extraction
Testing result
SVM testing result
N-gram                        F1 score  Precision  Recall
Only 4-gram                   0.164     0.1        0.57
Only 3-gram                   0.377     0.257      0.70
Only 2-gram                   0.587     0.492      0.73
Three n-gram models combined  0.524     0.457      0.614
Ongoing Experiments
inst#  actual  predicted  error  prediction
1      2:-1    2:-1       -      0.984
2      1:1     1:1        -      0.933
...
116    2:-1    1:1        +      0.505
Gram  Sample by                          Algorithm (inside)  Precision  Recall  F-Measure
2     P:N = 1:2                          Libsvm              0.637      0.716   0.674
2     Confidence=0.95 + error + all p    Libsvm              0.759      0.612   0.678
3     P:N = 1:4                          Libsvm              0.717      0.722   0.72
3     Confidence=0.97 + all p            Libsvm              0.829      0.674   0.743
3     Confidence=0.97 + all p            Bagging (SMO)       0.825      0.688   0.75
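The surviving text does not define the "Confidence=0.95 + error + all p" and "Confidence=0.97 + all p" sampling settings. One possible reading, which is only an assumption, is that the training sample keeps every positive instance ("all p"), the negatives predicted with at least the stated confidence, and optionally the misclassified instances ("error"). The sketch below parses the prediction listing format shown above and applies that reading. The names `Prediction`, `parse_predictions` and `select` are hypothetical.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    inst: int
    actual: str       # e.g. "1:1" (positive) or "2:-1" (negative)
    predicted: str
    is_error: bool    # True when the error column is "+"
    confidence: float


def parse_predictions(text: str) -> List[Prediction]:
    """Parse rows of the form: inst# actual predicted error prediction."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5 or not parts[0].isdigit():
            continue  # skip the header and any elided lines
        inst, actual, predicted, error, conf = parts
        rows.append(Prediction(int(inst), actual, predicted, error == "+", float(conf)))
    return rows


def select(preds: List[Prediction], threshold: float, include_errors: bool = False) -> List[Prediction]:
    """Assumed sampling rule: all positives, confident instances, optionally errors."""
    chosen = []
    for p in preds:
        is_positive = p.actual.split(":")[1] == "1"
        if is_positive or p.confidence >= threshold or (include_errors and p.is_error):
            chosen.append(p)
    return chosen


# The three rows that survive in the slide
sample = """\
inst#  actual  predicted  error  prediction
1  2:-1  2:-1  -  0.984
2  1:1  1:1  -  0.933
116  2:-1  1:1  +  0.505
"""

preds = parse_predictions(sample)
print([p.inst for p in select(preds, 0.97)])
print([p.inst for p in select(preds, 0.95, include_errors=True)])
```

Under this reading, a higher confidence threshold keeps fewer and cleaner negatives, which would be in line with the higher precision of the confidence-based rows compared with the fixed P:N ratio rows.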
