This document provides an overview of the OpenNLP natural language processing tool. It discusses the various NLP tasks that OpenNLP can perform, including tokenization, POS tagging, named entity recognition, chunking, parsing, and co-reference resolution. It also describes how models for these tasks are trained in OpenNLP using annotated training data. The document concludes by listing some advantages and limitations of OpenNLP.
5. Huge amount of Data
Classify text into Categories
Index and Search Large Text
Automatic Translation
Speech Understanding
Information Extraction
Automatic Summarization
Question Answering
Natural Language Processing
6. “Natural Language Processing is a theoretically
motivated range of computational techniques for
analyzing and representing naturally occurring texts
at one or more levels of linguistic analysis for the
purpose of achieving human-like language
processing for a range of tasks or applications”
(Liddy, 2001)
Natural language: a language spoken by people, e.g. English or Hindi,
as opposed to an artificial language such as Java
7. Computer Science: Database, AI, Algorithms, …
AI: Robotics, NLP, Search
NLP: Information Retrieval, Language Analysis, Translation
10. Text Based Application
Dialogue Based Application
Speech Recognition (e.g. IBM VoiceType Dictation)
Spoken Language Systems (e.g. Dragon, Operetta)
Language Translation
15. Electronic text is a linear sequence of symbols
Before any real text processing, the text needs to be segmented
This is Tokenization. This segments the sentence
Segmented Text
Abbreviations
Hyphenated Words
Numerical and Special Expressions
16. Electronic text is a linear sequence of symbols
Before any real text processing, the text needs to be segmented
This | is | Tokenization. | This | segments | the | sentence
Segmented Text
Abbreviations
Hyphenated Words
Numerical and Special Expressions
17. NLP Task
POS Tagging
POS tagging is the process of marking up a word in a text as
corresponding to a particular part of speech, based on both
its definition and its context
18. POS tagging is also called grammatical tagging or word-category disambiguation
Identification of words as nouns, verbs, adjectives, adverbs…
Tag   Meaning
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
FW    Foreign word
JJ    Adjective
JJR   Adjective, comparative
NN    Noun
VB    Verb
VBD   Verb, past tense
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
SYM   Symbol
NNP   Proper noun
20. NLP Task
Named Entity Extraction
Named-entity recognition (NER) is a subtask of information
extraction that seeks to locate and classify elements in text into
pre-defined categories such as the names of persons,
organizations, locations, expressions of times, quantities,
monetary values, percentages, etc.
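As a rough illustration of "locate and classify", the sketch below finds two of the categories named above (monetary values and percentages) with hand-written regular expressions. This is a rule-based stand-in chosen for brevity, not how OpenNLP's name finder works; the patterns, the `PATTERNS` table and the `find_entities` helper are all assumptions made for this example.

```python
import re

# Hypothetical patterns for two of the categories named above.
# A statistical NER model learns these decisions from annotated data instead.
PATTERNS = {
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    "percentage": re.compile(r"\d+(?:\.\d+)?%"),
}

def find_entities(text):
    """Return (start, end, category, surface_text) for every pattern match."""
    spans = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), category, match.group()))
    return sorted(spans)  # order the spans by position in the text

sample = "Profits rose 12.5% to $3,400 last year."
for start, end, category, surface in find_entities(sample):
    print(f"{category:<10} [{start}:{end}] {surface}")
```

Each result carries both a location (the character span) and a class label, which is the shape of output an NER component produces regardless of how the decision is made.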
27. OpenNLP is a library for Natural Language Processing
Open source and developed by the Apache Software Foundation
Stable Release 1.5.3 in 2013
Java Based and Cross Platform
28. OpenNLP is capable of performing NLP tasks
OpenNLP provides APIs for these NLP tasks
[Diagram: input text passes through Segmentation, Tokenization, POS Tagging, NER, Chunking, Parsing and Co-reference resolution]
36. Tokenization
Three tokenizers: Whitespace, Simple, Learnable
Whitespace tokenizer: non-whitespace sequences are identified as tokens
Simple tokenizer: a character class tokenizer; sequences of the same character class are tokens
Learnable tokenizer: a maximum entropy tokenizer; detects token boundaries based on a probability model
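The two rule-based strategies can be sketched in a few lines. This is a minimal Python illustration of the ideas only, not OpenNLP's Java implementation; the character classes used here (letters, digits, whitespace, everything else) and both function names are assumptions of this sketch. The learnable tokenizer is not shown because it needs a trained model.

```python
def whitespace_tokenize(text):
    """Every maximal run of non-whitespace characters is one token."""
    return text.split()

def char_class(ch):
    """Coarse character classes assumed for this sketch."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "other"

def char_class_tokenize(text):
    """Start a new token whenever the character class changes; drop whitespace."""
    tokens, current, current_class = [], "", None
    for ch in text:
        cls = char_class(ch)
        if cls != current_class and current:
            tokens.append(current)
            current = ""
        current_class = cls
        if cls != "space":
            current += ch
    if current:
        tokens.append(current)
    return tokens

sentence = "This is Tokenization. It costs $3.50"
print(whitespace_tokenize(sentence))
print(char_class_tokenize(sentence))
```

The whitespace version keeps `Tokenization.` and `$3.50` as single tokens, while the character-class version splits off the full stop and breaks the price into four pieces. Neither is right for every case, which is the motivation for a tokenizer that learns boundaries from data.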
40. The POS tagger expects a tokenized sentence as input, represented as a String array
Each String object in the array is one token
The output is an array of POS tags, one for each token
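The input and output shapes described here can be illustrated with a toy tagger. The sketch below is Python rather than OpenNLP's Java API, and the lookup table, the `tag` function and the fallback to `NN` for unknown words are all assumptions made for illustration; a trained model would choose tags from context rather than from a fixed dictionary.

```python
# Hypothetical lookup table; a trained model would score tags using context instead.
TOY_LEXICON = {"the": "DT", "dog": "NN", "barked": "VBD", "loudly": "RB"}

def tag(tokens):
    """Return one tag per token, in the same order as the input."""
    # Unknown words fall back to NN, an arbitrary choice for this sketch.
    return [TOY_LEXICON.get(token.lower(), "NN") for token in tokens]

tokens = ["The", "dog", "barked", "loudly"]  # one array element per token
tags = tag(tokens)                           # parallel array of tags

for token, pos in zip(tokens, tags):
    print(f"{token}_{pos}")
```

The point of the parallel arrays is that position `i` in the tag array always describes position `i` in the token array, so the two can be zipped together without any further bookkeeping.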
42. Document Categorizer: classifies text into predefined categories
Based on the maximum entropy model
Unlike for other tasks, OpenNLP does not provide a pre-trained model for
document categorization
To use this facility, a model must first be built (trained)
44. Open a sample data stream
Call SentenceDetectorME.train
Save the SentenceModel
45. Open a sample data stream
Call TokenizerME.train
Save the TokenizerModel
46. The application must open a sample data stream
Call the POSTaggerME.train method
Training Data Format: About_IN 10_CD Euro_NNP
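The word_TAG layout of the training line is simple to read back into pairs. The following is a small Python sketch of that parsing step, not part of OpenNLP; the function name `parse_tagged_line` is hypothetical, and splitting on the last underscore is an assumption made so that words containing an underscore stay intact.

```python
def parse_tagged_line(line):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last underscore so words that contain '_' stay intact.
        word, sep, tag = item.rpartition("_")
        if not sep or not word or not tag:
            raise ValueError(f"not in word_TAG form: {item!r}")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged_line("About_IN 10_CD Euro_NNP")
for word, tag in pairs:
    print(word, tag)
```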
47. The parser can be trained on annotated training material
The data can be in the OpenNLP format
Training Data Format:
(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))
(TOP (S (NP-SBJ (PRP I) )(VP (VBP say) (NP (CD 1992) ))(. .) ('' '') ))
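To make the bracketed layout concrete, the sketch below reads one such line into nested lists and then collects the (tag, word) pairs at the leaves. It is a minimal Python illustration written for this example, not OpenNLP code; `parse_tree` and `leaves` are hypothetical helper names, and the sketch assumes well-formed input with balanced brackets.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")

def parse_tree(text):
    """Parse one bracketed sentence into nested lists: [label, child, ...]."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1  # consume ')'
        return [label] + children

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing input after the tree")
    return tree

def leaves(tree):
    """Collect (tag, word) pairs from the preterminal nodes, left to right."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return [(label, children[0])]
    pairs = []
    for child in children:
        if isinstance(child, list):
            pairs.extend(leaves(child))
    return pairs

line = "(TOP (S (NP-SBJ (DT Some) )(VP (VBP say) (NP (NNP November) ))(. .) ))"
tree = parse_tree(line)
print(leaves(tree))
```

Reading the leaves back out shows that each training sentence carries its own tokenization and POS tags in addition to the phrase structure.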
48. The Document Categorizer can be trained on annotated training material
The data can be in the OpenNLP Document Categorizer training format
Training Data Format:
Computer Science is the study of computers and computational
systems. Unlike electrical and computer engineers,
computer scientists deal mostly with software and
software systems; this includes their theory, design,
development, and application.
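The sample above shows only the document text. The sketch below assumes the one-document-per-line layout described in the OpenNLP manual, where each line starts with the category, followed by whitespace and then the text. The category label `ComputerScience` and the helper `parse_doccat_line` are hypothetical names introduced for this illustration, and the code is Python rather than OpenNLP's Java API.

```python
def parse_doccat_line(line):
    """Split one training line into (category, text)."""
    parts = line.strip().split(None, 1)  # split once, on the first whitespace run
    if len(parts) != 2:
        raise ValueError("expected '<category> <text>'")
    category, text = parts
    return category, text

# 'ComputerScience' is a hypothetical label chosen for this sketch.
line = ("ComputerScience Computer Science is the study of computers "
        "and computational systems.")
category, text = parse_doccat_line(line)
print(category)
print(text)
```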
57. Open Source Tool
Easy to Install and Use
Multilingual Model Facility (English, Spanish, Thai, etc.)
Easy Development of Model
Cross Platform
Document categorization
59. References:
Avram, S., Caragea, D. and Borangiu, T. (2014). NLP applications in
external plagiarism detection. U.P.B. Sci. Bull., Series C,
76(3): 29-36.
Benjamin, C. M. X., Mahmud, R., Qiang, L., Sadanandan, A. A.,
Onn, K. W. and Lukose, D. (2014). Malay Semantic Text
Processing Engine. In: Proceedings of the International
Conference on Information, Process, and
Knowledge Management, pp. 38-43.
Liu, F., Vasardani, M. and Baldwin, T. (2012). Automatic Identification
of Locative Expressions from Social Media Text: A
Comparative Analysis. International Journal of Computer
Applications, 10, 150-156.
61. References:
Liddy, E. D. (2001). Natural Language Processing. In: Encyclopedia
of Library and Information Science, 2nd Ed. Marcel
Decker, Inc., pp. 362-386.
Michael, H., Jerald, L., Huanying, G. and Paolo, G. (2014).
Privacy-Preserving Symptoms-to-Disease Mapping on Smartphones.
Mobile and Information Technologies in Medicine, 10, 350-354.