Part of Speech (POS) Tagging
K.A.S.H. Kulathilake
B.Sc.(Sp.Hons.)IT, MCS, MPhil, SEDA(UK)
Closed Class Vs. Open Class
• Parts-of-speech can be divided into two broad categories:
closed class types and open class types.
• Closed classes are those with relatively fixed membership,
such as prepositions—new prepositions are rarely coined.
• By contrast, nouns and verbs are open classes—new nouns
and verbs like iPhone or to fax are continually being created
or borrowed.
• Any given speaker or corpus may have different open class
words, but all speakers of a language, and sufficiently large
corpora, likely share the set of closed class words.
• Closed class words are generally function words like
of, it, and, or you, which tend to be very short, occur
frequently, and often have structuring uses in grammar.
Closed Class Vs. Open Class (Cont…)
• Open Classes
– Four major open classes occur in the languages of
the world:
• nouns,
• verbs,
• adjectives and
• adverbs.
Closed Class Vs. Open Class (Cont…)
• Nouns
– The syntactic class noun includes the words for most
people, places, or things, but others as well.
– Nouns include concrete terms like ship and chair,
abstractions like bandwidth and relationship, and
verb-like terms like pacing as in His pacing to and fro
became quite annoying.
– What defines a noun in English, then, are things like
its ability to occur with determiners (a goat, its
bandwidth, Plato’s Republic), to take possessives
(IBM’s annual revenue), and for most but not all
nouns to occur in the plural form (goats, abaci).
Closed Class Vs. Open Class (Cont…)
– Open class nouns fall into two classes.
– Proper nouns:
• like Regina, Colorado, and IBM, are names of specific persons or entities.
• In English, they generally aren’t preceded by articles (e.g., the book is upstairs,
but Regina is upstairs).
• In written English, proper nouns are usually capitalized.
– Common nouns:
• Common nouns are divided in many languages, including English, into count
nouns and mass nouns.
• Count nouns allow grammatical enumeration, occurring in both the singular
and plural (goat/goats, relationship/relationships) and they can be counted
(one goat, two goats).
• Mass nouns are used when something is conceptualized as a homogeneous
group.
• So words like snow, salt, and communism are not counted (i.e., *two snows or
*two communisms).
• Mass nouns can also appear without articles where singular count nouns
cannot (Snow is white but not *Goat is white).
Closed Class Vs. Open Class (Cont…)
• Verb
– The verb class includes most of the words
referring to actions and processes, including main
verbs like draw, provide, and go.
– English verbs have inflections (non-third-person-sg
(eat), third-person-sg (eats), progressive (eating),
past participle (eaten)).
Closed Class Vs. Open Class (Cont…)
• Adjective:
– A class that includes many terms for properties or
qualities.
– Most languages have adjectives for the concepts of
color (white, black), age (old, young), and value (good,
bad), but there are languages without adjectives.
– In Korean, for example, the words corresponding to
English adjectives act as a subclass of verbs, so what is
in English an adjective “beautiful” acts in Korean like a
verb meaning “to be beautiful”.
Closed Class Vs. Open Class (Cont…)
• Adverb:
– The final open class form, adverbs, is rather a
hodge-podge, both semantically and formally.
– In the following sentence from Schachter (1985)
all the italicized words are adverbs:
– Unfortunately, John walked home extremely
slowly yesterday.
Closed Class Vs. Open Class (Cont…)
– What coherence the class has semantically may be solely that
each of these words can be viewed as modifying something
(often verbs, hence the name “adverb”, but also other adverbs
and entire verb phrases).
– Directional adverbs or locative adverbs (home, here, downhill)
specify the direction or location of some action;
– degree adverbs (extremely, very, somewhat) specify the extent
of some action, process, or property;
– manner adverbs (slowly, slinkily, delicately) describe the manner
of some action or process;
– and temporal adverbs describe the time that some action or
event took place (yesterday, Monday).
– Because of the heterogeneous nature of this class, some
adverbs (e.g., temporal adverbs like Monday) are tagged in
some tagging schemes as nouns.
Closed Class Vs. Open Class (Cont…)
• Closed Class
– The closed classes differ more from language to
language than do the open classes.
– Some of the important closed classes in English
include:
• prepositions: on, under, over, near, by, at, from, to, with
• determiners: a, an, the
• pronouns: she, who, I, others
• conjunctions: and, but, or, as, if, when
• auxiliary verbs: can, may, should, are
• particles: up, down, on, off, in, out, at, by
• numerals: one, two, three, first, second, third
The Penn Treebank Part-of-Speech
Tagset
• While there are many lists of parts-of-speech, most
modern language processing on English uses the 45-tag
Penn Treebank tagset.
• Parts-of-speech are generally represented by placing
the tag after each word, delimited by a slash, as in the
following examples:
– The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
– There/EX are/VBP 70/CD children/NNS there/RB
– Preliminary/JJ findings/NNS were/VBD reported/VBN in/IN
today/NN ’s/POS New/NNP England/NNP Journal/NNP
of/IN Medicine/NNP ./.
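For instance, NLTK's pretrained English tagger emits Penn Treebank tags in this same word/TAG notation. A minimal sketch, assuming NLTK is installed and its tokenizer and tagger models have been downloaded:

```python
import nltk

# One-time setup:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The grand jury commented on a number of other topics.")
tagged = nltk.pos_tag(tokens)  # list of (word, Penn Treebank tag) pairs

print(" ".join(f"{w}/{t}" for w, t in tagged))
# Typically: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
#            number/NN of/IN other/JJ topics/NNS ./.
```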
The Penn Treebank Part-of-Speech
Tagset (Cont…)
[Table not reproduced: the 45-tag Penn Treebank tagset.]
The Penn Treebank Part-of-Speech
Tagset (Cont…)
• Corpora labeled with parts-of-speech like the Treebank
corpora are crucial training (and testing) sets for
statistical tagging algorithms.
• Three main tagged corpora are consistently used for
training and testing part-of-speech taggers for English:
– The Brown corpus is a million words of samples from 500
written texts from different genres published in the United
States in 1961.
– The WSJ corpus contains a million words published in the
Wall Street Journal in 1989.
– The Switchboard corpus consists of 2 million words of
telephone conversations collected in 1990-1991.
The Penn Treebank Part-of-Speech
Tagset (Cont…)
• The corpora were created by running an automatic
part-of-speech tagger on the texts and then human
annotators hand-corrected each tag.
• Tagging algorithms assume that words have been
tokenized before tagging.
• The Penn Treebank and the British National Corpus
split contractions and the ’s-genitive from their stems:
– would/MD n’t/RB
– children/NNS ’s/POS
• Indeed, the special Treebank tag POS is used only for
the morpheme ’s, which must be segmented off during
tokenization.
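NLTK's Treebank-style tokenizer performs exactly this splitting; a small sketch under the same setup assumption as the earlier example:

```python
import nltk  # assumes the punkt tokenizer model is downloaded

# Contractions are split from their stems...
print(nltk.word_tokenize("He wouldn't go."))
# ['He', 'would', "n't", 'go', '.']

# ...and so is the 's-genitive, ready to receive the POS tag.
print(nltk.word_tokenize("the children's books"))
# ['the', 'children', "'s", 'books']
```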
The Penn Treebank Part-of-Speech
Tagset (Cont…)
• Another tokenization issue concerns multipart
words.
• The Treebank tagset assumes that tokenization of
words like New York is done at whitespace.
• The phrase a New York City firm is tagged in
Treebank notation as five separate words: a/DT
New/NNP York/NNP City/NNP firm/NN.
• The C5 tagset for the British National Corpus, by
contrast, allows prepositions like “in terms of” to
be treated as a single word by adding numbers to
each tag, as in in/II31 terms/II32 of/II33.
POS Tagging
• Part-of-speech tagging (tagging for short) is the
process of assigning a part-of-speech marker to
each word in an input text.
• Because tags are generally also applied to
punctuation, tokenization is usually performed
before, or as part of, the tagging process:
separating commas, quotation marks, etc., from
words and disambiguating end-of-sentence
punctuation (period, question mark, etc.) from
part-of-word punctuation (such as in
abbreviations like e.g. and etc.).
POS Tagging (Cont…)
• Tagging is a disambiguation task; words are ambiguous—
have more than one possible part-of-speech— and the goal
is to find the correct tag for the situation.
• For example, the word book can be a verb (book that flight)
or a noun (as in hand me that book).
• That can be a determiner (Does that flight serve dinner) or
a complementizer (I thought that your flight was earlier).
• The problem of POS-tagging is to resolve these ambiguities,
choosing the proper tag for the context.
• Part-of-speech tagging is thus one of the many
disambiguation tasks in language processing.
POS Tagging (Cont…)
• Here are some examples of the 6 different
parts-of-speech for the word back:
– earnings growth took a back/JJ seat
– a small building in the back/NN
– a clear majority of senators back/VBP the bill
– Dave began to back/VB toward the door
– enable the country to buy back/RP debt
– I was twenty-one back/RB then
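These contexts can be probed with a pretrained tagger; a sketch assuming NLTK and its models are available (the perceptron tagger is statistical, so its choices won't necessarily match every gold tag above):

```python
import nltk  # assumes punkt and averaged_perceptron_tagger are downloaded

contexts = [
    "earnings growth took a back seat",
    "a small building in the back",
    "a clear majority of senators back the bill",
    "Dave began to back toward the door",
    "I was twenty-one back then",
]

for sent in contexts:
    tagged = nltk.pos_tag(nltk.word_tokenize(sent))
    # Report only the tag assigned to "back" in each context.
    print([f"{w}/{t}" for w, t in tagged if w == "back"])
```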
POS Tagging (Cont…)
• POS Techniques
– Rule-Based: Human crafted rules based on lexical and
other linguistic knowledge.
– Stochastic: Trained on human-annotated corpora like
the Penn Treebank. Statistical models include the
Hidden Markov Model (HMM), the Maximum Entropy
Markov Model (MEMM), and the Conditional Random
Field (CRF).
– Transformation-Based Tagging: learns tagging rules
automatically from a tagged corpus (e.g., the Brill
tagger). Generally, learning-based approaches have
been found to be more effective overall, taking into
account the total amount of human expertise and
effort involved.
HMM POS Tagging
• When we apply HMMs to part-of-speech tagging, we
generally don’t use the Baum-Welch algorithm to
learn the HMM parameters.
• Instead, HMMs for part-of-speech tagging are trained
on a fully labeled dataset—a set of sentences with
each word annotated with a part-of-speech tag—
setting parameters by maximum likelihood estimates
on this training data.
• Thus the only algorithm we will need is the Viterbi
algorithm for decoding, and we will also need to see
how to set the parameters from training data.
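A minimal sketch of this supervised estimation step, using a toy hand-labeled corpus (the corpus, tag inventory, and function names here are illustrative, not from any particular library):

```python
from collections import Counter, defaultdict

# A toy fully labeled corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("Janet", "NNP"), ("will", "MD"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("the", "DT"), ("bill", "NN"), ("is", "VBZ"), ("back", "RB")],
]

transition_counts = defaultdict(Counter)  # C(t_{i-1}, t_i)
emission_counts = defaultdict(Counter)    # C(t_i, w_i)

for sentence in corpus:
    prev = "<s>"  # start-of-sentence pseudo-tag
    for word, tag in sentence:
        transition_counts[prev][tag] += 1
        emission_counts[tag][word] += 1
        prev = tag

def transition_prob(prev, tag):
    # Maximum likelihood estimate P(tag | prev) = C(prev, tag) / C(prev)
    total = sum(transition_counts[prev].values())
    return transition_counts[prev][tag] / total if total else 0.0

def emission_prob(tag, word):
    # Maximum likelihood estimate P(word | tag) = C(tag, word) / C(tag)
    total = sum(emission_counts[tag].values())
    return emission_counts[tag][word] / total if total else 0.0

print(transition_prob("<s>", "NNP"))  # 0.5 on this toy corpus
print(emission_prob("NN", "bill"))    # 1.0 on this toy corpus
```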
The Basic Equation of HMM Tagging
• Let’s begin with a quick reminder of the intuition of HMM decoding.
• The goal of HMM decoding is to choose the tag sequence that is
most probable given the observation sequence of n words $w_1^n$:

$${t'}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n)$$

• by using Bayes’ rule to instead compute:

$${t'}_1^n = \operatorname{argmax}_{t_1^n} \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}$$

• Furthermore, we simplify the above equation by dropping the
denominator $P(w_1^n)$, which is the same for every candidate tag sequence:

$${t'}_1^n = \operatorname{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)$$
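To make the argmax concrete: in principle it could be computed by brute force, scoring all $|T|^n$ candidate tag sequences. A sketch (joint_prob is a placeholder for however $P(w_1^n \mid t_1^n)\,P(t_1^n)$ is estimated); the assumptions below and the Viterbi algorithm avoid this exponential enumeration:

```python
from itertools import product

def brute_force_decode(words, tagset, joint_prob):
    # Score every one of the |tagset|**len(words) tag sequences, keep the best.
    # joint_prob(words, tags) should return P(w_1..n | t_1..n) * P(t_1..n).
    best_tags, best_score = None, -1.0
    for tags in product(tagset, repeat=len(words)):
        score = joint_prob(words, tags)
        if score > best_score:
            best_tags, best_score = list(tags), score
    return best_tags
```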
The Basic Equation of HMM Tagging
(Cont…)
• HMM taggers make two further simplifying assumptions.
• The first is that the probability of a word appearing
depends only on its own tag and is independent of
neighboring words and tags:
$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$
• The second assumption, the bigram assumption, is that the
probability of a tag depends only on the previous tag,
rather than the entire tag sequence:

$$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$
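Under these two assumptions, the joint probability of a (word sequence, tag sequence) pair reduces to a simple product; a sketch, with the probability functions passed in (for instance, the toy transition_prob and emission_prob estimated in the counting sketch above):

```python
def sequence_score(words, tags, transition_prob, emission_prob):
    # P(w_1..n, t_1..n) ~ product over i of P(w_i | t_i) * P(t_i | t_{i-1}),
    # with a <s> pseudo-tag standing in for t_0.
    score, prev = 1.0, "<s>"
    for word, tag in zip(words, tags):
        score *= emission_prob(tag, word) * transition_prob(prev, tag)
        prev = tag
    return score
```

Binding the two probability functions (e.g., with functools.partial) turns this into exactly the joint_prob that the brute-force decoder sketched earlier maximizes.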
The Basic Equation of HMM Tagging
(Cont…)
• Using the previous three equations, we can derive the following equation for the most
probable tag sequence from a bigram tagger; as we will soon see, its two factors
correspond to the emission probability and the transition probability of the HMM.

$${t'}_1^n = \operatorname{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n) \qquad (1)$$

$$P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i) \qquad (2)$$

$$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \qquad (3)$$

Applying (2) and (3) to (1):

$${t'}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n) \approx \operatorname{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$$

where $P(w_i \mid t_i)$ is the emission probability and $P(t_i \mid t_{i-1})$ is the transition probability.
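A compact Viterbi decoder for this bigram HMM; a sketch under the assumptions above, with the probability estimates passed in (real implementations work with log probabilities to avoid numeric underflow):

```python
def viterbi(words, tagset, transition_prob, emission_prob):
    # v[i][t]: probability of the best tag sequence for words[:i+1] ending in tag t.
    v = [{t: transition_prob("<s>", t) * emission_prob(t, words[0]) for t in tagset}]
    back = [{}]
    for i in range(1, len(words)):
        v.append({})
        back.append({})
        for t in tagset:
            # Pick the previous tag that maximizes the path probability into t.
            prev = max(tagset, key=lambda p: v[i - 1][p] * transition_prob(p, t))
            v[i][t] = v[i - 1][prev] * transition_prob(prev, t) * emission_prob(t, words[i])
            back[i][t] = prev
    # Recover the best sequence by following backpointers from the best final tag.
    best = max(tagset, key=lambda t: v[-1][t])
    tags = [best]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))

# With the toy estimates from the counting sketch earlier:
# viterbi("Janet will back the bill".split(), list(emission_counts),
#         transition_prob, emission_prob)
# -> ['NNP', 'MD', 'VB', 'DT', 'NN']
```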
Estimating Probabilities
• Let’s walk through an example, seeing how these
probabilities are estimated and used in a sample
tagging task, before we return to the Viterbi algorithm.
• In HMM tagging, rather than using the full power of
HMM EM learning, the probabilities are estimated just
by counting on a tagged training corpus.
• For this example we’ll use the tagged WSJ corpus.
• The tag transition probabilities $P(t_i \mid t_{i-1})$ represent
the probability of a tag given the previous tag.
Estimating Probabilities (Cont…)
• The maximum likelihood estimate of a transition probability is computed
by counting, out of the times we see the first tag in a labeled corpus, how
often that tag is followed by the second:
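As an equation (the standard count-ratio maximum likelihood estimate):

$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$

For example, in the WSJ corpus a modal (MD) is followed by a base-form verb (VB) roughly 80% of the time, so $P(\mathrm{VB} \mid \mathrm{MD}) \approx 0.80$.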
Example
• Let’s now work through an example of
computing the best sequence of tags that
corresponds to the following sequence of
words:
• Janet will back the bill
• The correct series of tags is:
• Janet/NNP will/MD back/VB the/DT bill/NN
Example (Cont…)
[Table not reproduced: probabilities for the example, from WSJ counts.]
Example (Cont…)
This table is (slightly simplified) from counts in the WSJ corpus.
So the word Janet only appears as an NNP, back has 4 possible parts of speech,
and the word the can appear as a determiner or as an NNP (in titles like “Somewhere
Over the Rainbow” all words are tagged as NNP).
Example (Cont…)
• The following figure shows a schematic of the
possible tags for each word and the correct
final tag sequence.
[Figure not reproduced.]