Text Mining 2
Language Modeling
!
Madrid Summer School 2014
Advanced Statistics and Data Mining
!
Florian Leitner
florian.leitner@upm.es
License:
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
A Motivation for Applying
Statistics to NLP
Two sentences:
“Colorless green ideas sleep furiously.” (from Noam Chomsky’s 1955 thesis)
“Furiously ideas green sleep colorless.”
Which one is (grammatically) correct?
!
An unfinished sentence:
“Adam went went jogging with his …”
What is a correct phrase to complete this sentence?
31
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Incentive and Applications
!
• Manual rule-based
language processing would
become cumbersome.
• Word frequencies
(probabilities) are easy to
measure, because large
amounts of texts are
available to us.
• Modeling language based on
probabilities enables many
existing applications:
‣ Machine Translation
‣ Voice Recognition
‣ Predictive Keyboards
‣ Spelling Correction
‣ Part-of-Speech (PoS) Tagging
‣ Linguistic Parsing & Chunking
• (analyzes the grammatical structure of
sentences)
32
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Zipf’s (Power) Law
Word frequency is inversely
proportional to its rank (ordered
by counts)
Words of lower rank “clump”
within a region of a document
Word frequency is inversely
proportional to its length
Almost all words are rare
33
f ∝ 1 ÷ r
k = E[f × r]   (E = Expected Value, the “true” mean)
[Figure: word frequency vs. rank; the product k reflects the corpus’ “lexical richness”]
the, be, to, of, …  vs.  …, malarkey, quodlibet
This is what makes language modeling hard!
dim. reduction
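A quick empirical check (a minimal sketch, assuming NLTK and its Brown corpus are installed): rank words by count and print frequency × rank, which stays roughly in the same order of magnitude, as Zipf’s law predicts.

from collections import Counter
from nltk.corpus import brown

freqs = Counter(w.lower() for w in brown.words() if w.isalpha())
ranked = freqs.most_common()  # [(word, count), ...], highest count first (rank 1, 2, ...)

for rank in (1, 10, 100, 1000, 10000):
    word, count = ranked[rank - 1]
    print(rank, word, count, rank * count)  # rank × frequency approximates the constant k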
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Conditional Probability
for Dependent Events
34
Conditional Probability
P(X | Y) = P(X ∩ Y) ÷ P(Y)
*Independence
P(X ∩ Y) = P(X) × P(Y)
P(X | Y) = P(X)
P(Y | X) = P(Y)
Joint Probability
P(X ∩ Y) = P(X, Y) = P(X × Y)
The multiplication principle
for dependent events*:
P(X ∩ Y) = P(Y) × P(X | Y)
and
therefore, by using a little algebra:
[Venn diagram: events X and Y]
whole n-gram vs. next word
REMINDER
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Chain Rule of
Conditional Probability
“The Joint Probability of Dependent Variables”
(don’t say I told you that…)
P(A,B,C,D) = P(A) × P(B|A) × P(C|A,B) × P(D|A,B,C)
35
P(B,Y,B) = P(B) × P(Y|B) × P(B|B,Y)
1/10 = 2/5 × 3/4 × 1/3
P(B) = 2/5
P(Y|B) = 3/4
P(B|B,Y) = 1/3
P(w1, …, wn) = ∏i=1…n P(wi | w1, …, wi-1)
NB: the ∑ of all “trigrams” will be 1!
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Probabilistic Language
Modeling
‣ Manning & Schütze. Statistical Natural Language Processing. 1999
• A sentence W is defined as a sequence of words w1, …, wn
• Probability of next word wn in a sentence is: P(wn |w1, …, wn-1)
‣ a conditional probability
• The probability of the whole sentence is: P(W) = P(w1, …, wn)
‣ the chain rule of conditional probability
• These counts & probabilities form the language model [for a given document collection (= corpus)].
‣ the model variables are discrete (counts)
‣ only needs to deal with probability mass (not density)
36
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Words, Tokens, Shingles
and N-Grams
37
Text with words: “This is a sentence.”
↓ “tokenization” (NB: character-based, regular expressions, probabilistic, …)
Tokens: [This] [is] [a] [sentence] [.]
2-Shingles: [This is] [is a] [a sentence] [sentence .]
3-Shingles: [This is a] [is a sentence] [a sentence .]
a.k.a. k-Shingling
Character N-Grams
all trigrams of “sentence”: [sen, ent, nte, ten, enc, nce]
Token N-Grams
REMINDER
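A minimal sketch of token shingles and character n-grams in plain Python (the helper names shingles and char_ngrams are illustrative, not NLTK API):

def shingles(tokens, k):
    """All contiguous k-token shingles (token n-grams)."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]

def char_ngrams(word, n):
    """All contiguous character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

tokens = ["This", "is", "a", "sentence", "."]
print(shingles(tokens, 2))         # [('This', 'is'), ('is', 'a'), ('a', 'sentence'), ('sentence', '.')]
print(char_ngrams("sentence", 3))  # ['sen', 'ent', 'nte', 'ten', 'enc', 'nce']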
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Modeling the Stochastic Process of
“Generating Words”
Using the Chain Rule
“This is a long sentence having many…” ➡
P(W) = P(this) ×
P(is | this) ×
P(a | this, is) ×
P(long | this, is, a) ×
P(sentence | this, is, a, long) ×
P(having | this, is, a, long, sentence) ×
P(many | this, is, a, long, sentence, having) ×
….
38
n-grams for n > 5:
Insufficient Data
(and expensive to calculate)
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Markov Property
A Markov process is a stochastic process whose future state only
depends on the current state, but not its past.
39
[State diagram: three states S1, S2, S3 with transitions between them]
a state = a token (i.e., any word, symbol, …) in S
transitions = likelihood of that (next) token given the current token
3 Words (S1,2,3) ➡ Unigram Model
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Modeling the Stochastic Process of
“Generating Words”
Assuming it is Markovian
!
!
• 1st Order Markov Chains
‣ Bigram Model, k=1, P(be | this)
• 2nd Order Markov Chains
‣ Trigram Model, k=2, P(long | this, be)
• 3rd Order Markov Chains
‣ Quadrigram Model, k=3, P(sentence | this, be, long)
• …
40
[DBN diagrams: Ti ← Ti−1 (bigram); Ti ← Ti−1, Ti−2 (trigram); Ti ← Ti−1, Ti−2, Ti−3 (quadrigram); Ti alone (unigram “model”)]
dependencies could span over a dozen tokens, but these sizes are generally sufficient to work by
∏ P(wi | w1, …, wi-1) ≈ ∏ P(wi | wi-k, …, wi-1)
NB: these are the Dynamic Bayesian
Network representations of the
Markov Chains (see lesson 5)
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Calculating n-Gram Probabilities
41
• Unigrams: P(wi) = count(wi) ÷ N   (N = total word count)
• Bigrams: P(wi | wi-1) = count(wi-1, wi) ÷ count(wi-1)
• n-Grams (n = k+1): P(wi | wi-k, …, wi-1) = count(wi-k, …, wi) ÷ count(wi-k, …, wi-1)   (k = n-gram size − 1)
‣ Language Model: P(W) = ∏ P(wi | wi-k, …, wi-1)
Practical Tip: transform all probabilities to logs to avoid underflows and work with addition.
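A minimal sketch of these maximum-likelihood counts in plain Python, on a made-up toy corpus, working in log2 space as the practical tip suggests (not the lecture’s NLTK exercise):

import math
from collections import Counter

tokens = "this is a sentence and this is another sentence".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def logprob_bigram(w_prev, w):
    # P(w | w_prev) = count(w_prev, w) / count(w_prev)
    # NB: a zero count would blow up here -- hence the smoothing slides that follow
    return math.log2(bigrams[(w_prev, w)] / unigrams[w_prev])

def logprob_sentence(words):
    # log2 P(W) ≈ sum of log2 P(wi | wi-1); the first word is skipped here for brevity
    return sum(logprob_bigram(p, w) for p, w in zip(words, words[1:]))

print(logprob_sentence("this is a sentence".split()))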
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Calculating Probabilities
for the Initial Tokens
• How to calculate the first/last few tokens in n-gram models?
‣ “This is a sentence.” ➡ P(this | ???)
‣ P(wn|wn-k, …, wn-1)
• Fall back to lower-order Markov models
‣ P(w1|wn-k,…,wn-1) = P(w1); P(w2|wn-k,…,wn-1) = P(w2|w1); …
• Fill positions prior to n = 1 with a generic “start token”.
‣ left and/or right padding
‣ conventionally, something like “<s>” and “</s>”, “*” or “·” is used
• (but anything will do, as long as it does not collide with a possible, real token in your system)
42
NB: it is important to maintain the sentence terminal token
to generate robust probability distributions; do not drop it!
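A minimal sketch of such padding in plain Python (the pad/ngrams helpers and the <s>/</s> symbols are illustrative choices); note that the sentence-terminal token is kept:

def pad(tokens, n, start="<s>", end="</s>"):
    """Prepend n-1 generic start tokens and append one sentence-terminal token."""
    return [start] * (n - 1) + tokens + [end]

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

padded = pad(["This", "is", "a", "sentence", "."], n=3)
print(ngrams(padded, 3))
# [('<s>', '<s>', 'This'), ('<s>', 'This', 'is'), ..., ('sentence', '.', '</s>')]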
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Unseen n-Grams and the
Zero Probability Issue
• Even for unigrams, unseen words will occur sooner or later
• The longer the n-grams, the sooner unseen cases will occur
• “Colorless green ideas sleep furiously.” (Chomsky, 1955)
• As unseen tokens have zero probability, P(W) = 0, too
!
‣ Intuition/Idea: Divert a bit of the overall probability mass of
each [seen] token to all possible unseen tokens
• Terminology: model smoothing (Chen & Goodman, 1998)
43
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Additive / Laplace
Smoothing
• Add a smoothing factor α to all n-gram counts:
!
!
• V is the size of the vocabulary
‣ the number of unique words in the training corpus
• α ≤ 1 usually; if α =1, it is known as Add-One Smoothing
• Very old: first suggested by Pierre-Simon Laplace in 1816
• But it performs poorly (compared to “modern” approaches)
‣ Gale & Church, 1994
44
P(wi | wi-k, …, wi-1) = (count(wi-k, …, wi) + α) ÷ (count(wi-k, …, wi-1) + αV)
nltk.probability.LidstoneProbDist
nltk.probability.LaplaceProbDist (α = 1)
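A minimal sketch of add-α smoothing for bigrams on a made-up toy corpus (the α value is arbitrary; the NLTK classes above implement the same idea):

from collections import Counter

tokens = "this is a sentence and this is another sentence".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

def p_add_alpha(w_prev, w, alpha=0.1):
    # (count(w_prev, w) + alpha) / (count(w_prev) + alpha * V)
    return (bigrams[(w_prev, w)] + alpha) / (unigrams[w_prev] + alpha * V)

print(p_add_alpha("this", "is"))       # seen bigram
print(p_add_alpha("this", "another"))  # unseen bigram: small, but no longer zero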
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Linear Interpolation
Smoothing
• A “Finite Mixture Model” in terms of statistics.
‣ a.k.a. Jelinek-Mercer Smoothing (1980)
• Sum the conditional probability over all n n-grams (the uni-, bi-, trigram, etc.) at
the current token, using the parameter λ for interpolation:
!
!
!
‣ With: ∑ λκ = 1
• Unseen n-gram probabilities can now be interpolated from their “root”
‣ i.e., the n-gram’s 0, 1, …, κ = n-1 interpolated probabilities
• Tune the λ parameters on a held-out development set using Expectation
Maximization or any other MLE optimization technique.
• The “baseline model” in Chen & Goodman’s (1998) study.
45
P(wi | wi-k, …, wi-1) = ∑κ=0…k λκ × P(wi | wi-κ, …, wi-1)
[nltk.probability.WittenBellProbDist]
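A minimal sketch of interpolating uni-, bi- and trigram estimates on a toy corpus; the λ weights here are made up, whereas in practice they would be tuned on held-out data as described above:

from collections import Counter

tokens = "this is a sentence and this is another sentence".split()
N = len(tokens)
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p_interpolated(w2, w1, w, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas  # must sum to 1
    p_uni = uni[w] / N
    p_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

print(p_interpolated("this", "is", "a"))             # seen trigram
print(p_interpolated("another", "sentence", "is"))   # unseen trigram, backed by lower orders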
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Advanced Smoothing
Techniques
• Use the frequencies of extremely rare items to estimate
the frequencies of unseen items
• Approach is generally rather costly to calculate
• Good-Turing Estimate (1953 [and WW2 Bletchley Park])
‣ Reallocate probability mass from higher-order n-grams to their lower-order “stems”.
‣ Needs a MLE (max. likelihood estimate) when no higher-order n-gram count is available.
‣ Higher order n-gram counts tend to be noisy (imprecise because rare)
‣ GTE ≈ “Smoothed Add-One”, and the foundation of most advanced smoothing techniques
46
smoothed count: c* = (c + 1) × Nc+1 ÷ Nc
(c = original count; Nc = number of unique n-grams that occur exactly c times, Nc+1 = exactly c+1 times)
nltk.probability.SimpleGoodTuringProbDist
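A minimal sketch of the Good-Turing count adjustment on a toy corpus; the fallback to the raw count is an assumption for the undefined cases:

from collections import Counter

tokens = "this is a sentence and this is another sentence".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))

# Nc: how many distinct bigrams occur exactly c times ("frequency of frequencies")
freq_of_freqs = Counter(bigram_counts.values())

def good_turing(c):
    if freq_of_freqs[c] == 0 or freq_of_freqs[c + 1] == 0:
        return float(c)  # fall back to the raw (MLE) count where the estimate is undefined
    return (c + 1) * freq_of_freqs[c + 1] / freq_of_freqs[c]

for c in sorted(freq_of_freqs):
    print(c, "->", round(good_turing(c), 3))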
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Interpolated Kneser-Ney
Smoothing
• A lower-order probability should be smoothed proportional to Γ,
the number of different words that it follows.
• In “San Francisco”, “Francisco” alone might not be significant, but will lead to a high unigram probability for
“Francisco”.
!
!
!
!
‣ Subtracts some fraction δ of probability mass from non-zero counts.
‣ Lower-order counts are made significantly smaller than their higher-order counts.
• Reduces the mass of “Francisco” with an artificially high unigram probability (because it almost exclusively
occurs as “San Francisco”), so it is less likely to be used to interpolate unseen cases.
‣ (Chen & Goodman, 1998) interpretation of (Kneser & Ney, 1995)
47
Γ(wi-k, …, wi-1, wi) = |{wi-1 : count(wi-1, wi) > 0}|
‣ the number of different words wi-1 that precede wi (let’s call this a word’s history)
Nk = ∑ count(wi-k, …, wi)
‣ the total count of n-grams of size n = k+1
PKN(wi | wi-k, …, wi-1) = max{count(wi-k, …, wi) − δ, 0} ÷ Nk + Γ(wi-k, …, wi-1, wi) ÷ Nk × PKN(wi | wi-k+1, …, wi-1)
‣ 1. steal some probability… 2. …reassign it to its lower-order terms… 3. …proportional to the word wi’s history
nltk.probability.KneserNeyProbDist (NLTK3 only)
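A minimal sketch of interpolated Kneser-Ney for bigrams in its common textbook form (absolute discount δ plus a continuation probability); it follows the standard formulation rather than the slide’s Nk-normalized rendering, and the toy corpus and δ value are assumptions:

from collections import Counter, defaultdict

tokens = "this is a sentence and this is another sentence".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

# continuation sets: which distinct words precede w / follow a history h
preceders = defaultdict(set)
followers = defaultdict(set)
for (h, w) in bigrams:
    preceders[w].add(h)
    followers[h].add(w)

def p_kn(h, w, delta=0.75):
    # discounted bigram estimate (no sentence padding here, purely illustrative)
    discounted = max(bigrams[(h, w)] - delta, 0.0) / unigrams[h]
    # interpolation weight: the mass removed by discounting, spread over seen continuations
    lam = delta * len(followers[h]) / unigrams[h]
    # continuation probability: proportional to how many different words precede w
    p_cont = len(preceders[w]) / len(bigrams)
    return discounted + lam * p_cont

print(p_kn("this", "is"))        # frequent continuation
print(p_kn("this", "sentence"))  # unseen bigram, still > 0 via the continuation term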
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Stupid Backoff Smoothing
for Large-scale Datasets
• Brants et al. Large language models in machine translation. 2007
!
!
!
‣ Brants set α = 0.4
• Intuition is similar to Kneser-Ney Smoothing, but no scaling with a
word’s history is performed to make the method efficient.
• Efficiently smoothes billions of n-grams (>10¹² [trillions] of tokens)
• Other useful large-scale techniques:
‣ Compression, e.g., Huffman codes (integers) instead of (string) tokens
‣ Transform (string) tokens to (integer) hashes (at the possible cost of collisions)
48
PSB(wi | wi-k, …, wi-1) = P(wi | wi-k, …, wi-1) if count(wi-k, …, wi) > 0
PSB(wi | wi-k, …, wi-1) = α × PSB(wi | wi-k+1, …, wi-1) otherwise ← the “backoff”
NB that if you were backing off from a tri- to a unigram, you’d apply α·α·P(wi)
[nltk.model.NgramModel] (NB: broken in NLTK3!)(Katz Smoothing)
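A minimal sketch of Stupid Backoff scores over uni-, bi- and trigrams on a toy corpus; note these are relative-frequency scores, not true probabilities:

from collections import Counter

tokens = "this is a sentence and this is another sentence".split()
N = len(tokens)
# n-gram counts for n = 1, 2, 3, keyed by tuples of tokens
ngram_counts = {n: Counter(zip(*(tokens[i:] for i in range(n)))) for n in (1, 2, 3)}

ALPHA = 0.4  # Brants et al.'s fixed backoff factor

def stupid_backoff(context, w):
    """S(w | context): back off with factor alpha until the n-gram has been seen."""
    if not context:
        return ngram_counts[1][(w,)] / N
    n = len(context) + 1
    full = tuple(context) + (w,)
    if ngram_counts[n][full] > 0:
        return ngram_counts[n][full] / ngram_counts[n - 1][tuple(context)]
    return ALPHA * stupid_backoff(context[1:], w)

print(stupid_backoff(("this", "is"), "a"))         # seen trigram
print(stupid_backoff(("this", "is"), "sentence"))  # backs off to bigram, then unigram (α·α·P)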
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Final Piece:
Language Model Evaluation
• How good is the model you just trained?
• Did your update to the model do any good…?
• Is your model better than someone else’s?
‣ NB: You should compare on the same test set!
49
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Extrinsic Model Evaluation:
Error Rate
• Extrinsic evaluation: minimizing the Error Rate
‣ evaluates a model’s error frequency
‣ estimates the model’s per-word error rate by comparing the generated
sequences to all true sequences (which cannot be established) using a manual
classification of errors (therefore, “extrinsic”)
‣ time consuming, but can evaluate non-probabilistic approaches, too
50
a “perfect prediction” would evaluate to zero
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Intrinsic Model Evaluation:
Cross-Entropy / Perplexity
• Intrinsic evaluation: minimizing Perplexity (Rubinstein, 1997)
‣ compares a model’s probability distribution (to the “perfect” model)
‣ estimates a distance based on cross-entropy (Kullback-Leibler divergence)
between the generated word distribution and the true distribution (which
cannot be observed) using the empirical distribution as a proxy
‣ efficient, but can only evaluate probabilistic language models and is only an
approximation of the model quality
51
again, a “perfect prediction” would evaluate to zero
perplexity = “confusion”
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Shannon Entropy
• Answers the questions:
‣ “How much information (bits) will I gain when I see wn?”
‣ “How predictable is wn from its past?”
!
• Each outcome provides -log2 P bits of information (“surprise”).
‣ Claude E. Shannon, 1948
• The more probable an event (outcome), the less information (“surprise”) it provides.
• A certain event (p=1 or 0) has zero entropy (“no surprise”).
!
• Therefore, the entropy H in our model P for a sentence W is:
‣ H = 1/N × -∑ P(wn|w1, …, wn-1) log2 P(wn|w1, …, wn-1)
52
Image Source:
WikiMedia Commons,
Brona Brejova
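A minimal sketch of surprisal (−log2 p) and the entropy of a toy word distribution (the distribution itself is made up):

import math

def surprisal(p):
    return -math.log2(p)  # bits of information gained when the outcome occurs

dist = {"the": 0.5, "be": 0.25, "quodlibet": 0.25}

for w, p in dist.items():
    print(w, round(surprisal(p), 2), "bits")

entropy = sum(p * surprisal(p) for p in dist.values())
print("H =", entropy, "bits")  # 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits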
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
From Cross-Entropy
to Model Perplexity
• Cross-entropy compares the model probability distribution P to
the true distribution PT:
‣ H(PT, P, W) = 1/N × -∑ PT(wn|wn-k, …, wn-1) log2 P(wn|wn-k, …, wn-1)
• CE can be simplified if the “true” distribution is a “stationary,
ergodic process”:
‣ H(P, W) = 1/N × -∑ log2 P(wn|wn-k, …, wn-1)
• The relationship between perplexity PPX and cross-entropy is
defined as:
‣ PPX(W) = 2^H(P, W) = P(W)^(−1/N)
• where W is the sentence, P(W) is the Markov model, and N is the number of tokens in it
53
an “unchanging” language -> a rather naïve assumption
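A minimal sketch that turns these formulas into code: per-word cross-entropy and perplexity of a toy model over a test sentence (the uniform model is a stand-in for any smoothed estimator):

import math

def cross_entropy(model, words):
    logs = [math.log2(model(prev, w)) for prev, w in zip(words, words[1:])]
    return -sum(logs) / len(logs)

def perplexity(model, words):
    return 2 ** cross_entropy(model, words)

# a made-up model for illustration: every next word has probability 1/8
uniform_model = lambda prev, w: 1 / 8
test = "this is a sentence .".split()
print(cross_entropy(uniform_model, test))  # 3.0 bits/word
print(perplexity(uniform_model, test))     # 8.0 equiprobable choices at each step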
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Interpreting Perplexity
• The lower the perplexity, the better (“less surprised”) the
model.
• Perplexity is an indicator of the number of equiprobable
choices at each step.
‣ PPX = 26 for a model generating an infinite random sequence of Latin letters
• “[A-Z]+”
‣ Perplexity produces “big numbers” rather than cross-entropy’s “small bits”
• Typical (bigram) perplexities in English texts range from 50 to almost 1000, corresponding to a cross-entropy from about 5.6 to 10 bits/word.
‣ Chen & Goodman, 1998
‣ Manning & Schütze, 1999
54
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Practical: Develop your own
Language Model
• Build a 3rd order Markov model with Linear Interpolation smoothing and evaluate the perplexity.
• Note: as of NLTK 3.0a4, language modeling is “broken”
‣ try, e.g.: http://www.speech.sri.com/projects/srilm/
• nltk.NgramModel(n, train, pad_left=True,
pad_right=False, estimator=None, *eargs, **ekwds)
‣ pad_* pads sentence starts/ends of n-gram models with empty strings
‣ estimator is a smoothing function that maps a ConditionalFreqDist to a
ConditionalProbDist
‣ eargs and ekwds are extra arguments and keywords for creating the estimator’s
CPDs.
• How do smoothing parameters and (smaller…) n-gram size change
perplexity?
55
Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Write a simple Spelling
Corrector in pure Python
• http://www.norvig.com/spell-correct.html
• http://www.norvig.com/spell.py
!
from nltk.corpus import brown
from collections import Counter  # was “Count”: the class is collections.Counter

CORPUS = brown.words()  # all tokens of the Brown corpus
NWORDS = Counter(w.lower() for w in CORPUS if w.isalpha())  # word → frequency
56
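The snippet above only builds the frequency table NWORDS; the rest of a corrector, condensed from the approach described at the URLs above (this compressed version is illustrative, not Norvig’s exact code):

import string

def edits1(word):
    # all strings one deletion, transposition, replacement or insertion away
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # prefer the word itself if known, else known words one edit away, else give up
    candidates = ({word} & NWORDS.keys()) or (edits1(word) & NWORDS.keys()) or {word}
    return max(candidates, key=NWORDS.get)

print(correct("sentense"))  # should suggest “sentence” given the Brown corpus counts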
Bertrand Russell, 1947
“Not to be absolutely certain is, I think, one of the
essential things in rationality.”
