Machine Learning with Applications in Categorization, Popularity and Sequence Labeling
(linear models, decision trees, ensemble methods, evaluation)
Dr. Nicolas Nicolov
<1st_last@yahoo.com>
Goals
• Introduce important ML concepts
• Illustrate ML techniques through examples in:
– Categorization
– Popularity
– Sequence labeling
(tutorial aims to be self-contained and to explain the notation)
2
Outline
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
– Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
– Classification Decision Trees, Regression Trees
• Boosting
– AdaBoost
• Ranking evaluation
– Kendall tau and Spearman’s coefficient
• Sequence labeling
– Hidden Markov Models (HMMs)
3
EXAMPLES OF MACHINE LEARNING
Why?– Get a flavor of the diversity of areas where ML is applied.
4
Sequence Labeling
George W. Bush discussed Iraq
PER PER PER _ GPE
<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>
George W. Bush discussed Iraq
Geo-Political Entity
(like search query analysis)
5
Spam
www.dietsthatwork.com
www . dietsthatwork . com
www . diets that work . com
SPAM!
further segmentation
classification
6
Tokenization
What!?I love the iphone:-)
What !? I love the iphone :-)
How difficult can that be? — 98.2% [Zhang et al. 2003]
NO TRESSPASSING
VIOLATORS WILL
BE PROSECUTED
7
NL Parsing
Unlike my sluggish Chevy the Audi handles the winding mountain roads superbly
PREP
POSS
MOD
DET
SUBJ DET
MOD
MOD
MANR
DOBJ
CONTR
syntactic structure
8
State Transitions
(figure: parser configurations showing the stack λ and buffer β before and after each transition)
LEFTARC:
RIGHTARC:
NOARC:
SHIFT:
using ML to make the decision
which action to take
9
Two Ladies in a Men’s Club
10
We serve men
IOBJ
We serve men
DOBJSUBJ
SUBJ
We serve food to men.
We serve our community.
serve —IndirectObject men
We serve organic food.
We serve coffee to connoisseurs.
serve —DirectObject men
11
Audi is an automaker that makes luxury cars
and SUVs. The company was born in
Germany .
It was established by August Horch in
1910. Horch had previously founded another
company and his models were quite
popular. Audi started with four cylinder
models. By 1914, Horch's new cars were
racing and winning.
August Horch left the Audi company in
1920 to take a position as an industry
representative for the German motor
vehicle industry federation.
Currently Audi is a subsidiary of the
Volkswagen group and produces cars of
outstanding quality.
Coreference
12
Parts of Objects (Meronymy)
[…] the interior seems upscale with leatherette upholstery that looks and
feels better than the real cow hide found in more expensive vehicles, a
dashboard accented by textured soft-touch materials, a woven mesh
headliner, and other materials that give the New Beetle’s interior a
sense of quality. […] Finally, and a big plus in my book, both front seats were
height adjustable, and the steering column tilted and telescoped for
optimum comfort.
13
Sentiment Analysis
I love pineapple nearly as much as I hate bananas.
POSITIVE sentiment regarding topic pineapple.
(charts: positive and negative sentiment volumes for Xbox)
14
Chinese Sentiment
Car aspects Sentiment categories
Sentence
15
16
17
Categorization
• High-level task:
– Given a restaurant what is its restaurant sub-category?
• Encoding entities with features
• Feature selection
• Linear models
non-standard order
“Though this be madness,
yet there is method in't.”
18
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
– Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
– Classification Decision Trees, Regression Trees
• Boosting
– AdaBoost
• Ranking evaluation
– Kendall tau and Spearman’s coefficient
• Sequence labeling
– Hidden Markov Models (HMMs)
19
ENCODING OBJECTS WITH FEATURES
Why?– ML algorithms are “generic”; most of them are cast as solutions around vector encodings of the
domain objects. Regardless of the ML algorithm we will need to represent/encode the domain objects as
feature vectors. How well we do this (the quality of features) directly impacts system performance.
20
Flat
Object
Encoding
1 0 0 1 1 1 0 1 …37
Machine learning (training) instance/example/observation.
Can be a set;
object can belong
to several classes.
Number of
features can
be millions.
21
Structured Objects
to Strings
to Features
a b c d e
Structured object:
f1
f2
f3
f4
f5
f6
“f2:f4>a”
“f2:f4>b”
“f2:f4>c”
…
“f2:f4>a_b”
“f2:f4>b_c”
“f2:f4>c_d”
…
“f2:f4>a_b_c”
“f2:f4>b_c_d”
uni-grams
bi-grams
tri-grams
Feature string Feature index
*DEFAULT* 0
… …
f2:f4>a 100
f2:f4>b 101
f2:f4>c 102
… …
f2:f4>a_b 105
f2:f4>b_c 106
f2:f4>c_d 107
… …
f2:f4>a_b_c 109
Read as field “f2:f4” contains feature “a”.
Table can be quite large.
22
Sliding Window (bi-grams)
SkyCity at the Space Needle
SkyCity at the Space Needle^ $
add initial “^” and final “$” tokens
SkyCity at the Space Needle^ $
SkyCity at the Space Needle^ $
SkyCity at the Space Needle^ $
SkyCity at the Space Needle^ $
sliding window
23
Example: Feature Templates
public static List<string> NGrams( string field )
{
    var features = new List<string>();
    string[] tokens = field.Split( spaceCharArr, System.StringSplitOptions.RemoveEmptyEntries );
    features.Add( string.Join( "", field.Split( SPLIT_CHARS ) ) ); // the entire field
    string unigram = string.Empty, bigram = string.Empty, previous1 = "^", previous2 = "^", trigram;
    for (int i = 0; i < tokens.Length; i++)
    {
        unigram = tokens[ i ];
        features.Add( unigram );
        bigram = previous1 + "_" + unigram;
        features.Add( bigram );
        if ( i >= 1 ) { trigram = previous2 + "_" + bigram; features.Add( trigram ); }
        previous2 = previous1;
        previous1 = unigram;
    }
    features.Add( unigram + "_$" );
    features.Add( bigram + "_$" );
    return features;
}
initial tri-gram is: "^_tokens[0]_tokens[1]"
initial bigram is: "^_tokens[0]"
last trigram is: "tokens[tokens.Length-2]_tokens[tokens.Length-1]_$"
could add field name as argument and prefix all features
24
The Art of Feature Engineering:
Disjunctive Features
• Useful feature = triggers often and predominantly with a particular class.
• Rarely occurring (but indicative of a class) features can be
combined in a disjunction. This results in:
– Need for less data to achieve good performance.
– Final system performance (with all available data) is higher.
• How can we get insights about such features: Error analysis!
Regex ITALIAN_FOOD = new Regex(@"al dente|agnello|alfredo|antipasti|antipasto|arrabbiata|bistecca|bolognese|
branzino|caprese|carbonara|carpaccio|cioppino|cozze|fettuccine|filetto|focaccia|frutti di mare|funghi|
gnocchi|gorgonzola|insalata|lasagna|linguine|linguini|macaroni|minestrone|mozzarella|ossobuco|panini| panino|
parmigiana|pasticcio|pecorino|penne|pepperoncini|pesce|pesto|piatti|piatto|piccata|polpo|pomodori|prosciutto|
radicchio|ravioli|ricotta|rigatoni|risotto|saltimbocca|scallopini|scaloppini|spaghetti|tagliatelle|tiramisu|
tortellini|vitello|vongole");
if (ITALIAN_FOOD.Match(entity.description).Success) features.Add("Italian_Food_Matched_Description");
It is up to us what to call the feature.
Triggering of the feature.
25
instance( class= 7, features=[0,300857,100739,200441,...])
instance( class=99, features=[0,201937,196121,345758,13,...])
instance( class=42, features=[0,99173,358387,1001,1,...])
...
Generic Nature of ML Systems
human sees
computer “sees”
Default feature always triggers.
Number of features that trigger for individual
instances are often not the same.
Indices of (binary) features that trigger.
26
Training Data
Instance /w outcome.
27
Feature Selection
• Templates: powerful way to get lots of features.
• We get too many features.
• Danger of overfitting.
• Feature selection:
– CountCutOff.
– TFxIDF.
– Mutual information.
– Information gain.
– Chi square.
Doing well on seen data but poorly on unseen data.
e.g., 20M for dependency parsing.
Automatic ways of finding discriminative features.
We will examine in detail the implementation of this.
28
Mutual Information
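For reference, the standard mutual information between a binary feature F and the class C, as used for feature selection, is:

I(F; C) = \sum_{f \in \{0,1\}} \sum_{c} P(f, c) \log \frac{P(f, c)}{P(f)\, P(c)}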
29
Information Gain
Balances effects of feature triggering for an object with
the effects of feature being absent for an object.
30
Chi Square
float Chi2(int a, int b, int c, int d) {
    // the square is written out explicitly (^ is XOR in C#, not exponentiation)
    float num = (float)(a + b + c + d) * (a * d - b * c) * (a * d - b * c);
    float den = (float)(a + b) * (a + c) * (c + d) * (b + d);
    return num / den;
}
31
Exponent(Log) Trick
While the final output may not be big, intermediate results are. Solution:
float Chi2(int a, int b, int c, int d)
{
    // naive version: the int products can overflow for large counts
    float num = (float)(a + b + c + d) * (a * d - b * c) * (a * d - b * c);
    float den = (float)(a + b) * (a + c) * (c + d) * (b + d);
    return num / den;
}
float Chi2_v2(int a, int b, int c, int d)
{
    // log-space version: sums of logs instead of products, exponentiated at the end
    double total = a + b + c + d;
    double n = Math.Log(total);
    double num = 2.0 * Math.Log(Math.Abs((long)a * d - (long)b * c));
    double den = Math.Log(a + b) + Math.Log(a + c) + Math.Log(c + d) + Math.Log(b + d);
    return (float) Math.Exp(n + num - den);
}
32
Chi Square: Score per Feature
33
Chi Square Feature Selection
int[] featureCounts = new int[ numFeatures ];
int numLabels = labelIndex.Count;
int[] classTotals = new int[ numLabels ]; // instances with that label.
float[] classPriors = new float[ numLabels ]; // class priors: classTotals[label]/numInstances.
int[,] counts = new int[ numLabels, numFeatures ]; // (label,feature) co-occurrence counts.
int numInstances = instances.Count;
...
float[] weightedChiSquareScore = new float[ numFeatures ];
for (int f = 0; f < numFeatures; f++) // f is a feature index
{
float score = 0.0f;
for (int labelIdx = 0; labelIdx < numLabels; labelIdx++)
{
int a = counts[ labelIdx, f ];            // feature present, label present
int b = classTotals[ labelIdx ] - a;      // feature absent, label present
int c = featureCounts[ f ] - a;           // feature present, label absent
int d = numInstances - ( a + b + c );     // feature absent, label absent
if (a >= MIN_SUPPORT && b >= MIN_SUPPORT) { // MIN_SUPPORT = 5
score += classPriors[ labelIdx ] * Chi2( a, b, c, d );
}
}
weightedChiSquareScore[ f ] = score;
}
Do a pass over the data and collect above counts.
Weighted average across all classes.
34
⇒ Summary: Encoding
• Object representation is crucial.
• Humans: good at suggesting features (templates).
• Computers: good at filtering (feature selection).
• Feature engineering: Ensuring systems use the “right”
features.
The system designer does not have to worry about which feature is more
important or useful, and the job is left to the learning algorithm to assign
appropriate weights to the corresponding features. The system designer’s job
is to define a set of features that is large enough to represent most of the
useful information, yet small enough to be manageable for the algorithms and
the infrastructure.
35
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
– Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
– Classification Decision Trees, Regression Trees
• Boosting
– AdaBoost
• Ranking evaluation
– Kendall tau and Spearman’s coefficient
• Sequence labeling
– Hidden Markov Models (HMMs)
36
MACHINE LEARNING
GENERAL FRAMEWORK
37
Machine Learning: Representation
classifier
prediction
(response/dependent variable).
Can be qualitative/quantitative
(classification/regression).
Complex decision making:
input/independent variable
38
Notation
39
TRAINING
Machine Learning
Online
System
object encoded with features
classifier
prediction
(response/dependent variable)
Model
Offline
Training
Sub-system
40
Classes of Learning Problems
• Classification: Assign a category to each item (Chinese |
French | Indian | Italian | Japanese restaurant).
• Regression: Predict a real value for each item (stock/currency
value, temperature).
• Ranking: Order items according to some criterion (web search
results relevant to a user query).
• Clustering: Partition items into homogeneous groups
(clustering twitter posts by topic).
• Dimensionality reduction: Transform an initial representation
of items into a lower-dimensional representation while
preserving some properties (preprocessing of digital images).
41
ML Terminology
• Examples: Items or instances used for learning or evaluation.
• Features: Set of attributes represented as a vector associated with an example.
• Labels: Values or categories assigned to examples. In classification the labels are categories; in
regression the labels are real numbers.
• Target: The correct label for a training example. This is extra data that is needed for supervised
learning.
• Output: Prediction label from input set of features using a model of the machine learning algorithm.
• Training sample: Examples used to train a machine learning algorithm.
• Validation sample: Examples used to tune parameters of a learning algorithm.
• Model: Information that the machine learning algorithm stores after training. The model is used
when predicting the output labels of new, unseen examples.
• Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is
separate from the training and validation data and is not made available in the learning stage.
• Loss function: A function that measures the difference/loss between a predicted label and a true
label. We will design the learning algorithms so that they minimize the error (cumulative loss across
all training examples).
• Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels. The
learning algorithm chooses one function among those in the hypothesis set to return after training.
Usually we pick a class of functions (e.g., linear functions) parameterized by a set of free parameters
(e.g., coefficients of the linear function) and pinpoint the final hypothesis by identifying the
parameters that minimize the error.
• Model selection: Process for selecting the free parameters of the algorithm (actually of the function
in the hypothesis set). 42
Classification
−
+
+
+
+
+
+
+
+ +
+
+ −
−
−
−
−
−
−
−
− −
−
−
decision boundary
Yes, this is mysterious at this point.
43
Multi-Class Classification
44
One-Versus-All (OVA)
For each category in turn, create a binary classifier
where an instance in the data belonging to the
category is considered a positive example, all other
examples are considered negative examples.
Given a new object, run all these binary classifiers
and see which classifier has the “highest
prediction”.
The scores from the different classifiers need to be
calibrated!
45
One-Versus-One (OVO)
For each pair of classes, create binary classifier
on data labeled as either of the classes.
How many such classifiers? For k classes: k(k-1)/2.
Given a new instance run all classifiers and
predict class with maximum number of wins.
46
Errors
“Nobody is perfect, but then again, who wants to be nobody.”
#misclassified examples
(penalty score of 1 for every misclassified example).
Average error across all instances.
Goal: Minimize the Error.
Beneficial to have differentiable loss function.
47
Error: Function of the Parameters
The cumulative error across all instances is a function of the parameters.
48
Evaluation
• Motivation:
– Benchmark algorithms (which system is better).
– Tuning parameters during training.
49
Evaluation Measures
Generalization error: The probability of misclassifying an instance selected according
to the distribution of the labeled instance space.
Training error: The percentage of training examples which are incorrectly classified.
Optimistically biased estimate, especially
if the inducer over-fits the (training) data.
Empirical estimation of the generalization error:
• Held-out method
• Re-sampling:
1. Random resampling
2. Cross-validation
50
Precision, Recall and F-measure
Let’s consider binary classification:
Space of all instances
Instances identified as
positive by the system.
Positive instances in reality.
System identified
these as positive
but got them
wrong
(false positive).
System identified
these as positive
but got them
correct
(true positive).
System identified
these as negative
but got them
wrong
(false negative).
System identified these as
negative and got them correct
(true negative).
General Setup
51
Accuracy, Precision, Recall,
and F-measure
Definitions
FP: false positives
TP:
true positives
FN: false negatives
TN: true negatives Precision:
Recall:
Accuracy:
F-measure: Harmonic mean of
precision and recall
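In terms of the TP/FP/FN/TN counts above, the standard definitions are:

\text{Precision} = \frac{TP}{TP + FP}, \quad
\text{Recall} = \frac{TP}{TP + FN}, \quad
\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}, \quad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}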
52
Accuracy vs. Prec/Rec/F-meas
Accuracy can be misleading for evaluating a model on data with an imbalanced class
distribution. When there are many more majority-class instances than minority-class instances,
always predicting the majority class gives good accuracy.
Precision and recall (together) are better indicators.
As a single aggregate number, the F-measure is dominated by the lower of precision and recall.
53
Extreme Cases for Precision & Recall
TP:
true positive
FN: false negatives
TN: true negatives
system actual
all instances
TP: true positives
system
actual
all instances
FP: false positives
Precision can be traded for recall and vice versa.
54
Sensitivity & Specificity
FP: false positives
TP:
true positives
FN: false negatives
TN: true negatives
[same as recall;
aka true positive rate]
False positive rate:
Definitions
[aka true negative rate]
False negative rate:
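For reference, the standard definitions in terms of the counts above are:

\text{Sensitivity (recall, true positive rate)} = \frac{TP}{TP + FN}, \quad
\text{Specificity (true negative rate)} = \frac{TN}{TN + FP}
\text{False positive rate} = \frac{FP}{FP + TN}, \quad
\text{False negative rate} = \frac{FN}{FN + TP}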
55
Venn Diagrams
John Venn (1880) “On the Diagrammatic and Mechanical
Representation of Propositions and Reasonings”, Philosophical
Magazine and Journal of Science, 5:10(59).
These visualization diagrams were introduced by John Venn:
What if there are three classes?
Four classes?
Six classes?
With more classes our visual intuitions
help less and less.
A subtle point: These are just the
actual/real classes without the system
classes drawn on top!
56
Confusion Matrix
Predicted class A Predicted class B Predicted class C
Actual class A
Number of instances
in the actual class A
AND predicted as
belonging to class A.
Number of instances
in the actual class A
BUT predicted as
belonging to class B.
…
Total number of actual
instances of class A
Actual class B … … …
Total number of actual
instances of class B
Actual class C … … …
Total number of actual
instances of class C
Total number of
instances predicted
as class A
Total number of
instances predicted
as class B
Total number of
instances predicted
as class C
Total number of instances
Shows how the predictions of instances of an actual class are distributed across all classes.
Here is an example confusion matrix for three classes:
Counts on the diagonal are the true positives for each class. Counts not on the diagonal are errors.
Confusion matrices can handle many classes. 57
Confusion Matrix:
Accuracy, Precision and Recall
Predicted class A Predicted class B Predicted class C
Actual class A 50 80 70 200
Actual class B 40 140 120 300
Actual class C 120 220 160 500
210 440 350 1000
Given a confusion matrix, it’s easy to compute accuracy, precision and recall:
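For instance, from the matrix above: accuracy = (50 + 140 + 160) / 1000 = 0.35; precision for class A = 50 / 210 ≈ 0.24 (the diagonal cell divided by the column total); recall for class A = 50 / 200 = 0.25 (the diagonal cell divided by the row total).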
Confusion matrices can, themselves, be confusing sometimes 58
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
– Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
– Classification Decision Trees, Regression Trees
• Boosting
– AdaBoost
• Ranking evaluation
– Kendall tau and Spearman’s coefficient
• Sequence labeling
– Hidden Markov Models (HMMs)
59
LINEAR MODELS
Why?– Linear models are good way to learn about core ML concepts.
60
Refresher: Vectors
point point
vector
vector
vector
points are also vectors.
sum of vectors
Equation of the line.
Can be re-written as:
vector notation
transpose
61
Refresher: Vectors (2)
Equation of the line.
Can be re-written as:
vector notation
Normal vector.
62
Refresher: Dot Product
float DotProduct(float[] v1, float[] v2) {
    float sum = 0.0f;
    for (int i = 0; i < v1.Length; i++) sum += v1[i] * v2[i];
    return sum;
}
63
Refresher: Pos/Neg Classes
Normal vector.
−
+
64
sgn Function
In mathematics:
We will use:
Informally drawn as:
65
Two Linear Models
Perceptron Linear regression
The features of an object have associated weights indicating their importance.
Signal:
66
Why “Regression”?
Why is the term for quantitative output prediction "regression"?
“That same year [1875], [Francis] Galton decided to recruit some of his friends for an experiment with
sweet peas. He distributed seeds among seven of them, asking them to plant the seeds and return the
offspring. Galton measured the baby seeds and compared their diameters to those of their parents. He
noticed a phenomenon that initially seems counter-intuitive: the large seeds tended to produce smaller
offspring, and the small seeds tended to produce larger offspring. A decade later he analyzed data from his
anthropometric laboratory and recognized the same pattern with human heights. After measuring 205
pairs of parents and their 928 adult children, he saw that exceptionally tall parents had kids who were
generally shorter than they were, while exceptionally short parents had children who were generally taller
than their parents.
After reflecting upon this, we can understand why it must be the case. If very tall parents always produced
even taller children, and if very short parents always produced even shorter ones, we would by now have
turned into a race of giants and midgets. Yet this hasn't happened. Human populations may be getting
taller as a whole – due to better nutrition and public health – but the distribution of heights within the
population is still contained.
Galton called this phenomenon ‘regression towards mediocrity in hereditary stature’. The concept is now
more generally known as regression to the mean.”
[A.Bellos pp.375]
67
On-Line (Sequential) Learning
• On-line = process one example at a time.
• Attractive for large scale problems.
Objective: Minimize cumulative loss:
return parameters
iteration (epoch/time).
Compute loss.
68
On-Line (Sequential) Learning (2)
Sometimes written out more explicitly:
return parameters
# passes over the data.
return parameters
for each data item.
69
Perceptron
−
+
+
+
+
+
+
+
+ +
+
+ −
−
−
−
−
−
−
−
− −
−
−
−
+
+
+
+
+
+
+
+ +
+
+
−
−
−
−
−
−
−
−
−
−
−
−
Linearly separable data: Non-linearly separable data:
+
+
+
−
−
−
70
First: Perceptron Update Rule
−
+
−
+
+
Simplification:
Lines pass through origin.
71
On-Line (Sequential) Learning
72
Perceptron Learning Algorithm
iteration (epoch/time).
return parameters
73
Perceptron Learning Algorithm
return parameters
(algorithm makes multiple passes over data.)
74
Perceptron Learning Algorithm (PLA)
Update weights:
while( mis-classified examples exist ):
Misclassified example means:
With the current weights
1. A challenge: Algorithm will not terminate for non-linearly separable data (outliers, noise).
2. Unstable: jump from good perceptron to really bad one within one update.
3. Attempting to minimize:
NP-hard.
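A minimal sketch of the perceptron update loop in Python (illustrative function names, not the code behind these slides), assuming NumPy feature vectors, labels y in {-1, +1}, and lines through the origin as in the simplification above:

import numpy as np

def perceptron_train(X, y, max_epochs=100):
    # X: (n_examples, n_features) matrix; y: labels in {-1, +1}
    w = np.zeros(X.shape[1])
    for epoch in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # misclassified (or on the boundary)
                w += y_i * x_i              # perceptron update rule
                mistakes += 1
        if mistakes == 0:                   # converged (only happens if data is separable)
            break
    return w

def perceptron_predict(w, x):
    return 1 if np.dot(w, x) > 0 else -1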
75
Perceptron
Weight update:
76
Looks Simple – Does It Work?
Number of updates by the Perceptron Algorithm
where:
Margin-based upper bound on updates:
Remarkable:
Does not depend on
dimension of feature
space!
Fact:
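The classic statement of this result (Novikoff's theorem): if every example satisfies ||x|| <= R and some unit-norm weight vector w* separates the data with margin gamma, i.e. y_i (w* . x_i) >= gamma for all i, then the number of updates is at most:

\#\text{updates} \le \left( \frac{R}{\gamma} \right)^2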
77
Compact Model Representation
void Save( StreamWriter w, int labelIdx, float[] weights )
{
w.Write( labelIdx );
int previousIndex = 0;
for (int i = 0; i < weights.Length; i++)
{
if (weights[ i ] != 0.0f) {
w.Write( " " + (i - previousIndex) + " " + weights[ i ] );
previousIndex = i;
}
}
}
Use float instead of double:
Store only non-zero weights (and indices):
Store non-zero weights and diff of indices:
Difference of indices.
Remember last index where the weight was non-zero .
78
Linear Classification Solutions
−
+
+
+
+
+
+
+
+ +
+
+ −
−
−
−
−
−
−
−
− −
−
−
Different solutions (infinitely many)
79
The Pocket Algorithm
A better perceptron algorithm:
Keep track of the error and update weights when we lower the error.
Compute error. Expensive step!
Only update the best weights
if we lower the error!
Access to the entire data needed!
80
Voted Perceptron
• Training as the usual perceptron algorithm (with some extra book-keeping).
• Decision rule:
iterations
81
Dual Perceptron: Intuitions
−
+
−
+ separating line.
+
+
+
+
+
++
−
−
−
−
−
−
normal vector
82
Dual Perceptron
return parameters
(algorithm makes multiple passes over data.)
Decision rule:
83
Exclusive OR (XOR) Function
Truth table: Inputs in and color-coding
of the output:
Challenge:
The data is not linearly separable
(no straight line can be drawn
that separates the green from the
blue points).
???
84
Solution for the Exclusive OR (XOR)
We introduce
another input
dimension:
Now the data is linearly separable:
85
Winnow Algorithm
iteration (epoch).
return parameters
Normalizing constant.
Multiplicative
update.
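A minimal sketch of Winnow's multiplicative update in Python, assuming binary features x in {0,1}^d, labels y in {0,1}, promotion/demotion factor alpha = 2 and threshold theta = d; this is a simplified variant for illustration, not necessarily the exact formulation on the slide:

import numpy as np

def winnow_train(X, y, alpha=2.0, max_epochs=100):
    # X: (n_examples, d) binary feature matrix; y: labels in {0, 1}
    d = X.shape[1]
    w = np.ones(d)                 # start with all weights equal to 1
    theta = d                      # classification threshold
    for epoch in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            pred = 1 if np.dot(w, x_i) >= theta else 0
            if pred != y_i:
                mistakes += 1
                if y_i == 1:
                    w[x_i == 1] *= alpha    # false negative: promote active features
                else:
                    w[x_i == 1] /= alpha    # false positive: demote active features
        if mistakes == 0:
            break
    return w, theta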
86
Training, Test Error and Complexity
Test error
Training error
Model complexity
87
Logistic Regression
Logistic function:
Target:
Data does not give the
probability explicitly:
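For reference, the logistic (sigmoid) function and the resulting model are:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
P(y = +1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})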
88
Logistic Regression
Data likelihood:
Negative log-likelihood:
Error:
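With labels y_i in {-1, +1}, the likelihood and the negative log-likelihood error can be written as:

\prod_{i=1}^{N} \sigma\!\left(y_i\, \mathbf{w}^\top \mathbf{x}_i\right), \qquad
E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \ln\!\left(1 + e^{-y_i\, \mathbf{w}^\top \mathbf{x}_i}\right)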
89
Refresher
Derivative:
Partial derivative:
Gradient (derivatives with respect to each component):
Gradient of the error:
This is a vector and we
can compute it at a point.
Chain rule:
90
Hypothesis Space
Weight space/hyperplane.
[graph from T.Mitchell]
91
Math Fact
The gradient of the error:
(a vector in weight space) specifies the direction of the argument that leads to the
steepest increase for the value of the error.
The negative of the gradient gives the direction of the steepest decrease.
Negative gradient (see next slides).
92
Computing the Gradient
Because the gradient is a linear operator.
93
(Batch) Gradient Descent
Compute gradient:
Update weights:
repeat
max #iterations;
marginal error improvement; and
small value for the error.
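A minimal sketch of batch gradient descent for the logistic-regression error above, in Python with NumPy; the learning rate and stopping thresholds are illustrative choices, not values from the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, eta=0.1, max_iters=1000, tol=1e-6):
    # X: (n, d) features; y: labels in {-1, +1}
    n, d = X.shape
    w = np.zeros(d)
    prev_error = np.inf
    for it in range(max_iters):
        margins = y * X.dot(w)
        error = np.mean(np.log(1.0 + np.exp(-margins)))                  # average log-loss
        grad = -np.mean((sigmoid(-margins) * y)[:, None] * X, axis=0)    # gradient of the error
        w -= eta * grad                                                  # step against the gradient
        if abs(prev_error - error) < tol:                                # marginal improvement: stop
            break
        prev_error = error
    return w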
94
Punch Line
classification rule.
The new object is in the class if:
95
Newton’s Method
(plot illustrating Newton's method iterations)
96
Newton-Raphson
97
Robust Risk Minimization
input vector
label
training examples
weight vector
bias
continuous linear model
Prediction rule:
Classification error:
Notation:
98
Robust Classification Loss
Parameter estimation:
Hinge loss:
Robust classification loss:
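For reference, with prediction p = w^T x + b and label y in {-1, +1}, the hinge loss is shown below; the robust classification loss used by RRM is a smoothed/truncated variant of it that is not reproduced here:

L_{\text{hinge}}(p, y) = \max(0,\; 1 - p\,y)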
99
Loss Functions: Comparison
100
Confidence and Regularization
smaller λ corresponds to a larger A.
Confidence
Regularization:
Unconstrained optimization (Lagrange multiplier):
101
Robust Risk Minimization
Go over the training data.
102
Learning Curve
• Plots evaluation metric
against fraction of
training data (on the
same test set!).
• Highest performance
bounded by human inter
annotator agreement
(ITA).
• Leveling off effect that
can guide us how much
data is needed.
(plot: evaluation metric, 0-100, versus percentage of training data used, 0%-100%)
Percentage of data used
for each experiment.
Experiment with 50% of
the training data yields
evaluation number of 70.
103
Summary
• Examples of ML
• Categorization
• Object encoding
• Linear models:
– Perceptron
– Winnow
– Logistic Regression
– RRM
• Engineering aspects of ML systems
104
PART II: POPULARITY
105
Goal
• Quantify how popular an entity is.
Motivation:
• Used in the new local search relevance metric.
106
What is popularity?
107
POPULARITY IN LOCAL SEARCH
108
Popularity
• Output a popularity score (regression)
• Ensemble methods
• Tree base procedure (non-linear)
• Boosting
109
When is a Local Entity Popular?
• Definition:
Visited by many people in the context of alternative choices.
• Is the popularity of restaurants the same as the popularity of
movies, etc.?
• How to operationalize “visit”, “many”, “alternative choices”?
– Initially we are using: popular means clicked more.
• Going forward we will use:
– “visit” = click given an impression.
– “choice” = density of entities in the same primary category.
– “many” = fraction of clicks from impressions. 110
Local Entity Popularity
The model will then be a regression:
111
Not all Clicks are Born the Same
• Click in the context of a named query:
– It can even be argued that we are not satisfying the user's
information needs (and they have to click further to find
what they are looking for).
• Click in the context of a category query:
– Much more significant (especially when alternative results
are present).
112
Local Entity Popularity
• Popularity & 1st page , current ranker.
• Entities without URL.
• Newly created entities.
• Clicks vs. mouseovers.
• Scenario: 50 French restaurants; best entity
has 2k clicks. 2 Italian restaurants; best entity
has 2k clicks. The French entity is more
popular because of higher available choice.
113
Entity Representation
8000 … 4000 65 4.7 73 … 1 …9000
Target | feature values
Machine learning (training) instance
114
POISSON REGRESSION
Why?– We will practice the ML machinery on a different problem, re-iterating the concepts.
Poisson regression is an example of log-linear models good for modeling counts (e.g., number
of visitors to a store in a certain time).
115
Setup
response/outcome
variable
These counts for our scenario are the clicks on the web page.
A good way to model counts of observations is using the Poisson distribution.
explanatory variables
116
Poisson Distribution: Preliminaries
The Poisson distribution realistically describes the pattern of requests over time in many client-server
situations.
Examples are: incoming customers at a bank, calls into a company’s telephone exchange, requests for
storage/retrieval services from a database server, and interrupts to a central processor. It also has higher-
dimensional applications, such as the spatial distribution of defects on integrated circuit wafers and the
volume distribution of contaminants in well water. In such cases, the “events”, which are request arrivals
or defects occurrences, are independent. Customers do not conspire to achieve some special pattern in
their access to a bank teller; rather they operate as independent agents. The manufacture of hard disks
or integrated circuits introduces unavoidable defects because the process pushes the limits of geometric
tolerances. Therefore, a perfectly functional process will still occasionally produce a defect, such as a
small area on the disk surface where the magnetic material is not spread uniformly or a shorted
transistor on an integrated circuit chip. These errors are independent in the sense that a defect at one
point does not influence, for better or worse, the chance of a defect at another point. Moreover, if the
time interval or spatial area is small, the probability of an event is correspondingly small. This is a
characterizing feature of a Poisson distribution: event probability decreases with the window of
opportunity and is linear in the limit. A second characterizing feature, negligible probability of two or
more events in a small interval, is also present in the mentioned examples.
117
Poisson Distribution: Formally
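For reference, the Poisson distribution with rate lambda > 0 assigns to a count y in {0, 1, 2, ...} the probability:

P(Y = y) = \frac{e^{-\lambda} \lambda^{y}}{y!}, \qquad \mathbb{E}[Y] = \operatorname{Var}[Y] = \lambda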
118
Poisson Distribution: Mental Steps
This comes from the theory of Generalized Linear Models (GLM).
log linear combination of the input features.
Hence, the name log-linear model.
119
Poisson Distribution
Data likelihood:
Log-likelihood:
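With the log link lambda_i = exp(w^T x_i), the likelihood and log-likelihood of counts y_1, ..., y_N can be written as:

L(\mathbf{w}) = \prod_{i=1}^{N} \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}, \qquad
\ell(\mathbf{w}) = \sum_{i=1}^{N} \left[ y_i\, \mathbf{w}^\top \mathbf{x}_i - e^{\mathbf{w}^\top \mathbf{x}_i} - \ln(y_i!) \right]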
120
Maximizing the Log-Likelihood
121
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
– Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
– Classification Decision Trees, Regression Trees
• Boosting
– AdaBoost
• Ranking evaluation
– Kendall tau and Spearman’s coefficient
• Sequence labeling
– Hidden Markov Models (HMMs)
122
DECISION TREES
Why?– DTs are an influential development in ML. Combined in ensemble they provide very competitive
performance. We will see ensemble techniques in the next part.
123
Decision Trees
Binary partitioning of the data during training
(navigating to leaf node during testing).
prediction
Training instances are
more homogeneous
in terms of the output variable
(more pure) compared to
ancestor nodes.
Stopping when instances
are homogeneous or
small number of instances.
Training instances.
Color reflects output variable
(classification example).
124
Decision Tree: Example
Parents
Visiting
Weather
Money
Cinema
CinemaShopping
Stay in
PoorRich
RainyWindySunny
NoYes
Play
tennis
Attribute/feature/predicate
Value of the attribute
Predicted classes.
(classification example with categorical features)
Branching factor depends on
the number of possible values
for the attribute (as seen in the
training set).
125
Entropy (needed for describing how an attribute is selected.)
Example:
(plot: binary entropy H(p) as a function of p, 0 <= p <= 1, rising from 0 to 1 bit at p = 0.5 and back to 0)
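For reference, the entropy of a set S with classes occurring in proportions p_1, ..., p_k, and its binary special case plotted above, are:

H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i, \qquad
H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)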
126
Selecting an Attribute: Information Gain
Measure of expected reduction in entropy.
instances attribute
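The standard definition, for a set of instances S and attribute A with values v, is:

\text{Gain}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\, H(S_v)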
See Mitchell’97, p.59 for an example.
127
Splitting ‘Hairs’
?
If there are only a small number
of instances do not split the node
further (statistics are unreliable).
If there are no instances in the
current node, inherit statistics
(majority class) from parent
node.
If there is more training data, the
tree can be “grown” bigger.
128
ID3 Algorithm
129
Alternative Attribute Selection:
Gain Ratio
instances attribute
[Quinlan 1986]
Examples:
all different values.
130
Alternative Attribute Selection:
GINI Index [Corrado Gini: Italian statistician]
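For reference, the Gini index of a node whose instances have class proportions p_1, ..., p_k is:

\text{Gini} = 1 - \sum_{i=1}^{k} p_i^2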
131
Space of Possible Decision Trees
Number of possible trees:
132
Decision Trees and Rule Systems
Path from each leaf node to the root represents a conjunctive rule:
Cinema
CinemaShopping
Stay in
PoorRich
RainyWindySunny
NoYes
Play
tennis
if (ParentsVisiting==No) &
(Weather==Windy) &
(Money==Poor)
then
Cinema.
Parents
Visiting
Weather
Money
133
Decision Trees
• Different training sample -> different resulting
tree (different structure).
• Learning does (conditional) feature selection.
134
Regression Trees
Like classification trees but the
prediction is a number
(as suggested by “regression”).
1. How do we split?
2. When to stop?
predictions
(constants)
135
Regression Trees: How to Split
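A common way to choose the split (the standard CART criterion, stated here for reference) is to pick the feature j and threshold s that minimize the summed squared error of the two resulting regions, each predicted by its mean:

\min_{j, s} \left[ \sum_{x_i \in R_1(j, s)} (y_i - \bar{y}_{R_1})^2 + \sum_{x_i \in R_2(j, s)} (y_i - \bar{y}_{R_2})^2 \right]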
136
Regression Trees: Pruning
Tree operation where a pre-terminal gets its two leaves collapsed:
137
Regression Trees: How to Stop
1. Don’t stop.
2. Build big tree.
3. Prune.
4. Evaluate sub-trees.
138
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
– Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
– Classification Decision Trees, Regression Trees
• Boosting
– AdaBoost
• Ranking evaluation
– Kendall tau and Spearman’s coefficient
• Sequence labeling
– Hidden Markov Models (HMMs)
139
BOOSTING
140
ENSEMBLE
Ensemble Methods
object encoded with features
classifiers
predictions
(response/dependent
variable)
…
…
majority voting/averaging
141
Where the Systems Come from
Sequential ensemble scheme:
…
142
Contrast with Bagging
Non-sequential ensemble scheme:
DATA
The Data_i are independent of each other (likewise for the System_i).
143
Base Procedure:
Decision Tree
SystemData
Binary partitioning of the data during training
(navigating to leaf node during testing).
prediction
Training instances are
more homogeneous
in terms of the output variable
(more pure) compared to
ancestor nodes.
Stopping when instances
are homogeneous or
small number of instances.
Training instances.
Color reflects output variable
(classification example).
144
TRAINING DATA
Ensemble Scheme
base procedure
base procedure
Original data
base procedure
Weighted data
base procedure
Weighted data
Final prediction (regression)
Small systems.
Don’t need to be
perfect.
145
AdaBoost (classification)
Original data
Weighted data
Weighted data
Weighted data
normalizing factor.
final prediction.
146
AdaBoost
Initializing weights.
normalizing factor.
final prediction.
weight update.
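A minimal Python sketch of the AdaBoost loop, assuming labels y in {-1, +1} and a caller-supplied train_weak(X, y, w) routine that fits a weak learner on weighted data and returns a prediction function; this illustrates the scheme, it is not the production trainer described later:

import numpy as np

def adaboost(X, y, train_weak, T=50):
    # y: labels in {-1, +1}; train_weak(X, y, w) -> h, where h(X) predicts in {-1, +1}
    n = len(y)
    w = np.ones(n) / n                       # start with uniform example weights
    hypotheses, alphas = [], []
    for t in range(T):
        h = train_weak(X, y, w)              # weak learner trained on weighted data
        pred = h(X)
        eps = np.sum(w[pred != y])           # weighted training error
        if eps >= 0.5:                       # no better than chance: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))
        w = w * np.exp(-alpha * y * pred)    # up-weight misclassified examples
        w = w / w.sum()                      # normalize (the Z_t factor)
        hypotheses.append(h)
        alphas.append(alpha)
    def final(X_new):                        # weighted-majority final prediction
        return np.sign(sum(a * h(X_new) for a, h in zip(alphas, hypotheses)))
    return final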
147
Binary Classifier
• Constraint:
– Must not have all zero clicks for current week, previous week and week before last
[shopping team uses stronger constraint: only instances with non-zero clicks for
current week].
• Training:
– 1.5M instances.
– 0.5M instances (validation).
• Feature extraction:
– 4.82mins (Cosmos job).
• Training time:
– 2hrs 20mins.
• Testing:
– 10k instances: 1sec.
148
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
– Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
– Classification Decision Trees, Regression Trees
• Boosting
– AdaBoost
• Ranking evaluation
– Kendall tau and Spearman’s coefficient
• Sequence labeling
– Hidden Markov Models (HMMs)
149
POPULARITY
EVALUATION
How do we know
we have a good popularity?
150
Rank Correlation Metrics
The two rankings are the same.
The two rankings are reverse of each other.
Actual input is a set of objects with two rank scores (ties are possible). 151
Kendall’s Tau Coefficient
Considers concordant/discordant pairs in two
rankings (each ranking w.r.t. the other):
152
What is a concordant pair?
a a
b c
c b
Need to have the same sign
153
Kendall Tau: Example
A
B
C
D
C
D
A
B
Pairs:
(discordant pairs in red):
Observation: Total number of discordant pairs = 2x the discordant pairs in one ranking w.r.t. the other.
154
Spearman’s Coefficient
Considers ranking differences for the same object:
a a
b c
c b
Example:
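A minimal Python sketch of both coefficients for rankings without ties, where r1 and r2 map each object to its rank (1 = best); the ties handling discussed later is not implemented here:

from itertools import combinations

def kendall_tau(r1, r2):
    # Count concordant vs. discordant pairs across the two rankings.
    items = list(r1)
    concordant = discordant = 0
    for a, b in combinations(items, 2):
        if (r1[a] - r1[b]) * (r2[a] - r2[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(items)
    return float(concordant - discordant) / (n * (n - 1) / 2)

def spearman_rho(r1, r2):
    # Based on squared rank differences per object.
    n = len(r1)
    d2 = sum((r1[o] - r2[o]) ** 2 for o in r1)
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# For the A,B,C,D vs. C,D,A,B example above: tau = -1/3, rho = -0.6
r1 = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
r2 = {'A': 3, 'B': 4, 'C': 1, 'D': 2}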
155
Rank Intuitions: Setup
The sequence <3,1,4,10,5,9,2,6,8,7> is sufficient to encode the two rankings.
(figure: the objects in positions 1-10 of ranking R1 connected to their positions in ranking R2)
156
Rank Intuitions: Pairs
Rankings in complete agreement.
Rankings in complete disagreement.
157
Rank Intuitions: Spearman
Segment lengths represent R1 rank scores. 158
Rank Intuitions: Kendall
Segment lengths represent R1 rank scores. 159
What about ties?
The position of an object within set of objects with the
same scores in the rankings affects the rank correlation.
For example, red positioning of oj leads to lower Spearman’s coefficient; green – higher.
160
Ties
• Kendall: Strict discordance:
• Spearman:
– Can use per entity upper and lower bounds.
– Do as in the Olympics:
161
Ties: Kendall TauB
http://en.wikipedia.org/wiki/Kendall_tau#Tau-b
where:
is the number of concordant pairs.
is the number of discordant pairs.
is the number of objects in the two rankings.
162
Uses of popularity
Popularity can be used to augment gain in NDCG by linearly scaling it:
rating: 1 (poor)   2 (fair)   3 (good)   4 (excellent)   5 (perfect)
gain:   1          3          7          15              31
163
Next Steps
• How to determine popularity of new entities
– Challenge: No historical data.
– Usually there is an initial period of high popularity
(e.g., a new restaurant is featured in local
paper, promotions, etc.).
• Good abandonment (no user clicks but good
entity in terms of satisfying the user
information needs, e.g., phone number).
– Use number impressions for named queries.
164
References
1. Yaser S. Abu-Mostafa, Malik Magdon-Ismail & Hsuan-Tien Lin (2012) Learning From Data. AMLBook. [link]
2. Ethem Alpaydin (2009) Introduction to Machine Learning. 2nd edition. Adaptive Computation and Machine Learning series. MIT Press.
[link]
3. David Barber (2012) Bayesian Reasoning and Machine Learning. Cambridge University Press. [link]
4. Ricardo Baeza-Yates & Berthier Ribeiro-Neto (2011) Modern Information Retrieval: The Concepts and Technology behind Search. 2nd
Edition. ACM Press Books. [link]
5. Alex Bellos (2010) Alex's Adventures in Numberland. Bloomsbury: New York. [link]
6. Ron Bekkerman, Mikhail Bilenko & John Langford (2011) Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge
University Press. [link]
7. Christopher M. Bishop (2007) Pattern Recognition and Machine Learning. Information Science and Statistics. Springer. [link]
8. George Casella & Roger L. Berger (2001) Statistical Inference. 2nd edition. Duxbury Press. [link]
9. Anirban DasGupta (2011) Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer Texts in Statistics.
Springer. [link]
10. Luc Devroye, László Györfi & Gábor Lugosi (1996) A Probabilistic Theory of Pattern Recognition. Springer. [link]
11. Richard O. Duda, Peter E. Hart & David G. Stork (2000) Pattern Classification. 2nd Edition. Wiley-Interscience. [link]
12. Trevor Hastie, Robert Tibshirani & Jerome Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
2nd Edition. Springer Series in Statistics. Springer. [link]
13. James L. Johnson (2008) Probability and Statistics for Computer Science. Wiley-Interscience. [link]
14. Daphne Koller & Nir Friedman (2009) Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine
Learning series. MIT Press. [link]
15. David J. C. MacKay (2003) Information Theory, Inference and Learning Algorithms. Cambridge University Press. [link]
16. Zbigniew Michalewicz & David B. Fogel (2004) How to Solve It: Modern Heuristics. 2nd edition. Springer. [link]
17. Tom M. Mitchell (1997) Machine Learning. McGraw-Hill (Science/Engineering/Math). [link]
18. Mehryar Mohri, Afshin Rostamizadeh & Ameet Talwalkar (2012) Foundations of Machine Learning. Adaptive Computation and Machine
Learning series. MIT Press. [link]
19. Lior Rokach (2010) Pattern Classification Using Ensemble Methods. World Scientific. [link]
20. Gilbert Strang (1991) Calculus. Wellesley-Cambridge Press. [link]
21. Larry Wasserman (2010) All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer. [link]
22. Sholom M. Weiss, Nitin Indurkhya & Tong Zhang (2010) Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer. [link]
165
Roadmap
• Examples of applications of Machine Learning
• Encoding objects with features
• The Machine Learning framework
• Linear models
– Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
– Classification Decision Trees, Regression Trees
• Boosting
– AdaBoost
• Ranking evaluation
– Kendall tau and Spearman’s coefficient
• Sequence labeling
– Hidden Markov Models (HMMs)
166
SEQUENCE LABELING:
HIDDEN MARKOV MODELS (HMMs)
167
168
Outline
• The guessing game
• Tagging preliminaries
• Hidden Markov Models
• Trellis and the Viterbi algorithm
• Implementation (Python)
• Complexity of decoding
• Parameter estimation and smoothing
• Second order models
169
The Guessing Game
• A cow and duck write an email message together.
• Goal – figure out which word is written by which animal.
The cow/duck illustration of HMMs is due to Ralph Grishman (NYU).
170
What’s the Big Deal ?
• The vocabularies of the cow and the duck can
overlap and it is not clear a priori who wrote a
certain word!
171
The Game (cont)
? ?
moo hello
?
quack
COW ?
moo hello
DUCK
quack
The Game (cont)
COW COW
moo hello
DUCK
quack
DUCK
172
What about the Rest of the Animals?
ZEBRA ZEBRA
word1 word2
ZEBRA
word3
PIG
ZEBRA
word4
ZEBRA
word5
PIG
DUCK
COW
ANT
DUCK
COW
ANT
PIG
DUCK
COW
ANT
PIG
DUCK
COW
ANT
PIG
DUCK
COW
ANT
173
A Game for Adults
• Instead of guessing which animal is associated
with each word guess the corresponding POS
tag of a word.
Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/,
will/MD join/VB the/DT board/NN as/IN
a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.
174
175
POS Tags
"CC", "CD", "DT", "EX", "FW",
"IN", "JJ", "JJR", "JJS", "LS",
"MD","NN", "NNS","NNP", "NNPS",
"PDT", "POS", "PRP", "PRP$", "RB",
"RBR", "RBS", "RP", "SYM", "TO",
"UH", "VB", "VBD", "VBG", "VBN",
"VBP", "VBZ", "WDT", "WP", "WP$",
"WRB", "#", "$", ".",",",
":", "(", ")", "`", "``",
"'", "''"
176
Tagging Preliminaries
• We want the best set of tags for a sequence of words
(a sentence)
• W — a sequence of words
• T — a sequence of tags
\hat{T} = \arg\max_T P(T \mid W)
177
Bayes’ Theorem (1763)
P(T \mid W) = \frac{P(W \mid T)\, P(T)}{P(W)}
posterior = (likelihood × prior) / marginal likelihood
Reverend Thomas Bayes — Presbyterian minister (1702-1761)
178
Applying Bayes’ Theorem
• How do we approach P(T|W) ?
• Use Bayes’ theorem
• So what? Why is it better?
• Ignore the denominator (and the question):
\arg\max_T P(T \mid W) = \arg\max_T \frac{P(W \mid T)\, P(T)}{P(W)} = \arg\max_T P(W \mid T)\, P(T)
179
Tag Sequence Probability
• Count the number of times a sequence occurs
and divide by the number of sequences of that
length — not likely!
– Use chain rule
How do we get the probability P(T)
of a specific tag sequence T?
180
P(T) is a product of the probability of the N-grams
that make it up
Make a Markov assumption: the current tag
depends on the previous one only:
Chain Rule
P(T) = P(t_1, \ldots, t_n)
     = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1, t_2) \cdots P(t_n \mid t_1, \ldots, t_{n-1})
(the last factor conditions on the full history)
P(t_1, \ldots, t_n) = P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1})
181
• Use counts from a large hand-tagged corpus.
• For bi-grams, count all the ti–1 ti pairs
• Some counts are zero – we’ll use smoothing to address
this issue later.
Transition Probabilities
P(t_i \mid t_{i-1}) = \frac{c(t_{i-1}\, t_i)}{c(t_{i-1})}
182
What about P(W|T) ?
• First it's odd—it is asking the probability of seeing “The white
horse” given “Det Adj Noun”!
– Collect up all the times you see that tag sequence and see how often “The
white horse” shows up …
• Assume each word in the sequence depends only on its
corresponding tag:
P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid t_i)
(emission probabilities)
183
Emission Probabilities
• What proportion of times is the word wi associated with
the tag ti (as opposed to another word):
P(w_i \mid t_i) = \frac{c(w_i, t_i)}{c(t_i)}
184
The “Standard” Model
\hat{T} = \arg\max_T P(T \mid W)
        = \arg\max_T \frac{P(W \mid T)\, P(T)}{P(W)}
        = \arg\max_T P(W \mid T)\, P(T)
        = \arg\max_T \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
Hidden Markov Models
• Stochastic process:
A sequence X_1, X_2, … of random variables
based on the same sample space Ω.
• Probabilities for the first observation: P(X_1 = x_j) for each outcome x_j
• Next step given the previous history: P(X_{t+1} = x_{i_{t+1}} \mid X_1 = x_{i_1}, \ldots, X_t = x_{i_t})
185
• A Markov Chain is a stochastic process with the Markov
property:
• Outcomes are called states.
• Probabilities for next step – weighted finite state
automata.
186
Markov Chain
P(X_{t+1} = x_{i_{t+1}} \mid X_1 = x_{i_1}, \ldots, X_t = x_{i_t}) = P(X_{t+1} = x_{i_{t+1}} \mid X_t = x_{i_t})
187
State Transitions w/ Probabilities
START
END
COW
DUCK
1.0
0.2
0.2
0.3 0.3
0.5
0.5
188
Markov Model
Markov chain
where each state
can output signals
(like “Moore machines”):
START
END
COW
DUCK
1.0
0.2
0.2
0.3 0.3
0.5
0.5
moo:0.9
hello:0.1
hello:0.4
quack:0.6
$:1.0^:1.0
189
The Issue Was
• A given output symbol can potentially
be emitted by more than one state —
omnipresent ambiguity in natural language.
Markov Model
190
Finite set of states: S = \{s_1, \ldots, s_m\}
Signal alphabet: \Sigma = \{\sigma_1, \ldots, \sigma_k\}
Transition matrix: P = [p_{ij}] where p_{ij} = P(s_j \text{ at } t+1 \mid s_i \text{ at } t)
Emission probabilities: A = [a_{ij}] where a_{ij} = P(\sigma_j \mid s_i)
Initial probability vector: v = [v_1, \ldots, v_m] where v_j = P(s_j \text{ at } t = 1)
191
Graphical Model
STATE TAG
OUTPUT word
…
• A Markov Model for which it is not possible to observe
the sequence of states.
• S: unknown — sequence of states
• O: known — sequence of observations
192
Hidden Markov Model
S^* = \arg\max_S P(S \mid O)
(states correspond to tags; observations correspond to words)
193
The State Space
START END
COW
DUCK
1.0
0.0
0.2
0.2
0.3
0.3
0.5
0.5
moo:0.9
hello:0.1
hello:0.4
quack:0.6
moo hello quack
COW
DUCK
0.3
0.3
0.5
0.5
COW
DUCK
More on how the probabilities come about (training) later.
194
Optimal State Sequence:
The Viterbi Algorithm
We define the joint probability of the most likely sequence from
time 1 to time t ending in state si and the observed sequence O≤t
up to time t:
\delta_t(i) = \max_{S_{t-1}} P(S_t = s_i, S_{t-1};\, O_{\le t})
            = \max_{s_{i_1}, \ldots, s_{i_{t-1}}} P(s_{i_1}, \ldots, s_{i_{t-1}}, S_t = s_i;\, O_{\le t})
195
Key Observation
The most likely partial derivation leading to state si at
position t consists of:
– the most likely partial derivation leading to some state sit-1
at the previous position t-1,
– followed by the transition from sit-1 to si.
Note:
\delta_1(i) = v_i\, a_{i k_1}, \quad \text{where } v_i = P(s_i) \text{ and } a_{i k_1} = P(\sigma_{k_1} \mid s_i)
We will show that:
196
Viterbi (cont)
\delta_{t+1}(j) = \max_i \big[\, \delta_t(i)\, p_{ij} \,\big]\, a_{j k_{t+1}}
197
\delta_{t+1}(j) = \max_{S_t} P(S_{t+1} = s_j, S_t;\, O_{\le t+1})
= \max_{S_t} P(S_{t+1} = s_j, O_{t+1} = \sigma_{k_{t+1}} \mid S_t;\, O_{\le t})\; P(S_t;\, O_{\le t})
= \max_i \max_{S_{t-1}} P(s_j \mid s_i)\, P(\sigma_{k_{t+1}} \mid s_j)\, P(S_t = s_i, S_{t-1};\, O_{\le t})
= \max_i \big[\, p_{ij}\, \delta_t(i) \,\big]\, a_{j k_{t+1}}
Recurrence Equation
\delta_{t+1}(j) = a_{j k_{t+1}} \max_i \big[\, \delta_t(i)\, p_{ij} \,\big], \quad \text{where } a_{j k_{t+1}} = P(\sigma_{k_{t+1}} \mid s_j)
• The predecessor of state s_i in the path corresponding to
\delta_t(i):
• Optimal state sequence:
198
Back Pointers
\psi_{t+1}(j) = \arg\max_{1 \le i \le m} \big(\delta_t(i)\, p_{ij}\big)
s_n^* = \arg\max_{1 \le i \le m} \delta_n(i)
s_t^* = \psi_{t+1}(s_{t+1}^*) \quad \text{for } t = n-1, \ldots, 1
199
The Trellis
Viterbi values δ_t(state) for the observation sequence ^ moo hello quack $:

           t=0 (^)   t=1 (moo)   t=2 (hello)   t=3 (quack)   t=4 ($)
START      1         0           0             0             0
COW        0         0.9         0.045         0             0
DUCK       0         0           0.108         0.0324        0
END        0         0           0             0             0.00648

(0.0081 is the discarded alternative for DUCK at t=3, reached from COW: 0.045 × 0.3 × 0.6.)
200
Implementation (Python)
observations = ['^','moo','hello','quack','$'] # signal sequence
states = ['start','cow','duck','end']
# Transition probabilities - p[FromState][ToState] = probability
p = {'start': {'cow':1.0},
'cow': {'cow' :0.5,
'duck':0.3,
'end' :0.2},
'duck': {'duck':0.5,
'cow' :0.3,
'end' :0.2}}
# Emission probabilities; special emission symbol '$' for 'end' state
a = {'cow' : {'moo' :0.9,'hello':0.1, 'quack':0.0, '$':0.0},
'duck': {'moo' :0.0,'hello':0.4, 'quack':0.6, '$':0.0},
'end' : {'moo' :0.0,'hello':0.0, 'quack':0.0, '$':1.0}}
201
Implementation (Viterbi)
T = len(observations)
# Viterbi table; v[state][time] = best path probability so far
v = dict((s, [0.0] * T) for s in states)
# Back pointers; backPointer[state][time] = best previous state
backPointer = dict((s, [""] * T) for s in states)
v['start'][0] = 1.0
for t in range(T-1): # populate column t+1 of v
    for s in states[:-1]: # 'end' state has no outgoing transitions
        # only consider 'start' state at time 0
        if t == 0 and s != 'start': continue
        for s1 in p[s].keys(): # s1 is the next state
            newScore = v[s][t] * p[s][s1] * a[s1][observations[t+1]]
            if v[s1][t+1] == 0.0 or newScore > v[s1][t+1]:
                v[s1][t+1] = newScore
                backPointer[s1][t+1] = s
202
Implementation (Best Path)
# Now recover the optimal state sequence by following the back pointers
state = 'end'
state_sequence = [ state ]
for t in range( T-1, 0, -1 ):
    state = backPointer[ state ][ t ]
    state_sequence = [ state ] + state_sequence
print "Observations....: ", observations
print "Optimal sequence: ", state_sequence
203
Complexity of Decoding
• O(m2n) — linear in n (the length of the string)
• Initialization: O(mn)
• Back tracing: O(n)
• Next step: O(m2)
for current_state in s1..sm # at time t+1
for prev_state in s1..sm # at time t
compute value
compare with best_so_far
• There are n next steps.
204
Parameter Estimation for HMMs
• Need annotated training data (Brown, PTB).
• Signal and state sequences both known.
• Calculate observed relative frequencies.
• Complications — sparse data problem (need for smoothing).
• One can use only raw data too — Baum-Welch (forward-
backward) algorithm.
205
Optimization
• Build vocabulary of possible tags for words
• Keep total counts for words
• If a word occurs frequently (count > threshold) consider its tag set
exhaustive
• For frequent words only consider its tag set (vs. all tags)
• For unknown words don’t consider tags corresponding to closed
class words (e.g., DT)
206
Applications Using HMMs
• POS tagging (as we have seen).
• Chunking.
• Named Entity Recognition (NER).
• Speech recognition.
207
Exercises
• Implement the training (parameter estimation).
• Use a dictionary of valid tags for known words to constrain
which tags are considered for a word.
• Implement a second-order model.
• Implement the decoder in Ruby.
208
Some POS Taggers
• Alias-I: http://www.alias-i.com/lingpipe
• AUTASYS: http://www.phon.ucl.ac.uk/home/alex/project/tagging/tagging.htm
• Brill Tagger: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z
• CLAWS: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
• Connexor: http://www.connexor.com/software/tagger
• Edinburgh (LTG): http://www.ltg.ed.ac.uk/software/pos/index.html
• FLAT (Flexible Language Acquisition Tool): http://lanaconsult.com
• fnTBL: http://nlp.cs.jhu.edu/~rflorian/fntbl/index.html
• GATE: http://gate.ac.uk
• Infogistics: http://www.infogistics.com/posdemo.htm
• Qtag: http://www.english.bham.ac.uk/staff/omason/software/qtag.html
• SNoW: http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=POS
• Stanford: http://nlp.stanford.edu/software/tagger.shtml
• SVMTool: http://www.lsi.upc.edu/~nlp/SVMTool
• TNT: http://www.coli.uni-saarland.de/~thorsten/tnt
• Yamcha: http://chasen.org/~taku/software/yamcha/
209
References
1. Brants, Thorsten. 2000. TnT – A Statistical Part-of-speech Tagger. 6th Applied NLP Conference (ANLP-2000),
224-231, Seattle, U.S.A.
2. Charniak, Eugene, Curtis Hendrickson, Neil Jacobson & Mike Perkowitz. 1993. Equations for part-of-speech
tagging. 11th National Conference on Artificial Intelligence, 784-789. Menlo Park: AAAI Press/MIT.
3. Krenn, Brigitte & Christer Samuelsson. 1997. Statistical Methods in Computational Linguistics, ESSLLI
Summer school Lecture Notes from, 11-22 August, Aix-en-Provence, France.
4. Rabiner, Lawrence R. 1989. A tutorial on Hidden Markov Models and selected applications in speech
recognition, Proceedings of the IEEE, vol. 77, 256-286.
5. Samuelsson, Christer. 2000. Extending N-gram tagging to word graphs, Recent Advances in Natural
Language Processing II, ed. by Nicolas Nicolov & Ruslan Mitkov. Current Issues in Linguistic Theory (CILT),
vol. 189, pp 3-20. John Benjamins: Amsterdam/Philadelphia.
6. Shin, Jung Ho, Young S. Han & Key-Sun Choi. 1997. A HMM part-of-speech tagger for Korean with
wordphrasal relations. Recent Advances in Natural Language Processing, ed. by Nicolas Nicolov & Ruslan
Mitkov. Current Issues in Linguistic Theory (CILT) vol 136, pp 439-450. John Benjamins:
Amsterdam/Philadelphia.
Statistics Refresher
• Outcome: Individual atomic results of a (non-deterministic) experiment.
• Event: A set of results.
• Probability: Limit of target outcome over number of experiments (frequentist
view) or degree of belief (Bayesian view).
• Normalization condition: Probabilities for all outcomes sum to 1.
• Distribution: Probabilities associated with each outcome.
• Random variable: Mapping of the outcomes to real numbers.
• Joint distributions: Conducting several (possibly related) experiments and
observing the results. Joint distribution states the probability for a combination of
values of several random variables.
• Marginal: Finding the distribution of a random variable from a joint distribution.
• Conditional probability (Bayes’ rule): Knowing the value of one variable constrains
the distribution of another.
• Probability density functions: Probability that a continuous variable is in a certain
range.
• Probabilistic reasoning: Introduce evidence (set certain variables) and compute
probabilities of interest (conditioned on this evidence).
210
Definitions
Expectation:
Mode:
Variance:
Expectation of a function:
Properties:
211
Intuitions about Scale
Weight in grams if the Earth were to be a black hole.
Age of the universe in seconds.
Number of cells in the human body (100 trillion).
Number of neurons in the human brain.
Standard Blu-ray disc size, XL 4 layer (128GB).
One year in seconds.
Items in the Library of Congress (largest in the world).
Length of the Nile in meters (longest river).
212
Acknowledgements
• Bran Boguraev
• Chris Brew
• Jinho Choi
• William Headden
• Jingjing Li
• Jason Kessler
• Mike Mozer
• Shumin Wu
• Tong Zhang
• Amir Padovitz
• Bruno Bozza
• Kent Cedola
• Max Galkin
• Manuel Reyes Gomez
• Matt Hurst
• John Langford
• Priyank Singh
213

Ml5

  • 1. Machine Learningwith Applications in Categorization, Popularity and Sequence labeling (linear models, decision trees, ensemble methods, evaluation) Dr. Nicolas Nicolov <1st_last@yahoo.com>
  • 2. Goals • Introduce important ML concepts • Illustrate ML techniques through examples in: – Categorization – Popularity – Sequence labeling (tutorial aims to be self-contained and to explain the notation) 2
  • 3. Outline • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 3
  • 4. EXAMPLES OF MACHINE LEARNING Why?– Get a flavor of the diversity of areas where ML is applied. 4
  • 5. Sequence Labeling George W. Bush discussed Iraq GPEXPER_ _PER_ _PER <PER>George W. Bush</PER> discussed <GPE>Iraq</GPE> George W. Bush discussed Iraq Geo-Political Entity (like search query analysis) 5
  • 6. Spam www.dietsthatwork.com www . dietsthatwork . com www . diets that work . com SPAM! further segmentation classification 6
  • 7. Tokenization What!?I love the iphone:-) What !? I love the iphone :-) How difficult can that be? — 98.2% [Zhang et al. 2003] NO TRESSPASSING VIOLATORS WILL BE PROSECUTED 7
  • 8. NL Parsing Unlike my sluggish Chevy the Audi handles the winding mountain roads superbly PREP POSS MOD DET SUBJ DET MOD MOD MANR DOBJ CONTR syntactic structure 8
  • 9. State Transitions λ β λ β λ β λ β λ β λ λ λ λ LEFTARC: RIGHTARC: NOARC: SHIFT: using ML to make the decision which action to take 9
  • 10. Two Ladies in a Men’s Club 10
  • 11. We serve men IOBJ We serve men DOBJSUBJ SUBJ We serve food to men. We serve our community. serve —IndirectObject men We serve organic food. We serve coffee to connoiseurs. serve —DirectObject men 11
  • 12. Audi is an automaker that makes luxury cars and SUVs. The company was born in Germany . It was established by August Horch in 1910. Horch had previosly founded another company and his models were quite popular. Audi started with four cylinder models. By 1914, Horch 's new cars were racing and winning. August Horch left the Audi company in 1920 to take a position as an industry representative for the German motor vehicle industry federation. Currently Audi is a subsidiary of the Volkswagen group and produces cars of outstanding quality. Coreference 12
  • 13. Parts of Objects (Meronymy) […] the interior seems upscale with leatherette upholstery that looks and feels better than the real cow hide found in more expensive vehicles, a dashboard accented by textured soft-touch materials, a woven mesh headliner, and other materials that give the New Beetle’s interior a sense of quality. […] Finally, and a big plus in my book, both front seats were height adjustable, and the steering column tilted and telescoped for optimum comfort. 13
  • 14. Sentiment Analysis I love pineapple nearly as much as I hate bananas. POSITIVE sentiment regarding topic pineapple. Xbox Xbox Positive Negative 14
  • 15. Chinese Sentiment Car aspects Sentiment categories Sentence 15
  • 16. 16
  • 17. 17
  • 18. Categorization • High-level task: – Given a restaurant what is its restaurant sub-category? • Encoding entities with features • Feature selection • Linear models non-standard order “Though this be madness, yet there is method in't.” 18
  • 19. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 19
  • 20. ENCODING OBJECTS WITH FEATURES Why?– ML algorithms are “generic”; most of them are cast as solutions around vector encodings of the domain objects. Regardless of the ML algorithm we will need to represent/encode the domain objects as feature vectors. How well we do this (the quality of features) directly impacts system performance. 20
  • 21. Flat Object Encoding 1 0 0 1 1 1 0 1 …37 Machine learning (training) instance/example/observation. Can be a set; object can belong to several classes. Number of features can be millions. 21
  • 22. Structured Objects to Strings to Features a b c d e Structured object: f1 f2 f3 f4 f5 f6 “f2:f4>a” “f2:f4>b” “f2:f4>c” … “f2:f4>a_b” “f2:f4>b_c” “f2:f4>c_d” … “f2:f4>a_b_c” “f2:f4>b_c_d” uni-grams bi-grams tri-grams Feature string Feature index *DEFAULT* 0 … … f2:f4>a 100 f2:f4>b 101 f2:f4>c 102 … … f2:f4>a_b 105 f2:f4>b_c 106 f2:f4>c_d 107 … … f2:f4>a_b_c 109 Read as field “f2:f4” contains feature “a”. Table can be quite large. 22
  • 23. Sliding Window (bi-grams) SkyCity at the Space Needle SkyCity at the Space Needle^ $ add initial “^” and final “$” tokens SkyCity at the Space Needle^ $ SkyCity at the Space Needle^ $ SkyCity at the Space Needle^ $ SkyCity at the Space Needle^ $ sliding window 23
  • 24. Example: Feature Templates public static List<string> NGrams( string field ) { var featutes = new List<string>(); string[] tokens = field.Split( spaceCharArr, System.StringSplitOptions.RemoveEmptyEntries ); featutes.Add( string.Join( "", field.Split(SPLIT_CHARS) ) ); // the entire field string unigram = string.Empty, bigram, previous1 = "^", previous2 = "^", trigram; for (int i = 0; i < tokens.Length; i++) { unigram = tokens[ i ]; featutes.Add(unigram); bigram = previous1 + "_" + unigram; featutes.Add( bigram ); if ( i >= 1 ) { trigram = previous2 + "_" + bigram; featutes.Add( trigram ); } previous2 = previous1; previous1 = unigram; } featutes.Add( unigram + "_$" ); featutes.Add( bigram + "_$" ); return result; } initial tri-gram is: "^_tokens[0]_tokens[1] " initial bigram is “^_tokens*0]" last trigram is “tokens*tokens.Length-2]_tokens[tokens.Length-1]_$" could add field name as argument and prefix all features 24
  • 25. The Art of Feature Engineering: Disjunctive Features • Useful feature = triggers often and with a particular class. • Rarely occurring (but indicative of a class) features can be combined in a disjunction. This results in: – Need for less data to achieve good performance. – Final system performance (with all available data) is higher. • How can we get insights about such features: Error analysis! Regex ITALIAN_FOOD = new Regex(@"al dente|agnello|alfredo|antipasti|antipasto|arrabbiata|bistecca|bolognese| branzino|caprese|carbonara|carpaccio|cioppino|cozze|fettuccine|filetto|focaccia|frutti di mare|funghi| gnocchi|gorgonzola|insalata|lasagna|linguine|linguini|macaroni|minestrone|mozzarella|ossobuco|panini| panino| parmigiana|pasticcio|pecorino|penne|pepperoncini|pesce|pesto|piatti|piatto|piccata|polpo|pomodori|prosciutto| radicchio|ravioli|ricotta|rigatoni|risotto|saltimbocca|scallopini|scaloppini|spaghetti|tagliatelle|tiramisu| tortellini|vitello|vongole"); if (ITALIAN_FOOD.Match(entity.description).Success) features.Add("Italian_Food_Matched_Description"); Up to us how we call the feature.Triggering of the feature. 25
  • 26. instance( class= 7, features=[0,300857,100739,200441,...]) instance( class=99, features=[0,201937,196121,345758,13,...]) instance( class=42, features=[0,99173,358387,1001,1,...]) ... Generic Nature of ML Systems human sees computer “sees” Default feature always triggers. Number of features that trigger for individual instances are often not the same. Indices of (binary) features that trigger. 26
  • 28. Feature Selection • Templates: powerful way to get lots of features. • We get too many features. • Danger of overfitting. • Feature selection: – CountCutOff. – TFxIDF. – Mutual information. – Information gain. – Chi square. Doing well on seen data but poorly on unseen data. e.g., 20M for dependency parsing. Automatic ways of finding discriminative features. We will examine in detail the implementation of this. 28
  • 30. Information Gain Balances effects of feature triggering for an object with the effects of feature being absent for an object. 30
  • 31. Chi Square float Chi2(int a, int b, int c, int d) { return (a+b+c+d)* ((a*d-b*c)^2) / ((a+b)*(a+c)*(c+d)*(b+d)); } 31
  • 32. Exponent(Log) Trick While the final output may not be big intermediate results are. Solution: float Chi2(int a, int b, int c, int d) { return (a+b+c+d) * ((a*d-b*c)^2) / ((a+b)*(a+c)*(c+d)*(b+d)); } float Chi2_v2(int a, int b, int c, int d) { double total = a + b + c + d; double n = Math.Log(total); double num = 2.0 * Math.Log(Math.Abs((a * d) - (b * c))); double den = Math.Log(a + b) + Math.Log(a + c) + Math.Log(c + d) + Math.Log(b + d); return (float) Math.Exp(n+num-den); } 32
  • 33. Chi Square: Score per Feature 33
  • 34. Chi Square Feature Selection int[] featureCounts = new int[ numFeatures ]; int numLabels = labelIndex.Count; int[] classTotals = new int[ numLabels ]; // instances with that label. float[] classPriors = new float[ numLabels ]; // class priors: classTotals[label]/numInstances. int[,] counts = new int[ numLabels, numFeatures ]; // (label,feature) co-occurrence counts. int numInstances = instances.Count; ... float[] weightedChiSquareScore = new float[ numFeatures ]; for (int f = 0; f < numFeatures; f++) // f is a feature index { float score = 0.0f; for (int labelIdx = 0; labelIdx < numLabels; labelIdx++) { int a = counts[ labelIdx, f ]; int b = classTotals[ labelIdx ] - p; int c = featureCounts[ f ] - p; int d = numInstances - ( p + q + r ); if (p >= MIN_SUPPORT && q >= MIN_SUPPORT) { // MIN_SUPPORT = 5 score += classPriors[ labelIdx ] * Chi2( a, b, c, d ); } } weightedChiSquareScore[ f ] = score; } Do a pass over the data and collect above counts. Weighted average across all classes. 34
  • 35. ⇒ Summary: Encoding • Object representation is crucial. • Humans: good at suggesting features (templates). • Computers: good at filtering (feature selection). • Feature engineering: Ensuring systems use the “right” features. The system designer does not have to worry about which feature is more important or useful, and the job is left to the learning algorithm to assign appropriate weights to the corresponding features. The system designer’s job is to define a set of features that is large enough to represent most of the useful information, yet small enough to be manageable for the algorithms and the infrastructure. 35
  • 36. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 36
  • 38. Machine Learning: Representation classifier prediction (response/dependent variable). Can be qualitative/quantitative (classification/regression). Complex decision making: input/independent variable 38
  • 40. TRAINING Machine Learning Online System object encoded with features classifier prediction (response/dependent variable) Model Offline Training Sub-system 40
  • 41. Classes of Learning Problems • Classification: Assign a category to each item (Chinese | French | Indian | Italian | Japanese restaurant). • Regression: Predict a real value for each item (stock/currency value, temperature). • Ranking: Order items according to some criterion (web search results relevant to a user query). • Clustering: Partition items into homogeneous groups (clustering twitter posts by topic). • Dimensionality reduction: Transform an initial representation of items into a lower-dimensional representation while preserving some properties (preprocessing of digital images). 41
  • 42. ML Terminology • Examples: Items or instances used for learning or evaluation. • Features: Set of attributes represented as a vector associated with an example. • Labels: Values or categories assigned to examples. In classification the labels are categories; in regression the labels are real numbers. • Target: The correct label for a training example. This is extra data that is needed for supervised learning. • Output: Prediction label from input set of features using a model of the machine learning algorithm. • Training sample: Examples used to train a machine learning algorithm. • Validation sample: Examples used to tune parameters of a learning algorithm. • Model: Information that the machine learning algorithm stores after training. The model is used when predicting the output labels of new, unseen examples. • Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage. • Loss function: A function that measures the difference/loss between a predicted label and a true label. We will design the learning algorithms so that they minimize the error (cumulative loss across all training examples). • Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels. The learning algorithm chooses one function among those in the hypothesis set to return after training. Usually we pick a class of functions (e.g., linear functions) parameterized by a set of free parameters (e.g., coefficients of the linear function) and pinpoint the final hypothesis by identifying the parameters that minimize the error. • Model selection: Process for selecting the free parameters of the algorithm (actually of the function in the hypothesis set). 42
  • 43. Classification − + + + + + + + + + + + − − − − − − − − − − − − decision boundary Yes, this is mysterious at this point. 43
  • 45. One-Versus-All (OVA) For each category in turn, create a binary classifier where an instance in the data belonging to the category is considered a positive example, all other examples are considered negative examples. Given a new object, run all these binary classifiers and see which classifier has the “highest prediction”. The scores from the different classifiers need to be calibrated! 45
  • 46. One-Versus-One (OVO) For each pair of classes, create binary classifier on data labeled as either of the classes. How many such classifiers? Given a new instance run all classifiers and predict class with maximum number of wins. 46
  • 47. Errors “Nobody is perfect, but then again, who wants to be nobody.” #misclassified examples (penalty score of 1 for every misclassified example). Average error across all instances. Goal: Minimize the Error. Beneficial to have differentiable loss function. 47
  • 48. Error: Function of the Parameters The cumulative error across all instances is a function of the parameters. 2 1 48
  • 49. Evaluation • Motivation: – Benchmark algorithms (which system is better). – Tuning parameters during training. 49
  • 50. Evaluation Measures GeneralizationError: Probability to misclassify an instance selected according to the distribution of the labeled instance space TrainingError: Percentage of training examples which are correctly classified. Optimistically biased estimate especially if the inducer over-fits the (training) data. Empirical estimation of the generalization error: • Heldout method • Re-sampling: 1. Random resampling 2. Cross-validation 50
  • 51. Precision, Recall and F-measure Let’s consider binary classification: Space of all instances Instances identified as positive by the system. Positive instances in reality. System identified these as positive but got them wrong (false positive). System identified these as positive but got them correct (true positive). System identified these as negative but got them wrong (false negative). System identified these as negative and got them correct (true negative). General Setup 51
  • 52. Accuracy, Precision, Recall, and F-measure Definitions FP: false positives TP: true positives FN: false negatives TN: true negatives Precision: Recall: Accuracy: F-measure: Harmonic mean of precision and recall 52
  • 53. Accuracy vs. Prec/Rec/F-meas Accuracy can be misleading for evaluating a model with an imbalanced distribution of the class. When there are more majority class instances than minority class instances, predicting always the majority class gives good accuracy. Precision and recall (together) are better indicators. As a single, aggregate number f-measure favors the lower of the precision or recall. 53
  • 54. Extreme Cases for Precision & Recall TP: true positive FN: false negatives TN: true negatives system actual all instances TP: true positives system actual all instances FP: false positives Precision can be traded for recall and vice versa.54
  • 55. Sensitivity & Specificity FP: false positives TP: true positives FN: false negatives TN: true negatives [same as recall; aka true positive rate] False positive rate: Definitions [aka true negative rate] False negative rate: 55
  • 56. Venn Diagrams John Venn (1880) “On the Diagrammatic and Mechanical Representation of Propositions and Reasonings”, Philosophical Magazine and Journal of Science, 5:10(59). These visualization diagrams were introduced by John Venn: What if there are three classes? Four classes? Six classes? With more classes our visual intuitions are helping less and less. A subtle point: These are just the actual/real classes without the system classes drawn on top! 56
  • 57. Confusion Matrix Predicted class A Predicted class B Predicted class C Actual class A Number of instances in the actual class A AND predicted as belonging to class A. Number of instances in the actual class A BUT predicted as belonging to class B. … Total number of actual instances of class A Actual class B … … … Total number of actual instances of class B Actual class C … … … Total number of actual instances of class C Total number of instances predicted as class A Total number of instances predicted as class B Total number of instances predicted as class C Total number of instances Shows how the predictions of instances of an actual class are distributed across all classes. Here is an example confusion matrix for three classes: Counts on the diagonal are the true positives for each class. Counts not on the diagonal are errors. Confusion matrices can handle many classes. 57
  • 58. Confusion Matrix: Accuracy, Precision and Recall Predicted class A Predicted class B Predicted class C Actual class A 50 80 70 200 Actual class B 40 140 120 300 Actual class C 120 220 160 500 210 440 350 1000 Given a confusion matrix, it’s easy to compute accuracy, precision and recall: Confusion matrices can, themselves, be confusing sometimes 58
  • 59. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 59
  • 60. LINEAR MODELS Why?– Linear models are good way to learn about core ML concepts. 60
  • 61. Refresher: Vectors point point vector vector vector points are also vectors. sum of vectors Equation of the line. Can be re-written as: vector notation transpose 61
  • 62. Refresher: Vectors (2) Equation of the line. Can be re-written as: vector notation Normal vector. 62
  • 63. Refresher: Dot Product float DotProduct(float[] v1, float[] v2) { float sum = 0.0; for(int i=0; i< a.Length; i++) sum+= v1[i] * v2[i]; return sum; } 63
  • 65. sgn Function In mathematics: We will use: Informally drawn as: 65
  • 66. Two Linear Models Perceptron Linear regression The features of an object have associated weights indicating their importance. Signal: 66
  • 67. Why “Regression”? Why the term for quantitative output prediction is “regression”? “That same year [1875], [Francis] Galton decided to recruit some of his friends for an experiment with sweet peas. He distributed seeds among seven of them, asking them to plant the seeds and return the offspring. Galton measured the baby seeds and compared their diameters to those of their parents. He noticed a phenomenon that initially seems counter-intuitive: the large seeds tended to produce smaller offspring, and the small seeds tended to produce larger offspring. A decade later he analyzed data from his anthropometric laboratory and recognized the same pattern with human heights. After measuring 205 pairs of parents and their 928 adult children, he saw that exceptionally tall parents had kids who were generally shorter than they were, while exceptionally short parents had children who were generally taller than their parents. After reflecting upon this, we can understand why it must be the case. If very tall parents always produced even taller children, and if very short parents always produced even shorter ones, we would by now have turned into a race of giants and midgets. Yet this hasn't happened. Human populations may be getting taller as a whole – due to better nutrition and public health – but the distribution of heights within the population is still contained. Galton called this phenomenon ‘regression towards mediocrity in hereditary stature’. The concept is now more generally known as regression to the mean.” [A.Bellos pp.375] 67
  • 68. On-Line (Sequential) Learning • On-line = process one example at a time. • Attractive for large scale problems. Objective: Minimize cumulative loss: return parameters iteration (epoch/time). Compute loss. 68
  • 69. On-Line (Sequential) Learning (2) Sometimes written out more explicitly: return parameters # passes over the data. return parameters for each data item. 69
  • 70. Perceptron − + + + + + + + + + + + − − − − − − − − − − − − − + + + + + + + + + + + − − − − − − − − − − − − Linearly separable data: Non-linearly separable data: + + + − − − 70
  • 71. First: Perceptron Update Rule − + − + + Simplification: Lines pass through origin. 71
  • 73. Perceptron Learning Algorithm iteration (epoch/time). return parameters 73
  • 74. Perceptron Learning Algorithm return parameters (algorithm makes multiple passes over data.) 74
  • 75. Perceptron Learning Algorithm (PLA) Update weights: while( mis-classified examples exist ): Misclassified example means: With the current weights 1. A challenge: Algorithm will not terminate for non-linearly separable data (outliers, noise). 2. Unstable: jump from good perceptron to really bad one within one update. 3. Attempting to minimize: NP-hard. 75
  • 77. Looks Simple – Does It Work? Number of updates by the Perceptron Algorithm where: Margin-based upper bound on updates: Remarkable: Does not depend on dimension of feature space! Fact: 77
  • 78. Compact Model Representation void Save( StreamWriter w, int labelIdx, float[] weights ) { w.Write( labelIdx ); int previousIndex = 0; for (int i = 0; i < weights.Length; i++) { if (weights[ i ] != 0.0f) { w.Write( " " + (i - previousIndex) + " " + weights[ i ] ); previousIndex = i; } } } Use float instead of double: Store only non-zero weights (and indices): Store non-zero weights and diff of indices: Difference of indices. Remember last index where the weight was non-zero . 78
  • 79. Linear Classification Solutions − + + + + + + + + + + + − − − − − − − − − − − − Different solutions (infinitely many) 79
  • 80. The Pocket Algorithm A better perceptron algorithm: Keep track of the error and update weights when we lower the error. Compute error. Expensive step! Only update the best weights if we lower the error! Access to the entire data needed! 80
  • 81. Voted Perceptron • Training as the usual perceptron algorithm (with some extra book-keeping). • Decision rule: iterations 81
  • 82. Dual Perceptron: Intuitions − + − + separating line. + + + + + ++ − − − − − − normal vector 82
  • 83. Dual Perceptron return parameters (algorithm makes multiple passes over data.) Decision rule: 83
  • 84. Exclusive OR (XOR) Function Truth table: Inputs in and color-coding of the output: Challenge: The data is not linearly separable (no straight line can be drawn that separates the green from the blue points). ??? 84
  • 85. Solution for the Exclusive OR (XOR) We introduce another input dimension: Now the data is linearly separable: 85
  • 86. Winnow Algorithm iteration (epoch). return parameters Normalizing constant. Multiplicative update. 86
  • 87. Training, Test Error and Complexity Test error Training error Model complexity 87
  • 88. Logistic Regression Logistic function: Target: Data does not give the probability explicitly: 88
  • 90. RefresherDerivative: Partial derivative: Gradient (derivatives with respect to each component): Gradient of the error: This is a vector and we can compute it at a point. Chain rule: 90
  • 92. Math Fact The gradient of the error: (a vector in weight space) specifies the direction of the argument that leads to the steepest increase for the value of the error. The negative of the gradient gives the direction of the steepest decrease. Negative gradient (see next slides). 92
  • 93. Computing the Gradient Because gradient is a linear operator. 93
  • 94. (Batch) Gradient Descent Compute gradient: Update weights: repeat max #iterations; marginal error improvement; and small value for the error. 94
  • 95. Punch Line classification rule. The new object is in the class if: 95
  • 98. Robust Risk Minimization input vector label training examples weight vector bias continuous linear model Prediction rule: Classification error: Notation: 98
  • 99. Robust Classification Loss Parameter estimation: Hinge loss: Robust classification loss: 99
  • 101. Confidence and Regularization smaller λ corresponds to a larger A. Confidence Regularization: Unconstrained optimization (Lagrange multiplier): 101
  • 102. Robust Risk Minimization Go over the training data. 102
  • 103. Learning Curve • Plots evaluation metric against fraction of training data (on the same test set!). • Highest performance bounded by human inter annotator agreement (ITA). • Leveling off effect that can guide us how much data is needed. 0 10 20 30 40 50 60 70 80 90 100 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% Percentage of data used for each experiment. Experiment with 50% of the training data yields evaluation number of 70. 103
  • 104. Summary • Examples of ML • Categorization • Object encoding • Linear models: – Perceptron – Winnow – Logistic Regression – RRM • Engineering aspects of ML systems 104
  • 106. Goal • Quantify how popular an entity is. Motivation: • Used in the new local search relevance metric. 106
  • 108. POPULARITY IN LOCAL SEARCH 108
  • 109. Popularity • Output a popularity score (regression) • Ensemble methods • Tree base procedure (non-linear) • Boosting 109
  • 110. When is a Local Entity Popular? • Definition: Visited by many people in the context of alternative choices. • Is the popularity of restaurants the same as the popularity of movies, etc.? • How to operationalize “visit”, “many”, “alternative choices”? – Initially we are using: popular means clicked more. • Going forward we will use: – “visit” = click given an impression. – “choice” = density of entities in the same primary category. – “many” = fraction of clicks from impressions. 110
  • 111. Local Entity Popularity The model then will be regression: 111
  • 112. Not all Clicks are Born the Same • Click in the context of a named query: – Can even be argued we are not satisfying the user information needs (and they have to click further to find out what they are looking for). • Click in the context of a category query: – Much more significant (especially when alternative results are present). 112
  • 113. Local Entity Popularity • Popularity & 1st page , current ranker. • Entities without URL. • Newly created entities. • Clicks vs. mouseovers. • Scenario: 50 French restaurants; best entity has 2k clicks. 2 Italian restaurants; best entity has 2k clicks. The French entity is more popular because of higher available choice. 113
  • 114. Entity Representation 8000 … 4000 65 4.7 73 … 1 …9000 feature valuesTarget Machine learning (training) instance 114
  • 115. POISSON REGRESSION Why?– We will practice the ML machinery on a different problem, re-iterating the concepts. Poisson regression is an example of log-linear models good for modeling counts (e.g., number of visitors to a store in a certain time). 115
  • 116. Setup response/outcome variable These counts for our scenario are the clicks on the web page. A good way to model counts of observations is using the Poisson distribution. explanatory variables 116
  • 117. Poisson Distribution: Preliminaries The Poisson distribution realistically describes the pattern of requests over time in many client-server situations. Examples are: incoming customers at a bank, calls into a company’s telephone exchange, requests for storage/retrieval services from a database server, and interrupts to a central processor. It also has higher- dimensional applications, such as the spatial distribution of defects on integrated circuit wafers and the volume distribution of contaminants in well water. In such cases, the “events”, which are request arrivals or defects occurrences, are independent. Customers do not conspire to achieve some special pattern in their access to a bank teller; rather they operate as independent agents. The manufacture of hard disks or integrated circuits introduces unavoidable defects because the process pushes the limits of geometric tolerances. Therefore, a perfectly functional process will still occasionally produce a defect, such as a small area on the disk surface where the magnetic material is not spread uniformly or a shorted transistor on an integrated circuit chip. These errors are independent in the sense that a defect at one point does not influence, for better or worse, the chance of a defect at another point. Moreover, if the time interval or spatial area is small, the probability of an event is correspondingly small. This is a characterizing feature of a Poisson distribution: event probability decreases with the window of opportunity and is linear in the limit. A second characterizing feature, negligible probability of two or more events in a small interval, is also present in the mentioned examples. 117
  • 119. Poisson Distribution: Mental Steps This comes from the theory of Generalized Linear Models (GLM). log linear combination of the input features. Hence, the name log-linear model. 119
  • 122. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 122
  • 123. DECISION TREES Why?– DTs are an influential development in ML. Combined in ensemble they provide very competitive performance. We will see ensemble techniques in the next part. 123
  • 124. Decision Trees Binary partitioning of the data during training (navigating to leaf node during testing). prediction Training instances are more homogeneous in terms of the output variable (more pure) compared to ancestor nodes. Stopping when instances are homogeneous or small number of instances. Training instances. Color reflects output variable (classification example). 124
  • 125. Decision Tree: Example Parents Visiting Weather Money Cinema CinemaShopping Stay in PoorRich RainyWindySunny NoYes Play tennis Attribute/feature/predicate Value of the attribute Predicted classes. (classification example with categorical features) Branching factor depends on the number of possible values for the attribute (as seen in the training set). 125
  • 126. Entropy (needed for describing how an attribute is selected.) Example 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.2 0.4 0.6 0.8 1 126
  • 127. Selecting an Attribute: Information Gain Measure of expected reduction in entropy. instances attribute See Mitchell’97, p.59 for an example.127
  • 128. Splitting ‘Hairs’ ? If there are only a small number of instances do not split the node further (statistics are unreliable). If there are no instances in the current node, inherit statistics (majority class) from parent node. If there is more training data, the tree can be “grown” bigger.128
  • 130. Alternative Attribute Selection: Gain Ratio instances attribute [Quinlan 1986] Examples: all different values. 130
  • 131. Alternative Attribute Selection: GINI Index [Corrado Gini: Italian statistician] 131
  • 132. Space of Possible Decision Trees Number of possible trees: 132
  • 133. Decision Trees and Rule Systems Path from each leaf node to the root represents a conjunctive rule: Cinema CinemaShopping Stay in PoorRich RainyWindySunny NoYes Play tennis if (ParentsVisiting==No) & (Weather==Windy) & (Money==Poor) then Cinema. Parents Visiting Weather Money 133
  • 134. Decision Trees • Different training sample -> different resulting tree (different structure). • Learning does (conditional) feature selection. 134
  • 135. Regression Trees Like classification trees but the prediction is a number (as suggested by “regression”). 1. How do we split? 2. When to stop? predictions (constants) 135
  • 136. Regression Trees: How to Split 136
  • 137. Regression Trees: Pruning Tree operation where a pre-terminal gets its two leaves collapsed: 137
  • 138. Regression Trees: How to Stop 1. Don’t stop. 2. Build big tree. 3. Prune. 4. Evaluate sub-trees. 138
  • 139. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 139
  • 141. ENSEMBLE Ensemble Methods object encoded with features classifiers predictions (response/dependent variable) … … majority voting/averaging 141
  • 142. Where the Systems Come from Sequential ensemble scheme: … 142
  • 143. Contrast with Bagging Non-sequential ensemble scheme: DATA Datai are independent of each other (likewise for Sytemi).143
  • 144. Base Procedure: Decision Tree SystemData Binary partitioning of the data during training (navigating to leaf node during testing). prediction Training instances are more homogeneous in terms of the output variable (more pure) compared to ancestor nodes. Stopping when instances are homogeneous or small number of instances. Training instances. Color reflects output variable (classification example). 144
  • 145. TRAINING DATA Ensemble Scheme base procedure base procedure Original data base procedure Weighted data base procedure Weighted data Final prediction (regression) Small systems. Don’t need to be perfect. 145
  • 146. Ada Boost (classification) Original data Weighted data Weighted data Weighted data normalizing factor. final prediction. 146
  • 148. Binary Classifier • Constraint: – Must not have all zero clicks for current week, previous week and week before last [shopping team uses stronger constraint: only instances with non-zero clicks for current week]. • Training: – 1.5M instances. – 0.5M instances (validation). • Feature extraction: – 4.82mins (Cosmos job). • Training time: – 2hrs 20mins. • Testing: – 10k instances: 1sec. 148
  • 149. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 149
  • 150. POPULARITY EVALUATION How do we know we have a good popularity? 150
  • 151. Rank Correlation Metrics The two rankings are the same. The two rankings are reverse of each other. • • • • • • Actual input is a set of objects with two rank scores (ties are possible). 151
  • 152. Kendall’s Tau Coefficient Considers concordant/discordant pairs in two rankings (each ranking w.r.t. the other): 152
  • 153. What is a concordant pair? a a b c c b Need to have the same sign 153
  • 154. Kendall Tau: Example A B C D C D A B Pairs: (discordant pairs in red): Observation: Total number of discordant pairs = 2x the discordant pairs in one ranking w.r.t. the other. 154
  • 155. Spearman’s Coefficient Considers ranking differences for the same object: a a b c c b Example: 155
  • 156. Rank Intuitions: Setup The sequence <3,1,4,10,5,9,2,6,8,7> is sufficient to encode the two rankings. 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 156
  • 157. Rank Intuitions: Pairs Rankings in complete agreement. Rankings in complete dis-agreement. 157
  • 158. Rank Intuitions: Spearman Segment lengths represent R1 rank scores. 158
  • 159. Rank Intuitions: Kendall Segment lengths represent R1 rank scores. 159
  • 160. What about ties? The position of an object within set of objects with the same scores in the rankings affects the rank correlation. For example, red positioning of oj leads to lower Spearman’s coefficient; green – higher. 160
  • 161. Ties • Kendall: Strict discordance: • Spearman: – Can use per entity upper and lower bounds. – Do as in the Olympics: 161
  • 162. Ties: Kendall TauB http://en.wikipedia.org/wiki/Kendall_tau#Tau-b where: is the number of concordant pairs. is the number of discordant pairs. is the number of objects in the two rankings. 162
  • 163. Uses of popularity Popularity can be used to augment gain in NDCG by linearly scaling it: 1 3 7 15 1 2 3 4 31 5 perfectexcellentgoodfairpoor 163
  • 164. Next Steps • How to determine popularity of new entities – Challenge: No historical data. – Usually there is an initial period of high popularity (e.g., a new restaurant is featured in local paper, promotions, etc.). • Good abandonment (no user clicks but good entity in terms of satisfying the user information needs, e.g., phone number). – Use number impressions for named queries. 164
  • 165. References 1. Yaser S. Abu-Mostafa, Malik Magdon-Ismail & Hsuan-Tien Lin (2012) Learning From Data. AMLBook. [link] 2. Ethem Alpaydin (2009) Introduction to Machine Learning. 2nd edition. Adaptive Computation and Machine Learning series. MIT Press. [link] 3. David Barber (2012) Bayesian Reasoning and Machine Learning. Cambridge University Press. [link] 4. Ricardo Baeza-Yates & Berthier Ribeiro-Neto (2011) Modern Information Retrieval: The Concepts and Technology behind Search. 2nd Edition. ACM Press Books. [link] 5. Alex Bellos (2010) Alex's Adventures in Numberland. Bloomsbury: New York. [link] 6. Ron Bekkerman, Mikhail Bilenko & John Langford (2011) Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press. [link] 7. Christopher M. Bishop (2007) Pattern Recognition and Machine Learning. Information Science and Statistics. Springer. [link] 8. George Casella & Roger L. Berger (2001) Statistical Inference. 2nd edition. Duxbury Press. [link] 9. Anirban DasGupta (2011) Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer Texts in Statistics. Springer. [link] 10. Luc Devroye, László Györfi & Gábor Lugosi (1996) A Probabilistic Theory of Pattern Recognition. Springer. [link] 11. Richard O. Duda, Peter E. Hart & David G. Stork (2000) Pattern Classification. 2nd Edition. Wiley-Interscience. [link] 12. Trevor Hastie, Robert Tibshirani & Jerome Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. Springer Series in Statistics. Springer. [link] 13. James L. Johnson (2008) Probability and Statistics for Computer Science. Wiley-Interscience. [link] 14. Daphne Koller & Nir Friedman (2009) Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning series. MIT Press. [link] 15. David J. C. MacKay (2003) Information Theory, Inference and Learning Algorithms. Cambridge University Press. [link] 16. Zbigniew Michalewicz & David B. Fogel (2004) How to Solve It: Modern Heuristics. 2nd edition. Springer. [link] 17. Tom M. Mitchell (1997) Machine Learning. McGraw-Hill (Science/Engineering/Math). [link] 18. Mehryar Mohri, Afshin Rostamizadeh & Ameet Talwalkar (2012) Foundations of Machine Learning. Adaptive Computation and Machine Learning series. MIT Press. [link] 19. Lior Rokach (2010) Pattern Classification Using Ensemble Methods. World Scientific. [link] 20. Gilbert Strang (1991) Calculus. Wellesley-Cambridge Press. [link] 21. Larry Wasserman (2010) All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer. [link] 22. Sholom M. Weiss, Nitin Indurkhya & Tong Zhang (2010) Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer. [link] 165
  • 166. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 166
  • 167. SEQUENCE LABELING: HIDDEN MARKOV MODELS (HMMs) 167
  • 168. 168 Outline • The guessing game • Tagging preliminaries • Hidden Markov Models • Trellis and the Viterbi algorithm • Implementation (Python) • Complexity of decoding • Parameter estimation and smoothing • Second order models
  • 169. 169 The Guessing Game • A cow and a duck write an email message together. • Goal – figure out which word is written by which animal. The cow/duck illustration of HMMs is due to Ralph Grishman (NYU).
  • 170. 170 What’s the Big Deal? • The vocabularies of the cow and the duck can overlap, and it is not clear a priori who wrote a certain word!
  • 171. 171 The Game (cont) [Figure: the message moo hello quack, first with all three authors unknown (? ? ?), then partially resolved: moo attributed to COW, quack to DUCK, hello still unknown.]
  • 172. The Game (cont) [Figure: the completed labeling, e.g., moo/COW hello/COW quack/DUCK.] 172
  • 173. What about the Rest of the Animals? [Figure: with more animals (ZEBRA, PIG, DUCK, COW, ANT), every word in the message could in principle have been written by any of them, so the number of possible label sequences grows combinatorially.] 173
  • 174. A Game for Adults • Instead of guessing which animal is associated with each word guess the corresponding POS tag of a word. Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./. 174
  • 175. 175 POS Tags "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD","NN", "NNS","NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB", "#", "$", ".",",", ":", "(", ")", "`", "``", "'", "''"
  • 176. 176 Tagging Preliminaries • We want the best set of tags for a sequence of words (a sentence) • W — a sequence of words • T — a sequence of tags • $\hat{T} = \arg\max_T P(T|W)$
  • 177. 177 Bayes’ Theorem (1763): $P(T|W) = \frac{P(W|T)\,P(T)}{P(W)}$, i.e., posterior = (likelihood x prior) / marginal likelihood. Reverend Thomas Bayes — Presbyterian minister (1702-1761)
  • 178. 178 Applying Bayes’ Theorem • How do we approach P(T|W)? • Use Bayes’ theorem: $\arg\max_T P(T|W) = \arg\max_T \frac{P(W|T)\,P(T)}{P(W)}$ • So what? Why is it better? • Ignore the denominator (and the question): $\arg\max_T P(T|W) = \arg\max_T P(W|T)\,P(T)$
  • 179. 179 Tag Sequence Probability. How do we get the probability P(T) of a specific tag sequence T? • Count the number of times the sequence occurs and divide by the number of sequences of that length — not likely! – Use the chain rule.
  • 180. 180 Chain Rule. P(T) is a product of the probabilities of the N-grams that make it up: $P(T) = P(t_1,\ldots,t_n) = P(t_1)\,P(t_2|t_1)\,P(t_3|t_1,t_2)\cdots P(t_n|t_1,\ldots,t_{n-1})$, each factor conditioned on the history. Make a Markov assumption — the current tag depends on the previous one only: $P(t_1,\ldots,t_n) \approx P(t_1)\prod_{i=2}^{n} P(t_i|t_{i-1})$
  • 181. 181 Transition Probabilities • Use counts from a large hand-tagged corpus. • For bi-grams, count all the $t_{i-1}\,t_i$ pairs: $P(t_i|t_{i-1}) = \frac{c(t_{i-1}\,t_i)}{c(t_{i-1})}$ • Some counts are zero – we’ll use smoothing to address this issue later.
  • 182. 182 What about P(W|T)? • First it's odd—it is asking the probability of seeing “The white horse” given “Det Adj Noun”! – Collect up all the times you see that tag sequence and see how often “The white horse” shows up … • Assume each word in the sequence depends only on its corresponding tag: $P(W|T) \approx \prod_{i=1}^{n} P(w_i|t_i)$ (emission probabilities)
  • 183. 183 Emission Probabilities • What proportion of times is the word $w_i$ associated with the tag $t_i$ (as opposed to another word): $P(w_i|t_i) = \frac{c(w_i, t_i)}{c(t_i)}$
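Both relative-frequency estimates can be read off a hand-tagged corpus with simple counting. The sketch below is illustrative (the toy corpus and the 'start'/'end' boundary handling mirror the cow/duck example used later); it produces transition and emission tables shaped like the p and a dictionaries in the Python implementation further down.

from collections import defaultdict

def estimate_parameters(tagged_sentences):
    # tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    transition_counts = defaultdict(lambda: defaultdict(int))   # c(t_{i-1} t_i)
    emission_counts = defaultdict(lambda: defaultdict(int))     # c(w_i, t_i)
    tag_counts = defaultdict(int)                               # c(t_i)
    for sentence in tagged_sentences:
        prev_tag = 'start'
        tag_counts['start'] += 1
        for word, tag in sentence:
            transition_counts[prev_tag][tag] += 1
            emission_counts[tag][word] += 1
            tag_counts[tag] += 1
            prev_tag = tag
        transition_counts[prev_tag]['end'] += 1
    # P(t_i | t_{i-1}) = c(t_{i-1} t_i) / c(t_{i-1});  P(w_i | t_i) = c(w_i, t_i) / c(t_i)
    p = {t1: {t2: c / tag_counts[t1] for t2, c in nxt.items()}
         for t1, nxt in transition_counts.items()}
    a = {t: {w: c / tag_counts[t] for w, c in words.items()}
         for t, words in emission_counts.items()}
    return p, a

corpus = [[('moo', 'cow'), ('hello', 'cow'), ('quack', 'duck')],
          [('hello', 'duck'), ('quack', 'duck')]]
p, a = estimate_parameters(corpus)   # dicts shaped like the p and a tables used below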
  • 185. Hidden Markov Models • Stochastic process: a sequence $X_1, X_2, \ldots$ of random variables based on the same sample space $\Omega$. • Probabilities for the first observation: $P(X_1 = x_j)$ for each outcome $x_j$. • Next step given the previous history: $P(X_{t+1} = x_{i_{t+1}} \mid X_1 = x_{i_1}, \ldots, X_t = x_{i_t})$ 185
  • 186. 186 Markov Chain • A Markov Chain is a stochastic process with the Markov property: $P(X_{t+1} = x_{i_{t+1}} \mid X_1 = x_{i_1}, \ldots, X_t = x_{i_t}) = P(X_{t+1} = x_{i_{t+1}} \mid X_t = x_{i_t})$ • Outcomes are called states. • Probabilities for the next step – a weighted finite state automaton.
  • 187. 187 State Transitions w/ Probabilities [Diagram: START -> COW 1.0; COW -> COW 0.5, COW -> DUCK 0.3, COW -> END 0.2; DUCK -> DUCK 0.5, DUCK -> COW 0.3, DUCK -> END 0.2]
  • 188. 188 Markov Model. A Markov chain where each state can output signals (like “Moore machines”): [Diagram: same transitions as before, with emissions COW: moo 0.9, hello 0.1; DUCK: hello 0.4, quack 0.6; START: ^ 1.0; END: $ 1.0]
  • 189. 189 The Issue Was • A given output symbol can potentially be emitted by more than one state — omnipresent ambiguity in natural language.
  • 190. Markov Model 190 • Finite set of states: $S = \{s_1, \ldots, s_m\}$ • Signal alphabet: $\Sigma = \{\sigma_1, \ldots, \sigma_k\}$ • Transition matrix: $P = [p_{ij}]$ where $p_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$ • Emission probabilities: $A = [a_{ij}]$ where $a_{ij} = P(\sigma_j \mid X_t = s_i)$ • Initial probability vector: $v = [v_1, \ldots, v_m]$ where $v_j = P(X_1 = s_j)$
  • 192. Hidden Markov Model • A Markov Model for which it is not possible to observe the sequence of states. • $S \in S^*$: unknown — sequence of states (tags) • $O \in \Sigma^*$: known — sequence of observations (words) • Decoding: $\arg\max_S P(S|O)$ 192
  • 193. 193 The State Space [Diagram: the model unrolled over the observation sequence moo hello quack — at each position the hidden state is COW or DUCK, with the transition probabilities (0.5 within a state, 0.3 across) and the emission probabilities from the previous slide.] More on how the probabilities come about (training) later.
  • 194. 194 Optimal State Sequence: The Viterbi Algorithm. We define the joint probability of the most likely state sequence from time 1 to time t ending in state $s_i$, together with the observed sequence $O_{\le t}$ up to time t: $\delta_t(i) = \max_{S_{\le t-1}} P(S_{\le t-1}, X_t = s_i;\, O_{\le t}) = \max_{s_{i_1},\ldots,s_{i_{t-1}}} P(s_{i_1},\ldots,s_{i_{t-1}}, X_t = s_i;\, O_{\le t})$
  • 195. 195 Key Observation. The most likely partial derivation leading to state $s_i$ at position t consists of: – the most likely partial derivation leading to some state $s_{i_{t-1}}$ at the previous position t-1, – followed by the transition from $s_{i_{t-1}}$ to $s_i$.
  • 196. Viterbi (cont). Base case: $\delta_1(i) = v_i\, a_{i k_1}$, where $v_i = P(X_1 = s_i)$ and $a_{i k_1} = P(\sigma_{k_1} \mid s_i)$ for the first observed signal $\sigma_{k_1}$. We will show that: $\delta_{t+1}(j) = \big[\max_i \delta_t(i)\, p_{ij}\big]\, a_{j k_{t+1}}$ 196
  • 198. 198 Back Pointers • The predecessor of state $s_j$ in the path corresponding to $\delta_{t+1}(j)$: $\psi_{t+1}(j) = \arg\max_{1 \le i \le m} \delta_t(i)\, p_{ij}$ • Optimal state sequence: $s_n^{*} = \arg\max_{1 \le k \le m} \delta_n(k)$, then $s_t^{*} = \psi_{t+1}(s_{t+1}^{*})$ for $t = n-1, \ldots, 1$
  • 199. 199 The Trellis [Viterbi table for the observation sequence ^ moo hello quack $ — best score per state and time: t=0 (^): START 1.0; t=1 (moo): COW 0.9, DUCK 0; t=2 (hello): COW 0.045, DUCK 0.108; t=3 (quack): COW 0, DUCK 0.0324 (the alternative path through COW at t=2 scores 0.0081); t=4 ($): END 0.00648]
  • 200. 200 Implementation (Python)
observations = ['^','moo','hello','quack','$']   # signal sequence
states = ['start','cow','duck','end']
# Transition probabilities - p[FromState][ToState] = probability
p = {'start': {'cow':1.0},
     'cow':   {'cow':0.5, 'duck':0.3, 'end':0.2},
     'duck':  {'duck':0.5, 'cow':0.3, 'end':0.2}}
# Emission probabilities; special emission symbol '$' for 'end' state
a = {'cow':  {'moo':0.9, 'hello':0.1, 'quack':0.0, '$':0.0},
     'duck': {'moo':0.0, 'hello':0.4, 'quack':0.6, '$':0.0},
     'end':  {'moo':0.0, 'hello':0.0, 'quack':0.0, '$':1.0}}
  • 201. 201 Implementation (Viterbi)
n = len(observations)
# Initializing viterbi table row by row; v[state][time] = prob
v = {}
for s in states: v[s] = [0.0] * n
# Initializing back pointers
backPointer = {}
for s in states: backPointer[s] = [""] * n
v['start'][0] = 1.0
for t in range(n-1):                  # populate column t+1 in v
    for s in states[:-1]:             # 'end' state has no outgoing transitions
        # only consider 'start' state at time 0
        if t == 0 and s != 'start': continue
        for s1 in p[s].keys():        # s1 is the next state
            newScore = v[s][t] * p[s][s1] * a[s1][observations[t+1]]
            if v[s1][t+1] == 0.0 or newScore > v[s1][t+1]:
                v[s1][t+1] = newScore
                backPointer[s1][t+1] = s
  • 202. 202 Implementation (Best Path)
# Now recover the optimal state sequence by following the back pointers from 'end'
state = 'end'
state_sequence = [state]
for t in range(n-1, 0, -1):
    state = backPointer[state][t]
    state_sequence = [state] + state_sequence
print("Observations....: ", observations)
print("Optimal sequence: ", state_sequence)
  • 203. 203 Complexity of Decoding • O(m²n) — linear in n (the length of the string) • Initialization: O(mn) • Back tracing: O(n) • Next step: O(m²) — for each current_state in s1..sm (at time t+1), for each prev_state in s1..sm (at time t), compute the value and compare with the best so far. • There are n next steps.
  • 204. 204 Parameter Estimation for HMMs • Need annotated training data (Brown, PTB). • Signal and state sequences both known. • Calculate observed relative frequencies. • Complications — sparse data problem (need for smoothing). • One can use only raw data too — Baum-Welch (forward-backward) algorithm.
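As one concrete option for the sparse-data problem, add-one (Laplace) smoothing of the emission estimate; this assumes the emission_counts and tag_counts tables from the counting sketch earlier and a vocabulary size V, and is just one of several possible smoothing schemes.

def smoothed_emission(word, tag, emission_counts, tag_counts, V):
    # Add-one (Laplace) smoothing: every (word, tag) pair gets a pseudo-count of 1,
    # so unseen words still receive a small non-zero probability.
    return (emission_counts[tag].get(word, 0) + 1.0) / (tag_counts[tag] + V)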
  • 205. 205 Optimization • Build vocabulary of possible tags for words • Keep total counts for words • If a word occurs frequently (count > threshold) consider its tag set exhaustive • For frequent words only consider its tag set (vs. all tags) • For unknown words don’t consider tags corresponding to closed class words (e.g., DT)
  • 206. 206 Applications Using HMMs • POS tagging (as we have seen). • Chunking. • Named Entity Recognition (NER). • Speech recognition.
  • 207. 207 Exercises • Implement the training (parameter estimation). • Use a dictionary of valid tags for known words to constrain which tags are considered for a word. • Implement a second-order model. • Implement the decoder in Ruby.
  • 208. 208 Some POS Taggers • Alias-I: http://www.alias-i.com/lingpipe • AUTASYS: http://www.phon.ucl.ac.uk/home/alex/project/tagging/tagging.htm • Brill Tagger: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z • CLAWS: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html • Connexor: http://www.connexor.com/software/tagger • Edinburgh (LTG): http://www.ltg.ed.ac.uk/software/pos/index.html • FLAT (Flexible Language Acquisition Tool): http://lanaconsult.com • fnTBL: http://nlp.cs.jhu.edu/~rflorian/fntbl/index.html • GATE: http://gate.ac.uk • Infogistics: http://www.infogistics.com/posdemo.htm • Qtag: http://www.english.bham.ac.uk/staff/omason/software/qtag.html • SNoW: http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=POS • Stanford: http://nlp.stanford.edu/software/tagger.shtml • SVMTool: http://www.lsi.upc.edu/~nlp/SVMTool • TNT: http://www.coli.uni-saarland.de/~thorsten/tnt • Yamcha: http://chasen.org/~taku/software/yamcha/
  • 209. 209 References 1. Brants, Thorsten. 2000. TnT – A Statistical Part-of-speech Tagger. 6th Applied NLP Conference (ANLP-2000), 224-231, Seattle, U.S.A. 2. Charniak, Eugene, Curtis Hendrickson, Neil Jacobson & Mike Perkowitz. 1993. Equations for part-of-speech tagging. 11th National Conference on Artificial Intelligence, 784-789. Menlo Park: AAAI Press/MIT. 3. Krenn, Brigitte & Christer Samuelsson. 1997. Statistical Methods in Computational Linguistics, ESSLLI Summer School Lecture Notes, 11-22 August, Aix-en-Provence, France. 4. Rabiner, Lawrence R. 1989. A tutorial on Hidden Markov Models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, 256-286. 5. Samuelsson, Christer. 2000. Extending N-gram tagging to word graphs, Recent Advances in Natural Language Processing II, ed. by Nicolas Nicolov & Ruslan Mitkov. Current Issues in Linguistic Theory (CILT), vol. 189, pp 3-20. John Benjamins: Amsterdam/Philadelphia. 6. Shin, Jung Ho, Young S. Han & Key-Sun Choi. 1997. An HMM part-of-speech tagger for Korean with word-phrasal relations. Recent Advances in Natural Language Processing, ed. by Nicolas Nicolov & Ruslan Mitkov. Current Issues in Linguistic Theory (CILT) vol 136, pp 439-450. John Benjamins: Amsterdam/Philadelphia.
  • 210. Statistics Refresher • Outcome: Individual atomic results of a (non-deterministic) experiment. • Event: A set of results. • Probability: Limit of target outcome over number of experiments (frequentist view) or degree of belief (Bayesian view). • Normalization condition: Probabilities for all outcomes sum to 1. • Distribution: Probabilities associated with each outcome. • Random variable: Mapping of the outcomes to real numbers. • Joint distributions: Conducting several (possibly related) experiments and observing the results. Joint distribution states the probability for a combination of values of several random variables. • Marginal: Finding the distribution of a random variable from a joint distribution. • Conditional probability (Bayes’ rule): Knowing the value of one variable constrains the distribution of another. • Probability density functions: Probability that a continuous variable is in a certain range. • Probabilistic reasoning: Introduce evidence (set certain variables) and compute probabilities of interest (conditioned on this evidence). 210
  • 212. Intuitions about Scale. Weight in grams if the Earth were a black hole. Age of the universe in seconds. Number of cells in the human body (100 trillion). Number of neurons in the human brain. Standard Blu-ray disc size, XL 4-layer (128 GB). One year in seconds. Items in the Library of Congress (largest library in the world). Length of the Nile in meters (longest river). 212
  • 213. Acknowledgements • Bran Boguraev • Chris Brew • Jinho Choi • William Headden • Jingjing Li • Jason Kessler • Mike Mozer • Shumin Wu • Tong Zhang • Amir Padovitz • Bruno Bozza • Kent Cedola • Max Galkin • Manuel Reyes Gomez • Matt Hurst • John Langford • Priyank Singh 213