Machine Learning
with Applications in Categorization, Popularity and Sequence labeling
              (linear models, decision trees, ensemble methods, evaluation)
                                                      Dr. Nicolas Nicolov
                                                         <1st_last@yahoo.com>
Goals
• Introduce important ML concepts
• Illustrate ML techniques through examples in:
   – Categorization
   – Popularity
   – Sequence labeling




(tutorial aims to be self-contained and to explain the notation)

                                                                   2
Outline
•   Examples of applications of Machine Learning
•   Encoding objects with features
•   The Machine Learning framework
•   Linear models
     – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
     – Classification Decision Trees, Regression Trees
• Boosting
     – AdaBoost
• Ranking evaluation
     – Kendall tau and Spearman’s coefficient
• Sequence labeling
     – Hidden Markov Models (HMMs)

                                                                                 3
EXAMPLES OF MACHINE LEARNING
Why?– Get a flavor of the diversity of areas where ML is applied.




                                                                    4
Sequence Labeling
                        (like search query analysis)



                                                         Geo-Political Entity


   PER_     _PER_     _PER                 X           GPE


 George      W.       Bush         discussed           Iraq


<PER>George W. Bush</PER> discussed <GPE>Iraq</GPE>

George W. Bush discussed Iraq



                                                                     5
Spam

      www.dietsthatwork.com

www    .   dietsthatwork           .   com
                   further segmentation

www    .   diets that work             .   com
                  classification

              SPAM!




                                                 6
Tokenization
            What!?I love the iphone:-)



         What    !?   I   love   the   iphone   :-)




How difficult can that be? — 98.2% [Zhang et al. 2003]

                 NO TRESSPASSING
                 VIOLATORS WILL
                 BE PROSECUTED


                                                         7
NL Parsing
                                                                             syntactic structure




    [Dependency-parse diagram over the sentence below, with arcs labeled PREP, CONTR, POSS, SUBJ, DET, MOD, DOBJ, MANR.]


Unlike my sluggish Chevy the      Audi     handles the winding mountain roads superbly




                                                                                           8
State Transitions
                            [Diagram: the parser configuration (stack λ, buffer β) before and after each of the
                            four transitions LEFTARC, RIGHTARC, NOARC, and SHIFT.]

             using ML to make the decision which action to take
                                                                                        9
Two Ladies in a Men’s Club




                             10
SUBJ       IOBJ


                                    We      serve       men
                                                    We serve food to men.
                                                    We serve our community.
                                                    serve —IndirectObject men




       SUBJ            DOBJ


We            serve           men
We serve organic food.
We serve coffee to connoisseurs.
serve —DirectObject men




                                                                      11
Coreference
      Audi is an automaker that makes luxury cars
      and SUVs. The company was born in
      Germany .
           It was established by August Horch in
      1910. Horch had previously founded another
      company and his models were quite
      popular. Audi started with four cylinder
      models. By 1914, Horch 's new cars were
      racing and winning.
          August Horch left the Audi company in
      1920 to take a position as an industry
      representative for the German motor
      vehicle industry federation.
          Currently Audi is a subsidiary of the
      Volkswagen group and produces cars of
      outstanding quality.
                                                    12
Parts of Objects (Meronymy)




[…] the interior seems upscale with leatherette upholstery that looks and
feels better than the real cow hide found in more expensive vehicles, a
dashboard accented by textured soft-touch materials, a woven mesh
headliner, and other materials that give the New Beetle’s interior a
sense of quality. […] Finally, and a big plus in my book, both front seats were
height adjustable, and the steering column tilted and telescoped for
optimum comfort.
                                                                            13
Sentiment Analysis

                                            Positive   Negative



                    Xbox




                   Xbox




I love pineapple nearly as much as I hate bananas.

       POSITIVE sentiment regarding topic pineapple.

                                                                  14
Chinese Sentiment


  Sentence

  Car aspects   Sentiment categories




                                  15
Categorization
• High-level task:
  – Given a restaurant what is its restaurant sub-category?


• Encoding entities with features
• Feature selection                             non-standard order

• Linear models                                 “Though this be madness,
                                                yet there is method in't.”




                                                                     18
Roadmap
•   Examples of applications of Machine Learning
•   Encoding objects with features
•   The Machine Learning framework
•   Linear models
     – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
     – Classification Decision Trees, Regression Trees
• Boosting
     – AdaBoost
• Ranking evaluation
     – Kendall tau and Spearman’s coefficient
• Sequence labeling
     – Hidden Markov Models (HMMs)

                                                                                  19
ENCODING OBJECTS WITH FEATURES
Why?– ML algorithms are “generic”; most of them are cast as solutions around vector encodings of the
domain objects. Regardless of the ML algorithm we will need to represent/encode the domain objects as
feature vectors. How well we do this (the quality of features) directly impacts system performance.


                                                                                                        20
Flat
                                                                               Object
                                                                             Encoding



Can be a set;
object can belong                                                                           Number of
to several classes.                                                                       features can
                                                                                           be millions.

                 37    1       0        0       1       1       1       0         1   …




                      Machine learning (training) instance/example/observation.               21
Structured Objects
                          to Strings
                         to Features                                           Table can be quite large.



Structured object:                                                   Feature string    Feature index
                       Read as field “f2:f4” contains feature “a”.   *DEFAULT*                0
f1                                                                   …                        …
f2                                                                   f2:f4>a                 100
     f4   abcde
                             “f2:f4>a”                               f2:f4>b                 101
     f5                      “f2:f4>b”            uni-grams
                             “f2:f4>c”                               f2:f4>c                 102
f3                           …                                       …                        …
     f6                      “f2:f4>a_b”                             f2:f4>a_b               105
                             “f2:f4>b_c”          bi-grams
                             “f2:f4>c_d”                             f2:f4>b_c               106
                             …                                       f2:f4>c_d               107
                             “f2:f4>a_b_c”
                                                  tri-grams          …                        …
                             “f2:f4>b_c_d”
                                                                     f2:f4>a_b_c             109 22
Sliding Window (bi-grams)
                       SkyCity   at   the   Space     Needle
                                               add initial “^” and final “$” tokens

                   ^   SkyCity   at   the   Space     Needle         $

sliding window
                   ^   SkyCity   at   the   Space     Needle         $


                   ^   SkyCity   at   the   Space     Needle         $



                   ^   SkyCity   at   the   Space     Needle         $



                   ^   SkyCity   at   the   Space     Needle         $
                                                                                      23
Example: Feature Templates
public static List<string> NGrams( string field )      // could add field name as argument and prefix all features
{
    var features = new List<string>();
    string[] tokens = field.Split( spaceCharArr, System.StringSplitOptions.RemoveEmptyEntries );

    features.Add( string.Join( "", field.Split( SPLIT_CHARS ) ) ); // the entire field as a single feature

    string unigram = string.Empty, bigram = "^", previous1 = "^", previous2 = "^", trigram;

    for (int i = 0; i < tokens.Length; i++)
    {
         unigram = tokens[ i ];
         features.Add( unigram );

         bigram = previous1 + "_" + unigram;                        // initial bigram is "^_tokens[0]"
         features.Add( bigram );

         if ( i >= 1 ) { trigram = previous2 + "_" + bigram; features.Add( trigram ); } // initial trigram is "^_tokens[0]_tokens[1]"

         previous2 = previous1;
         previous1 = unigram;
    }
    features.Add( unigram + "_$" );
    features.Add( bigram + "_$" );        // last trigram is "tokens[tokens.Length-2]_tokens[tokens.Length-1]_$"

    return features;
                                                                                                              24
}
The Art of Feature Engineering:
               Disjunctive Features
• Useful feature = triggers often and with a particular class.
• Rarely occurring (but indicative of a class) features can be
  combined in a disjunction. This results in:
     – Need for less data to achieve good performance.
     – Final system performance (with all available data) is higher.
• How can we get insights about such features: Error analysis!

Regex ITALIAN_FOOD = new Regex(@"al dente|agnello|alfredo|antipasti|antipasto|arrabbiata|bistecca|bolognese|
branzino|caprese|carbonara|carpaccio|cioppino|cozze|fettuccine|filetto|focaccia|frutti di mare|funghi|
gnocchi|gorgonzola|insalata|lasagna|linguine|linguini|macaroni|minestrone|mozzarella|ossobuco|panini| panino|
parmigiana|pasticcio|pecorino|penne|pepperoncini|pesce|pesto|piatti|piatto|piccata|polpo|pomodori|prosciutto|
radicchio|ravioli|ricotta|rigatoni|risotto|saltimbocca|scallopini|scaloppini|spaghetti|tagliatelle|tiramisu|
tortellini|vitello|vongole");

if (ITALIAN_FOOD.Match(entity.description).Success) features.Add("Italian_Food_Matched_Description");


                 Triggering of the feature.                           Up to us how we call the feature.
                                                                                                          25
Generic Nature of ML Systems

 human sees




                                            Indices of (binary) features that trigger.


                instance( class= 7, features=[0,300857,100739,200441,...])
computer “sees” instance( class=99, features=[0,201937,196121,345758,13,...])
                instance( class=42, features=[0,99173,358387,1001,1,...])
                ...
                                                    Number of features that trigger for individual
                                                    instances are often not the same.        26
     Default feature always triggers.
Training Data

                Instance w/ outcome.




                                27
Feature Selection
•   Templates: powerful way to get lots of features.
•   We get too many features.                      e.g., 20M for dependency parsing.

•   Danger of overfitting.        Doing well on seen data but poorly on unseen data.

•   Feature selection:            Automatic ways of finding discriminative features.

     –   CountCutOff.
     –   TFxIDF.
     –   Mutual information.
     –   Information gain.
     –   Chi square.                        We will examine in detail the implementation of this.




                                                                                               28
Mutual Information




                     29
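For reference, the standard mutual information between a (binary) feature indicator X and the class variable Y, which is presumably the score meant here, is:

I(X;Y) = \sum_{x \in \{0,1\}} \sum_{y} p(x,y)\, \log \frac{p(x,y)}{p(x)\, p(y)}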
Information Gain
Balances effects of feature triggering for an object with
the effects of feature being absent for an object.




                                                            30
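For reference, the information-gain score of a feature t for class variable C, balancing the feature being present and absent exactly as described above, is usually written:

G(t) = H(C) - \big[\, P(t)\, H(C \mid t) + P(\bar{t})\, H(C \mid \bar{t}) \,\big]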
Chi Square




float Chi2(int a, int b, int c, int d) {
   // Note: '^' is XOR in C#, so the square must be written out explicitly.
   // The intermediate integer products can overflow -- see the log trick on the next slide.
   return (a+b+c+d) * ((a*d-b*c) * (a*d-b*c)) / (float)((a+b)*(a+c)*(c+d)*(b+d));
}




                                                                  31
Exponent(Log) Trick
  While the final output may not be big, intermediate results are. Solution:




float Chi2(int a, int b, int c, int d)
{
   return
   (a+b+c+d) * ((a*d-b*c) * (a*d-b*c)) /
      (float)((a+b)*(a+c)*(c+d)*(b+d));
}



float Chi2_v2(int a, int b, int c, int d)
{
   double total = a + b + c + d;
   double n = Math.Log(total);
   double num = 2.0 * Math.Log(Math.Abs((double)a * d - (double)b * c));
   double den = Math.Log(a + b) + Math.Log(a + c) + Math.Log(c + d) + Math.Log(b + d);
   return (float) Math.Exp(n+num-den);
                                                                                         32
}
Chi Square: Score per Feature




                                33
Chi Square Feature Selection
int[]     featureCounts   =   new int[ numFeatures ];
int       numLabels       =   labelIndex.Count;
int[]     classTotals     =   new int[ numLabels ];              // instances with that label.
float[]   classPriors     =   new float[ numLabels ];            // class priors: classTotals[label]/numInstances.
int[,]    counts          =   new int[ numLabels, numFeatures ]; // (label,feature) co-occurrence counts.
int       numInstances    =   instances.Count;

...                  Do a pass over the data and collect above counts.
float[] weightedChiSquareScore = new float[ numFeatures ];
for (int f = 0; f < numFeatures; f++)    // f is a feature index
{
     float score = 0.0f;
     for (int labelIdx = 0; labelIdx < numLabels; labelIdx++)
     {
            int a = counts[ labelIdx, f ];
            int b = classTotals[ labelIdx ] - a;
            int c = featureCounts[ f ] - a;
            int d = numInstances - ( a + b + c );
            if (a >= MIN_SUPPORT && b >= MIN_SUPPORT) {     // MIN_SUPPORT = 5
                         score += classPriors[ labelIdx ] * Chi2( a, b, c, d );
            }
       }
                                                           Weighted average across all classes.
       weightedChiSquareScore[ f ] = score;
}                                                                                                            34
⇒ Summary: Encoding
• Object representation is crucial.
• Humans: good at suggesting features (templates).
• Computers: good at filtering (feature selection).

     The system designer does not have to worry about which feature is more
     important or useful, and the job is left to the learning algorithm to assign
     appropriate weights to the corresponding features. The system designer’s job
     is to define a set of features that is large enough to represent most of the
     useful information, yet small enough to be manageable for the algorithms and
     the infrastructure.



• Feature engineering: Ensuring systems use the “right”
  features.
                                                                                    35
Roadmap
•   Examples of applications of Machine Learning
•   Encoding objects with features
•   The Machine Learning framework
•   Linear models
     – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
     – Classification Decision Trees, Regression Trees
• Boosting
     – AdaBoost
• Ranking evaluation
     – Kendall tau and Spearman’s coefficient
• Sequence labeling
     – Hidden Markov Models (HMMs)

                                                                                  36
MACHINE LEARNING
GENERAL FRAMEWORK




                    37
Machine Learning: Representation
Complex decision making:

                                               prediction
                                               (response/dependent variable).
     input/independent variable
                                               Can be qualitative/quantitative
                                               (classification/regression).
                                  classifier




                                                                       38
Notation




           39
Machine Learning



                                  object encoded with features



    Offline
                         Online
   Training      Model            classifier
                         System
  Sub-system

TRAINING
                                  prediction
                                  (response/dependent variable)

                                                        40
Classes of Learning Problems
• Classification: Assign a category to each item (Chinese |
  French | Indian | Italian | Japanese restaurant).
• Regression: Predict a real value for each item (stock/currency
  value, temperature).
• Ranking: Order items according to some criterion (web search
  results relevant to a user query).
• Clustering: Partition items into homogeneous groups
  (clustering twitter posts by topic).
• Dimensionality reduction: Transform an initial representation
  of items into a lower-dimensional representation while
  preserving some properties (preprocessing of digital images).

                                                              41
ML Terminology
•   Examples: Items or instances used for learning or evaluation.
•   Features: Set of attributes represented as a vector associated with an example.
•   Labels: Values or categories assigned to examples. In classification the labels are categories; in
    regression the labels are real numbers.
•   Target: The correct label for a training example. This is extra data that is needed for supervised
    learning.
•   Output: The label predicted from the input features using the model learned by the machine learning algorithm.
•   Training sample: Examples used to train a machine learning algorithm.
•   Validation sample: Examples used to tune parameters of a learning algorithm.
•   Model: Information that the machine learning algorithm stores after training. The model is used
    when predicting the output labels of new, unseen examples.
•   Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is
    separate from the training and validation data and is not made available in the learning stage.
•   Loss function: A function that measures the difference/loss between a predicted label and a true
    label. We will design the learning algorithms so that they minimize the error (cumulative loss across
    all training examples).
•   Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels. The
    learning algorithm chooses one function among those in the hypothesis set to return after training.
    Usually we pick a class of functions (e.g., linear functions) parameterized by a set of free parameters
    (e.g., coefficients of the linear function) and pinpoint the final hypothesis by identifying the
    parameters that minimize the error.
•   Model selection: Process for selecting the free parameters of the algorithm (actually of the function
    in the hypothesis set).                                                                              42
Classification




                       Yes, this is mysterious at this point.



   [Scatter plot: positive (+) and negative (−) training examples in the plane,
    separated by a decision boundary.]
                                                                                43
Multi-Class Classification




                             44
One-Versus-All (OVA)
         For each category in turn, create a binary classifier
         where an instance in the data belonging to the
         category is considered a positive example, all other
         examples are considered negative examples.

         Given a new object, run all these binary classifiers
         and see which classifier has the “highest
         prediction”.

         The scores from the different classifiers need to be
         calibrated!




                                                          45
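A minimal C# sketch of one-versus-all prediction in the style of the other code in this deck (the method name and array layout are illustrative assumptions, not the deck's own API); binaryModels[k] is assumed to hold the weight vector of the binary classifier trained for class k, and the raw scores are assumed comparable, i.e., calibrated as the slide warns:

int PredictOneVersusAll(float[][] binaryModels, float[] features)
{
    int bestClass = 0;
    float bestScore = float.NegativeInfinity;
    for (int k = 0; k < binaryModels.Length; k++)
    {
        float score = 0.0f;                                   // score_k = w_k . x
        for (int j = 0; j < features.Length; j++)
            score += binaryModels[k][j] * features[j];
        if (score > bestScore) { bestScore = score; bestClass = k; }
    }
    return bestClass;                                         // classifier with the "highest prediction" wins
}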
One-Versus-One (OVO)
           For each pair of classes, create binary classifier
           on data labeled as either of the classes.


            How many such classifiers? For K classes: K(K-1)/2.


           Given a new instance run all classifiers and
           predict class with maximum number of wins.




                                                       46
Errors
“Nobody is perfect, but then again, who wants to be nobody.”




                                             Average error across all instances.
                                             Goal: Minimize the Error.
                                             Beneficial to have differentiable loss function.




                                #misclassified examples
                                (penalty score of 1 for every misclassified example).
                                                                                        47
Error: Function of the Parameters




The cumulative error across all instances is a function of the parameters.


           1

           2




                                                                             48
Evaluation
• Motivation:
  – Benchmark algorithms (which system is better).
  – Tuning parameters during training.




                                                     49
Evaluation Measures

GeneralizationError: Probability to misclassify an instance selected according
to the distribution of the labeled instance space



TrainingError: Percentage of training examples which are incorrectly classified.

                                                Optimistically biased estimate especially
                                                if the inducer over-fits the (training) data.



Empirical estimation of the generalization error:
• Heldout method
• Re-sampling:
    1. Random resampling
    2. Cross-validation

                                                                                                50
Precision, Recall and F-measure
                                                                                  General Setup
Let’s consider binary classification:

                                                              Space of all instances



                                                                System identified these as
                                                                negative and got them correct
                                                                (true negative).

                      System identified
                      these as positive   System identified   System identified
                      but got them        these as positive   these as negative
                      wrong               but got them        but got them
                      (false positive).   correct             wrong
                                          (true positive).    (false negative).




     Instances identified as                                                           Positive instances in reality.
     positive by the system.
                                                                                                                  51
Definitions


         Accuracy, Precision, Recall,
              and F-measure

                                              TN: true negatives   Precision:
 FP: false positives

                       TP:
                       true positives

                                        FN: false negatives
                                                                   Recall:




                                                                   F-measure:   Harmonic mean of
Accuracy:                                                                       precision and recall



                                                                                                52
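For reference, the standard definitions in terms of TP, FP, TN, FN are:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad
\text{Precision} = \frac{TP}{TP + FP} \qquad
\text{Recall} = \frac{TP}{TP + FN} \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}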
Accuracy vs. Prec/Rec/F-meas
Accuracy can be misleading for evaluating a model on data with an imbalanced class distribution. When
there are many more majority-class instances than minority-class instances, always predicting the
majority class gives good accuracy.

Precision and recall (together) are better indicators.

As a single aggregate number, the F-measure is dominated by the lower of precision and recall.




                                                                                       53
Extreme Cases for Precision & Recall
     all instances

                               TN: true negatives

       TP:
       true positive
                      FN: false negatives


   system                               actual




 all instances                      system

FP: false positives



            TP: true positives



                                        actual
                                                    Precision can be traded for recall and vice versa.
                                                                                                  54
Definitions



                      Sensitivity & Specificity
                                                   TN: true negatives


       FP: false positives
                                                                                       [same as recall;
                             TP:                                                       aka true positive rate]
                             true positives

                                              FN: false negatives




                                                                                       [aka true negative rate]




False positive rate:                                                    False negative rate:

                                                                                                      55
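In the same notation, the standard definitions are:

\text{Sensitivity (recall, true positive rate)} = \frac{TP}{TP + FN} \qquad
\text{Specificity (true negative rate)} = \frac{TN}{TN + FP}

\text{False positive rate} = \frac{FP}{FP + TN} \qquad
\text{False negative rate} = \frac{FN}{FN + TP}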
Venn Diagrams
                      These visualization diagrams were introduced by John Venn:

                          John Venn (1880) “On the Diagrammatic and Mechanical
                          Representation of Propositions and Reasonings”, Philosophical
                          Magazine and Journal of Science, 5:10(59).


What if there are three classes?


Four classes?

                                                          With more classes our visual intuitions
                                                          are helping less and less.

                                                          A subtle point: These are just the
                                                          actual/real classes without the system
Six classes?                                              classes drawn on top!


                                                                                           56
Confusion Matrix
Shows how the predictions of instances of an actual class are distributed across all classes.
Here is an example confusion matrix for three classes:



                         Predicted class A       Predicted class B      Predicted class C

                       Number of instances     Number of instances
                       in the actual class A   in the actual class A                         Total number of actual
      Actual class A                                                            …
                       AND predicted as        BUT predicted as                              instances of class A
                       belonging to class A.   belonging to class B.
                                                                                             Total number of actual
      Actual class B            …                       …                       …
                                                                                             instances of class B
                                                                                             Total number of actual
      Actual class C            …                       …                       …
                                                                                             instances of class C
                       Total number of         Total number of         Total number of       Total number of instances
                       instances predicted     instances predicted     instances predicted
                       as class A              as class B              as class C




Counts on the diagonal are the true positives for each class. Counts not on the diagonal are errors.
Confusion matrices can handle many classes.                                                   57
Confusion Matrix:
        Accuracy, Precision and Recall
Given a confusion matrix, it’s easy to compute accuracy, precision and recall:


                        Predicted class A   Predicted class B     Predicted class C
       Actual class A          50                  80                     70             200
       Actual class B          40                 140                     120            300
       Actual class C         120                 220                     160            500
                              210                 440                     350            1000




                                                                                                             58
                                                        Confusion matrices can, themselves, be confusing sometimes 
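A quick worked example from the table above (taking class A as the "positive" class for precision/recall):

Accuracy      = (50 + 140 + 160) / 1000 = 0.35
Precision(A)  = 50 / 210 ≈ 0.24   (diagonal cell divided by the column total for predicted class A)
Recall(A)     = 50 / 200 = 0.25   (diagonal cell divided by the row total for actual class A)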
Roadmap
•   Examples of applications of Machine Learning
•   Encoding objects with features
•   The Machine Learning framework
•   Linear models
     – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
     – Classification Decision Trees, Regression Trees
• Boosting
     – AdaBoost
• Ranking evaluation
     – Kendall tau and Spearman’s coefficient
• Sequence labeling
     – Hidden Markov Models (HMMs)

                                                                                  59
LINEAR MODELS
Why?– Linear models are a good way to learn about core ML concepts.




                                                                    60
Refresher: Vectors
                        vector                     vector



             point               point                        vector



                                                points are also vectors.


                                                            Equation of the line.
                                                            Can be re-written as:




sum of vectors



                                         vector notation




                                                                               61
                                                                           transpose
Refresher: Vectors (2)


                                             Equation of the line.
                                             Can be re-written as:


Normal vector.




                           vector notation




                                                                     62
Refresher: Dot Product




           float DotProduct(float[] v1, float[] v2) {
              float sum = 0.0f;
              for (int i = 0; i < v1.Length; i++) sum += v1[i] * v2[i];
              return sum;
           }                                               63
Refresher: Pos/Neg Classes

                 +




Normal vector.




                     −




                             64
sgn Function
    In mathematics:




     We will use:




          Informally drawn as:
                                 65
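The two variants can be spelled out as follows; the collapsed two-valued form is the usual convention for perceptron-style classifiers (an assumption about the deck's exact choice):

sgn(x) = +1 if x > 0,   0 if x = 0,   -1 if x < 0          (mathematical definition)
sgn(x) = +1 if x >= 0,  -1 otherwise                        (two-valued variant used for classification)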
Two Linear Models
The features of an object have associated weights indicating their importance.

Signal:


          Perceptron                               Linear regression




                                                                                 66
Why “Regression”?
Why is the term for quantitative output prediction "regression"?

     “That same year [1875], [Francis] Galton decided to recruit some of his friends for an experiment with
     sweet peas. He distributed seeds among seven of them, asking them to plant the seeds and return the
     offspring. Galton measured the baby seeds and compared their diameters to those of their parents. He
     noticed a phenomenon that initially seems counter-intuitive: the large seeds tended to produce smaller
     offspring, and the small seeds tended to produce larger offspring. A decade later he analyzed data from his
     anthropometric laboratory and recognized the same pattern with human heights. After measuring 205
     pairs of parents and their 928 adult children, he saw that exceptionally tall parents had kids who were
     generally shorter than they were, while exceptionally short parents had children who were generally taller
     than their parents.


     After reflecting upon this, we can understand why it must be the case. If very tall parents always produced
     even taller children, and if very short parents always produced even shorter ones, we would by now have
     turned into a race of giants and midgets. Yet this hasn't happened. Human populations may be getting
     taller as a whole – due to better nutrition and public health – but the distribution of heights within the
     population is still contained.


     Galton called this phenomenon ‘regression towards mediocrity in hereditary stature’. The concept is now
     more generally known as regression to the mean.”
                                                                                              [A.Bellos pp.375]
                                                                                                             67
On-Line (Sequential) Learning
• On-line = process one example at a time.
• Attractive for large scale problems.



                                                             iteration (epoch/time).




                                                            Compute loss.




                                  Objective: Minimize cumulative loss:
    return parameters

                                                                                       68
On-Line (Sequential) Learning (2)
Sometimes written out more explicitly:




                                                             # passes over the data.

                                                                 for each data item.




  return parameters                      return parameters


                                                                            69
Perceptron




        Linearly separable data:                                                 Non-linearly separable data:

        [Left: + and − points that a single straight line separates cleanly.
         Right: interleaved + and − points that no straight line can separate.]


                                                                                                                                                     70
First: Perceptron Update Rule

Simplification:
Lines pass through origin.




             [Figure: the update rule illustrated with a misclassified point and the weight vector before and after the update.]
                                                    71
On-Line (Sequential) Learning




                                72
Perceptron Learning Algorithm


                  iteration (epoch/time).




                  return parameters




                                            73
Perceptron Learning Algorithm


                  (algorithm makes multiple passes over data.)




                  return parameters




                                                         74
Perceptron Learning Algorithm (PLA)


        while( mis-classified examples exist ):
                                                                                       Misclassified example means:
                                                                                       With the current weights


                    Update weights:




1.   A challenge: Algorithm will not terminate for non-linearly separable data (outliers, noise).
2.   Unstable: jump from good perceptron to really bad one within one update.
3.   Attempting to minimize:
                                                                            NP-hard.
                                                                                                             75
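A minimal C# sketch of the PLA described above, in the spirit of the deck's other code (the method name and data layout are illustrative assumptions); x holds feature vectors, y holds labels in {-1, +1}, DotProduct is the helper from the earlier refresher slide, and a cap on epochs stands in for the non-termination issue noted in point 1:

float[] TrainPerceptron(float[][] x, int[] y, int maxEpochs)
{
    int d = x[0].Length;
    float[] w = new float[d];                          // weights start at zero
    for (int epoch = 0; epoch < maxEpochs; epoch++)
    {
        bool anyMistake = false;
        for (int i = 0; i < x.Length; i++)
        {
            int prediction = DotProduct(w, x[i]) >= 0.0f ? +1 : -1;
            if (prediction != y[i])                    // misclassified with the current weights
            {
                for (int j = 0; j < d; j++)
                    w[j] += y[i] * x[i][j];            // update rule: w <- w + y * x
                anyMistake = true;
            }
        }
        if (!anyMistake) break;                        // no misclassified examples left
    }
    return w;
}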
Perceptron


Weight update:




                              76
Looks Simple – Does It Work?
Margin-based upper bound on updates:


            Number of updates by the Perceptron Algorithm
Fact:

            where:                                          Remarkable:
                                                            Does not depend on
                                                            dimension of feature
                                                            space!




                                                                        77
Compact Model Representation
Use float instead of double:


Store only non-zero weights (and indices):

Store non-zero weights and diff of indices:


void Save( StreamWriter w, int labelIdx, float[] weights )
{
     w.Write( labelIdx );
     int previousIndex = 0;
     for (int i = 0; i < weights.Length; i++)                   Difference of indices.
     {
         if (weights[ i ] != 0.0f) {
              w.Write( " " + (i - previousIndex) + " " + weights[ i ] );
              previousIndex = i;
         }
     }
                                            Remember last index where the weight was non-zero .
}
                                                                                                  78
Linear Classification Solutions



                                         Different solutions (infinitely many)

       [Scatter of + and − examples with several different separating lines drawn;
        each of them classifies the training data correctly.]




                                                                                  79
The Pocket Algorithm
A better perceptron algorithm:
     Keep track of the error and update weights when we lower the error.




                                                                     Compute error. Expensive step!
                                                                     Access to the entire data needed!



                                                                   Only update the best weights
                                                                   if we lower the error!



                                                                                           80
Voted Perceptron
•   Training as the usual perceptron algorithm (with some extra book-keeping).
•   Decision rule:




                                                                          iterations


                                                                                 81
Dual Perceptron: Intuitions

        [Figure: a cluster of + points and a cluster of − points, the separating line between them,
         and its normal vector.]
                                                                                     82
Dual Perceptron


                 (algorithm makes multiple passes over data.)




                 return parameters




Decision rule:
                                                        83
Exclusive OR (XOR) Function
Truth table:                            Inputs in and color-coding
                                                    of the output:




   Challenge:
   The data is not linearly separable                                ???
   (no straight line can be drawn
   that separates the green from the
   blue points).



                                                                           84
Solution for the Exclusive OR (XOR)
We introduce
another input
dimension:




                Now the data is linearly separable:




                                                      85
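One common way to realize the extra input dimension (an assumption here; the slide's exact mapping was in the figure) is to add the conjunction of the two inputs as a third feature:

x_3 = x_1 \wedge x_2 \;(= x_1 x_2 \text{ for } x_i \in \{0,1\}), \qquad
\operatorname{XOR}(x_1, x_2) = x_1 + x_2 - 2 x_3

which is a linear function of (x_1, x_2, x_3), so thresholding it at 0.5 gives a separating plane in the new space.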
Winnow Algorithm

  iteration (epoch).




                  Normalizing constant.



                         Multiplicative
                         update.


   return parameters

                                          86
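A minimal C# sketch of the classic (unnormalized) Winnow update for comparison with the perceptron; the multiplicative factor alpha, the threshold theta, and the assumption of binary features in {0,1} with labels in {-1,+1} are illustrative choices, not necessarily the exact variant on the slide:

float[] TrainWinnow(float[][] x, int[] y, int epochs)
{
    int d = x[0].Length;
    float[] w = new float[d];
    for (int j = 0; j < d; j++) w[j] = 1.0f;           // weights start at 1
    float theta = d;                                   // a common choice of threshold
    const float alpha = 2.0f;                          // promotion/demotion factor
    for (int e = 0; e < epochs; e++)
        for (int i = 0; i < x.Length; i++)
        {
            int prediction = DotProduct(w, x[i]) >= theta ? +1 : -1;
            if (prediction != y[i])
                for (int j = 0; j < d; j++)
                    if (x[i][j] != 0.0f)               // only weights of active features change
                        w[j] *= (y[i] > 0) ? alpha : 1.0f / alpha;   // multiplicative update
        }
    return w;
}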
Training, Test Error and Complexity


                           Test error

                           Training error

                           Model complexity




                                              87
Logistic Regression
                                           Target:



                                           Data does not give the
                                           probability explicitly:




Logistic function:




                                                               88
Logistic Regression




Data likelihood:

Negative log-likelihood:




Error:
                                           89
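Spelling out the pieces named on the last two slides, with \sigma denoting the logistic function and labels y_i \in \{0,1\}:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x})

\text{Data likelihood: } \prod_i \sigma(\mathbf{w}^\top \mathbf{x}_i)^{y_i} \big(1 - \sigma(\mathbf{w}^\top \mathbf{x}_i)\big)^{1 - y_i}

\text{Error (negative log-likelihood): } E(\mathbf{w}) = -\sum_i \Big[ y_i \ln \sigma(\mathbf{w}^\top \mathbf{x}_i) + (1 - y_i) \ln\big(1 - \sigma(\mathbf{w}^\top \mathbf{x}_i)\big) \Big]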
Derivative:
                                                              Refresher


                                                                      Chain rule:




Partial derivative:




Gradient (derivatives with respect to each component):



                                                         This is a vector and we
Gradient of the error:                                   can compute it at a point.   90
Hypothesis Space




               Weight space/hyperplane.
                                          91
                              [graph from T.Mitchell]
Math Fact
The gradient of the error:




(a vector in weight space) specifies the direction of the argument that leads to the
steepest increase for the value of the error.

The negative of the gradient gives the direction of the steepest decrease.




                                                        Negative gradient (see next slides).




                                                                                               92
Computing the Gradient

Because gradient is a linear operator.




                                         93
(Batch) Gradient Descent



repeat
         Compute gradient:



         Update weights:


                             max #iterations;
                             marginal error improvement; and
                             small value for the error.

                                                               94
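A minimal C# sketch of batch gradient descent for the logistic-regression error above (the names, the fixed learning rate eta, and the simple iteration-count stopping rule are illustrative simplifications of the stopping criteria listed on the slide); DotProduct is the earlier helper and labels are 0/1:

float Sigmoid(float z) { return 1.0f / (1.0f + (float)Math.Exp(-z)); }

float[] TrainLogisticRegression(float[][] x, int[] y, float eta, int maxIter)
{
    int d = x[0].Length;
    float[] w = new float[d];
    for (int iter = 0; iter < maxIter; iter++)
    {
        float[] gradient = new float[d];
        for (int i = 0; i < x.Length; i++)
        {
            float error = Sigmoid(DotProduct(w, x[i])) - y[i];    // predicted probability minus actual label
            for (int j = 0; j < d; j++) gradient[j] += error * x[i][j];
        }
        for (int j = 0; j < d; j++) w[j] -= eta * gradient[j];    // step along the negative gradient
    }
    return w;
}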
Punch Line




The new object is in the class if:

                                     classification rule.




                                                            95
Newton’s Method




  [Plot illustrating Newton's method iterations.]
                                             96
Newton-Raphson




                 97
Robust Risk Minimization
Notation:
                        input vector
                        label
                        training examples
                        weight vector
                        bias
                        continuous linear model


Prediction rule:




Classification error:

                                                  98
Robust Classification Loss
Parameter estimation:




Hinge loss:




Robust classification loss:




                                       99
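For reference: with the continuous output p = \mathbf{w}^\top \mathbf{x} + b and a label y \in \{-1, +1\}, the standard hinge loss is

\text{hinge}(p, y) = \max(0,\; 1 - p\, y)

and the robust classification loss used in RRM is, roughly, a truncated variant of this that caps the penalty for badly misclassified points so that outliers do not dominate the total risk.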
Loss Functions: Comparison




                             100
Confidence and Regularization
Confidence




Regularization:




Unconstrained optimization (Lagrange multiplier):




                           smaller λ corresponds to a larger A.   101
Robust Risk Minimization



            Go over the training data.




                                         102
Learning Curve
[Learning-curve plot: evaluation metric (y-axis, 0-100) vs. percentage of the training data used for each
experiment (x-axis, 0%-100%); e.g., the experiment with 50% of the training data yields an evaluation
number of 70.]

•   Plots an evaluation metric against the fraction of training data used (on the same test set!).
•   Highest performance is bounded by human inter-annotator agreement (ITA).
•   The leveling-off effect can guide us how much data is needed.
                                                                                                           103
Summary
•   Examples of ML
•   Categorization
•   Object encoding
•   Linear models:
    –   Perceptron
    –   Winnow
    –   Logistic Regression
    –   RRM
• Engineering aspects of ML systems

                                      104
PART II: POPULARITY
                 105
Goal
• Quantify how popular an entity is.



Motivation:
• Used in the new local search relevance metric.




                                              106
What is popularity?




                      107
POPULARITY IN LOCAL SEARCH
                         108
Popularity
•   Output a popularity score (regression)
•   Ensemble methods
•   Tree-based procedure (non-linear)
•   Boosting




                                             109
When is a Local Entity Popular?
• Definition:
       Visited by many people in the context of alternative choices.


• Is the popularity of restaurants the same as the popularity of
  movies, etc.?
• How to operationalize “visit”, “many”, “alternative choices”?
   – Initially we are using: popular means clicked more.


• Going forward we will use:
   – “visit” = click given an impression.
   – “choice” = density of entities in the same primary category.
   – “many” = fraction of clicks from impressions.                  110
Local Entity Popularity




The model then will be regression:




                                      111
Not all Clicks are Born the Same
• Click in the context of a named query:
   – Can even be argued we are not satisfying the user
     information needs (and they have to click further to find out
     what they are looking for).
• Click in the context of a category query:
   – Much more significant (especially when alternative results
     are present).




                                                               112
Local Entity Popularity
•   Popularity & 1st page, current ranker.
•   Entities without URL.
•   Newly created entities.
•   Clicks vs. mouseovers.
•   Scenario: 50 French restaurants; best entity
    has 2k clicks. 2 Italian restaurants; best entity
    has 2k clicks. The French entity is more
    popular because of higher available choice.
                                                        113
Entity Representation



9000     8000   …    4000     65      4.7      73          …   1   …


Target                              feature values



                    Machine learning (training) instance



                                                                       114
POISSON REGRESSION
Why?– We will practice the ML machinery on a different problem, re-iterating the concepts.
Poisson regression is an example of log-linear models good for modeling counts (e.g., number
of visitors to a store in a certain time).


                                                                                          115
Setup

                                                            explanatory variables


response/outcome
          variable

                                         These counts for our scenario are the clicks on the web page.




           A good way to model counts of observations is using the Poisson distribution.




                                                                                                         116
Poisson Distribution: Preliminaries
The Poisson distribution realistically describes the pattern of requests over time in many client-server
situations.

Examples are: incoming customers at a bank, calls into a company’s telephone exchange, requests for
storage/retrieval services from a database server, and interrupts to a central processor. It also has higher-
dimensional applications, such as the spatial distribution of defects on integrated circuit wafers and the
volume distribution of contaminants in well water. In such cases, the “events”, which are request arrivals
or defects occurrences, are independent. Customers do not conspire to achieve some special pattern in
their access to a bank teller; rather they operate as independent agents. The manufacture of hard disks
or integrated circuits introduces unavoidable defects because the process pushes the limits of geometric
tolerances. Therefore, a perfectly functional process will still occasionally produce a defect, such as a
small area on the disk surface where the magnetic material is not spread uniformly or a shorted
transistor on an integrated circuit chip. These errors are independent in the sense that a defect at one
point does not influence, for better or worse, the chance of a defect at another point. Moreover, if the
time interval or spatial area is small, the probability of an event is correspondingly small. This is a
characterizing feature of a Poisson distribution: event probability decreases with the window of
opportunity and is linear in the limit. A second characterizing feature, negligible probability of two or
more events in a small interval, is also present in the mentioned examples.



                                                                                                        117
Poisson Distribution: Formally




                                 118
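Concretely, a Poisson-distributed count y with rate \lambda has

P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}, \quad y = 0, 1, 2, \ldots, \qquad \mathbb{E}[Y] = \operatorname{Var}[Y] = \lambda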
Poisson Distribution: Mental Steps




                                              This comes from the theory of Generalized Linear Models (GLM).




log           linear combination of the input features.

      Hence, the name log-linear model.
                                                                                                               119
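In symbols, the log link ties the Poisson rate for instance i to the linear combination of its features:

\log \lambda_i = \mathbf{w}^\top \mathbf{x}_i \quad\Longleftrightarrow\quad \lambda_i = e^{\mathbf{w}^\top \mathbf{x}_i}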
Poisson Distribution


Data likelihood:

Log-likelihood:




                                          120
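Written out for training pairs (\mathbf{x}_i, y_i), with \lambda_i = e^{\mathbf{w}^\top \mathbf{x}_i}:

L(\mathbf{w}) = \prod_i \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!}, \qquad
\ell(\mathbf{w}) = \sum_i \big[\, y_i\, \mathbf{w}^\top \mathbf{x}_i - e^{\mathbf{w}^\top \mathbf{x}_i} - \log(y_i!) \,\big]

The \log(y_i!) term does not depend on \mathbf{w} and can be dropped when maximizing.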
Maximizing the Log-Likelihood




                                121
Roadmap
•   Examples of applications of Machine Learning
•   Encoding objects with features
•   The Machine Learning framework
•   Linear models
     – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
     – Classification Decision Trees, Regression Trees
• Boosting
     – AdaBoost
• Ranking evaluation
     – Kendall tau and Spearman’s coefficient
• Sequence labeling
     – Hidden Markov Models (HMMs)

                                                                                  122
DECISION TREES
Why? – DTs are an influential development in ML. Combined in ensembles, they provide very competitive
performance. We will see ensemble techniques in the next part.




                                                                                                     123
Decision Trees
                                  Training instances.
                                  Color reflects output variable
                                  (classification example).



                                        Binary partitioning of the data during training
                                        (navigating to leaf node during testing).




      prediction
                                                                    Training instances are
                                                                    more homogeneous
                                                                    in terms of the output variable
                                                                    (more pure) compared to
                                                                    ancestor nodes.

 Stop splitting when the instances are homogeneous
     or when only a small number of instances remain.                                            124
Decision Tree: Example
                                     (classification example with categorical features)




                                                                              Attribute/feature/predicate

                  Parents
                  Visiting
          Yes                 No                                              Value of the attribute


   Cinema                      Weather
                                                                              Branching factor depends on
                                                                              the number of possible values
          Sunny                Windy          Rainy                           for the attribute (as seen in the
                                                                              training set).
 Play                                                Stay in
tennis             Money

         Rich                 Poor                                            Predicted classes.

     Shopping                Cinema                                                                      125
Entropy   (needed for describing how an attribute is selected.)




Example: binary entropy.
[Plot: H(p) for p from 0 to 1; the curve is 0 at p = 0 and p = 1 and peaks at 1 bit at p = 0.5.]




                                                                     126
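The entropy formula itself did not survive this export; the standard definition used for attribute
selection (notation mine) is:

   H(S) = - \sum_{c} p_c \log_2 p_c ,  where p_c is the proportion of instances in S with class c.

   For two classes: H(S) = -p \log_2 p - (1-p) \log_2 (1-p), which is the curve plotted above.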
Selecting an Attribute: Information Gain

Measure of expected reduction in entropy.


      instances   attribute




                                                                          127
                                            See Mitchell’97, p.59 for an example.
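A reconstruction of the standard information-gain formula the slide refers to (notation mine):

   Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, H(S_v)

   where S_v is the subset of instances of S for which attribute A has value v.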
Splitting ‘Hairs’



                If there are no instances in the
                current node, inherit statistics
                (majority class) from parent
                node.



                     If there are only a small number
                     of instances do not split the node
                     further (statistics are unreliable).



                     If there is more training data, the
                     tree can be “grown” bigger.
                                                                              128
ID3 Algorithm




                129
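The ID3 pseudo-code on this slide is image-only in this export. Below is a minimal Python sketch of
ID3 for categorical features, under my own simplifying assumptions (no pruning, majority class at
impure leaves with no attributes left); it is illustrative, not the deck's exact algorithm.

import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c log2 p_c over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected reduction in entropy from splitting on `attr`."""
    n = len(labels)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """rows: list of dicts (attribute -> value); labels: list of classes."""
    if len(set(labels)) == 1:          # pure node
        return labels[0]
    if not attrs:                      # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        tree[best][value] = id3(sub_rows, sub_labels, [a for a in attrs if a != best])
    return tree

# Toy usage with attributes borrowed from the earlier example slide:
rows = [{'Parents': 'Yes', 'Weather': 'Sunny', 'Money': 'Rich'},
        {'Parents': 'No',  'Weather': 'Sunny', 'Money': 'Rich'},
        {'Parents': 'No',  'Weather': 'Windy', 'Money': 'Poor'},
        {'Parents': 'No',  'Weather': 'Rainy', 'Money': 'Poor'}]
labels = ['Cinema', 'Play tennis', 'Cinema', 'Stay in']
print(id3(rows, labels, ['Parents', 'Weather', 'Money']))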
Alternative Attribute Selection:
                         Gain Ratio             [Quinlan 1986]

                        instances   attribute




Examples:




an attribute where every instance has a different value (e.g., an ID-like attribute).
                                                          130
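A reconstruction of the gain-ratio definition (notation mine): it normalizes information gain by the
split information, which penalizes attributes with many values, the extreme case being the
all-different-values attribute mentioned above.

   SplitInfo(S, A) = - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

   GainRatio(S, A) = \frac{Gain(S, A)}{SplitInfo(S, A)}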
Alternative Attribute Selection:
           GINI Index    [Corrado Gini: Italian statistician]




                                                      131
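The Gini index formula is also image-only here; the standard form (notation mine) is:

   Gini(S) = 1 - \sum_{c} p_c^2

   and a split on attribute A is scored by the weighted Gini of the children:
   Gini_{split}(S, A) = \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Gini(S_v)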
Space of Possible Decision Trees




                      Number of possible trees:




                                                  132
Decision Trees and Rule Systems
Path from each leaf node to the root represents a conjunctive rule:



                                                        if  (ParentsVisiting==No) &
                  Parents                                   (Weather==Windy) &
                  Visiting                                  (Money==Poor)
                                                        then
          Yes                 No                            Cinema.

   Cinema                      Weather

          Sunny                Windy     Rainy
 Play                                        Stay in
tennis             Money

         Rich                 Poor

     Shopping                Cinema                                             133
Decision Trees
• Different training sample -> different resulting
  tree (different structure).
• Learning does (conditional) feature selection.




                                                 134
Regression Trees
                          Like classification trees but the
                          prediction is a number
                          (as suggested by “regression”).

                          1. How do we split?
                          2. When to stop?




 predictions
(constants)

                                                         135
Regression Trees: How to Split




                                 136
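A hedged reconstruction of the usual regression-tree split criterion (the slide's formula is
image-only): choose the split that minimizes the residual sum of squares of the two children, each
child predicted by its mean.

   (j^*, s^*) = \arg\min_{j, s} \Bigl[ \sum_{x_i \in R_L(j,s)} (y_i - \bar{y}_L)^2
                                     + \sum_{x_i \in R_R(j,s)} (y_i - \bar{y}_R)^2 \Bigr]

   where R_L(j, s) = \{x : x_j \le s\}, R_R(j, s) = \{x : x_j > s\}, and \bar{y}_L, \bar{y}_R are the child means.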
Regression Trees: Pruning
Tree operation where a pre-terminal gets its two leaves collapsed:




                                                                     137
Regression Trees: How to Stop
1.   Don’t stop.
2.   Build big tree.
3.   Prune.
4.   Evaluate sub-trees.




                                      138
Roadmap
•   Examples of applications of Machine Learning
•   Encoding objects with features
•   The Machine Learning framework
•   Linear models
     – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
     – Classification Decision Trees, Regression Trees
• Boosting
     – AdaBoost
• Ranking evaluation
     – Kendall tau and Spearman’s coefficient
• Sequence labeling
     – Hidden Markov Models (HMMs)

                                                                                  139
BOOSTING


           140
Ensemble Methods
                                object encoded with features
classifiers




                          …
                                                   ENSEMBLE

                          …
                                              predictions
                                              (response/dependent
                                              variable)


                              majority voting/averaging


                                                           141
Where the Systems Come from
Sequential ensemble scheme:




                              …



                                  142
Contrast with Bagging
Non-sequential ensemble scheme:

           DATA




                  Data_i are independent of each other (likewise for System_i).
                                                                       143
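A minimal sketch of the non-sequential (bagging) scheme described above, under my own assumptions:
a generic base_procedure (e.g., a decision-tree trainer) is fit on independent bootstrap samples and
the predictions are averaged; the names and the toy base procedure are illustrative.

import numpy as np

def bagging_fit(X, y, base_procedure, n_models=10, seed=0):
    """Fit n_models copies of base_procedure on bootstrap resamples of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    n = len(y)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)      # Data_i: bootstrap sample, independent of the others
        models.append(base_procedure(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Average the individual predictions (majority voting would be used for classification)."""
    return np.mean([m(X) for m in models], axis=0)

# Toy usage with a trivially simple base procedure (predict the training mean):
base_procedure = lambda Xb, yb: (lambda Xq: np.full(len(Xq), yb.mean()))
X = np.arange(20, dtype=float).reshape(-1, 1)
y = X.ravel() + np.random.default_rng(1).normal(size=20)
models = bagging_fit(X, y, base_procedure, n_models=5)
print(bagging_predict(models, X)[:3])   # all predictions are near the mean of y here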
Data             System
                                               Base Procedure:
                                                 Decision Tree
                                      Training instances.
                                      Color reflects output variable
                                      (classification example).

                                            Binary partitioning of the data during training
                                            (navigating to leaf node during testing).




      prediction
                                                                        Training instances are
                                                                        more homogeneous
                                                                        in terms of the output variable
                                                                        (more pure) compared to
                                                                        ancestor nodes.

 Stop splitting when the instances are homogeneous
     or when only a small number of instances remain.                                            144
Ensemble Scheme
TRAINING DATA
                                          base procedure




                                                 base procedure   Small systems.
                                 Original data
                                                                  Don’t need to be
                                                                  perfect.
                                                 base procedure
                             Weighted data


                                                 base procedure
                             Weighted data




 Final prediction (regression)

                                                                            145
AdaBoost (classification)
 Original data

Weighted data

Weighted data

Weighted data




                                      normalizing factor.



                  final prediction.
                                                            146
AdaBoost
 Initializing weights.




                                weight update.

          normalizing factor.


                                             final prediction.
                                                        147
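The weight-update and final-prediction formulas on the two AdaBoost slides are image-only here. As a
hedged illustration of the scheme, below is a minimal NumPy AdaBoost with decision stumps and labels
in {-1, +1}; the stump learner, thresholds, and names are my own choices, not the deck's.

import numpy as np

def train_stump(X, y, w):
    """Weighted decision stump: pick the (feature, threshold, sign) with lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, n_rounds=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initialize weights uniformly
    stumps, alphas = [], []
    for _ in range(n_rounds):
        err, j, thr, sign = train_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak learner
        pred = np.where(X[:, j] <= thr, sign, -sign)
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified instances
        w /= w.sum()                           # normalizing factor
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Final prediction: sign of the alpha-weighted vote of the weak learners."""
    score = np.zeros(len(X))
    for (j, thr, sign), alpha in zip(stumps, alphas):
        score += alpha * np.where(X[:, j] <= thr, sign, -sign)
    return np.sign(score)

# Toy usage: two clusters in 1-D.
X = np.array([[0.], [1.], [2.], [3.], [6.], [7.], [8.], [9.]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
stumps, alphas = adaboost(X, y, n_rounds=5)
print(adaboost_predict(stumps, alphas, X))    # expected to match y on this easy data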
Binary Classifier
• Constraint:
   – Must not have all-zero clicks for the current week, the previous week, and the week before last
     [the shopping team uses a stronger constraint: only instances with non-zero clicks for the
     current week].
• Training:
   – 1.5M instances.
   – 0.5M instances (validation).
• Feature extraction:
   – 4.82mins (Cosmos job).
• Training time:
   – 2hrs 20mins.
• Testing:
   – 10k instances: 1sec.
                                                                                 148
Roadmap
•   Examples of applications of Machine Learning
•   Encoding objects with features
•   The Machine Learning framework
•   Linear models
     – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
     – Classification Decision Trees, Regression Trees
• Boosting
     – AdaBoost
• Ranking evaluation
     – Kendall tau and Spearman’s coefficient
• Sequence labeling
     – Hidden Markov Models (HMMs)

                                                                                  149
POPULARITY
 EVALUATION
          How do we know
we have a good popularity model?

                       150
Rank Correlation Metrics


                                       The two rankings are the same.



                                       The two rankings are reverse of each other.




 Actual input is a set of objects with two rank scores (ties are possible).       151
Kendall’s Tau Coefficient
Considers concordant/discordant pairs in two
rankings (each ranking w.r.t. the other):




                                               152
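A reconstruction of the Kendall tau coefficient without ties (notation mine):

   \tau = \frac{n_c - n_d}{n(n-1)/2}

   where n_c and n_d are the numbers of concordant and discordant pairs among the n(n-1)/2 pairs of
   objects; \tau = 1 when the rankings are identical and \tau = -1 when one is the reverse of the other.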
What is a concordant pair?


           a   a

           b   c

           c   b




                   For a concordant pair, the two rank differences must have the same sign.




                                                153
Kendall Tau: Example
                                           A                                                      C


                                           B                                                      D




                                     C                                                      A


                                    D                                                      B

                    Pairs:
(discordant pairs in red):




                                                                                                                              154
                              Observation: Total number of discordant pairs = 2x the discordant pairs in one ranking w.r.t. the other.
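A minimal brute-force computation of Kendall tau over all pairs, as a sanity check of the counting
above; the four-item rankings are illustrative and not necessarily the exact ones pictured on the slide.

from itertools import combinations

def kendall_tau(rank1, rank2):
    """rank1, rank2: dicts mapping object -> rank position (no ties)."""
    objects = list(rank1)
    n_c = n_d = 0
    for a, b in combinations(objects, 2):
        s1 = rank1[a] - rank1[b]
        s2 = rank2[a] - rank2[b]
        if s1 * s2 > 0:
            n_c += 1          # concordant: same sign in both rankings
        else:
            n_d += 1          # discordant
    n = len(objects)
    return (n_c - n_d) / (n * (n - 1) / 2)

r1 = {'A': 1, 'B': 2, 'C': 3, 'D': 4}
r2 = {'C': 1, 'D': 2, 'A': 3, 'B': 4}    # A and B pushed below C and D
print(kendall_tau(r1, r2))               # 2 concordant, 4 discordant -> -1/3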
Spearman’s Coefficient
 Considers ranking differences for the same object:




                                                      a    a

                                                      b    c

                                                      c   b

Example:




                                                          155
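A reconstruction of Spearman's rank correlation coefficient without ties (notation mine):

   \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}

   where d_i is the difference between the two rank positions of object i. For the three-object
   example above (a, b, c vs. a, c, b): d = (0, 1, -1), so \rho = 1 - \frac{6 \cdot 2}{3 \cdot 8} = 0.5.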
Rank Intuitions: Setup

[Figure: the same 10 objects in two rankings, positions 1–10 in each, with a line connecting each
 object's position in R1 to its position in R2.]




 The sequence <3,1,4,10,5,9,2,6,8,7> is sufficient to encode the two rankings.
                                                                                 156
Rank Intuitions: Pairs

Rankings in complete agreement.




             Rankings in complete disagreement.
                                                   157
Rank Intuitions: Spearman




       Segment lengths represent R1 rank scores.   158
Rank Intuitions: Kendall




      Segment lengths represent R1 rank scores.   159
What about ties?
The position of an object within a set of objects that have the same
score in a ranking affects the rank correlation.




                                                                                                     160
         For example, the red positioning of o_j leads to a lower Spearman's coefficient; the green one to a higher value.
Ties
• Kendall: Strict discordance:

• Spearman:
   – Can use per entity upper and lower bounds.
   – Do as in the Olympics:




                                                  161
Ties: Kendall TauB


where:
         n_c is the number of concordant pairs,
         n_d is the number of discordant pairs,
         n   is the number of objects in the two rankings.




                    http://en.wikipedia.org/wiki/Kendall_tau#Tau-b   162
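The tau-b formula itself is image-only here; the standard form, matching the linked article and the
quantities listed above (notation mine), is:

   \tau_B = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}

   where n_0 = n(n-1)/2, n_1 = \sum_i t_i(t_i - 1)/2 over groups of tied objects in the first ranking,
   and n_2 = \sum_j u_j(u_j - 1)/2 over groups of tied objects in the second ranking.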
Uses of popularity
Popularity can be used to augment gain in NDCG by linearly scaling it:




                gain:    1      3      7      15          31
                rating:  1      2      3      4           5
                label:   poor   fair   good   excellent   perfect




                                                                         163
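For reference, the gain row above follows the standard exponential NDCG gain (notation mine):

   gain(r) = 2^r - 1,   giving 1, 3, 7, 15, 31 for r = 1, \ldots, 5.

A hedged reading of the slide is that this gain is then multiplied by a popularity-derived scaling factor.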
Next Steps
• How to determine popularity of new entities
   – Challenge: No historical data.
   – Usually there is an initial period of high popularity
     (e.g., a new restaurant is featured in the local
     paper, runs promotions, etc.).


• Good abandonment (no user clicks, but the entity is good
  in terms of satisfying the user's information
  need, e.g., it shows a phone number).
   – Use the number of impressions for named queries.



                                                    164
References
1.    Yaser S. Abu-Mostafa, Malik Magdon-Ismail & Hsuan-Tien Lin (2012) Learning From Data. AMLBook. [link]
2.    Ethem Alpaydin (2009) Introduction to Machine Learning. 2nd edition. Adaptive Computation and Machine Learning series. MIT Press.
      [link]
3.    David Barber (2012) Bayesian Reasoning and Machine Learning. Cambridge University Press. [link]
4.    Ricardo Baeza-Yates & Berthier Ribeiro-Neto (2011) Modern Information Retrieval: The Concepts and Technology behind Search. 2nd
      Edition. ACM Press Books. [link]
5.    Alex Bellos (2010) Alex's Adventures in Numberland. Bloomsbury: New York. [link]
6.    Ron Bekkerman, Mikhail Bilenko & John Langford (2011) Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge
      University Press. [link]
7.    Christopher M. Bishop (2007) Pattern Recognition and Machine Learning. Information Science and Statistics. Springer. [link]
8.    George Casella & Roger L. Berger (2001) Statistical Inference. 2nd edition. Duxbury Press. [link]
9.    Anirban DasGupta (2011) Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer Texts in Statistics.
      Springer. [link]
10.   Luc Devroye, László Györfi & Gábor Lugosi (1996) A Probabilistic Theory of Pattern Recognition. Springer. [link]
11.   Richard O. Duda, Peter E. Hart & David G. Stork (2000) Pattern Classification. 2nd Edition. Wiley-Interscience. [link]
12.   Trevor Hastie, Robert Tibshirani & Jerome Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
      2nd Edition. Springer Series in Statistics. Springer. [link]
13.   James L. Johnson (2008) Probability and Statistics for Computer Science. Wiley-Interscience. [link]
14.   Daphne Koller & Nir Friedman (2009) Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine
      Learning series. MIT Press. [link]
15.   David J. C. MacKay (2003) Information Theory, Inference and Learning Algorithms. Cambridge University Press. [link]
16.   Zbigniew Michalewicz & David B. Fogel (2004) How to Solve It: Modern Heuristics. 2nd edition. Springer. [link]
17.   Tom M. Mitchell (1997) Machine Learning. McGraw-Hill (Science/Engineering/Math). [link]
18.   Mehryar Mohri, Afshin Rostamizadeh & Ameet Talwalkar (2012) Foundations of Machine Learning. Adaptive Computation and Machine
      Learning series. MIT Press. [link]
19.   Lior Rokach (2010) Pattern Classification Using Ensemble Methods. World Scientific. [link]
20.   Gilbert Strang (1991) Calculus. Wellesley-Cambridge Press. [link]
21.   Larry Wasserman (2010) All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer. [link]
22.   Sholom M. Weiss, Nitin Indurkhya & Tong Zhang (2010) Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer. [link]
                                                                                                                                       165
Roadmap
•   Examples of applications of Machine Learning
•   Encoding objects with features
•   The Machine Learning framework
•   Linear models
     – Perceptron, Winnow, Logistic Regression, Robust Risk Minimization (RRM)
• Tree models (Decision Trees DTs)
     – Classification Decision Trees, Regression Trees
• Boosting
     – AdaBoost
• Ranking evaluation
     – Kendall tau and Spearman’s coefficient
• Sequence labeling
     – Hidden Markov Models (HMMs)

                                                                                  166
SEQUENCE LABELING:
HIDDEN MARKOV MODELS (HMMs)

                              167
Outline
      •   The guessing game
      •   Tagging preliminaries
      •   Hidden Markov Models
      •   Trellis and the Viterbi algorithm
      •   Implementation (Python)
      •   Complexity of decoding
      •   Parameter estimation and smoothing
      •   Second order models

168
The Guessing Game

• A cow and a duck write an email message together.
• Goal – figure out which word is written by which animal.




 169                         The cow/duck illustration of HMMs is due to Ralph Grishman (NYU).
What’s the Big Deal ?

• The vocabularies of the cow and the duck can
  overlap and it is not clear a priori who wrote a
  certain word!




 170
The Game (cont)

      ?     ?       ?



      moo   hello   quack




      COW   ?       DUCK



      moo   hello   quack

171
The Game (cont)


      DUCK


COW   COW     DUCK



moo   hello   quack




                      172
What about the Rest of the Animals?

    ANT     ANT     ANT     ANT     ANT


    COW     COW     COW     COW     COW


    DUCK    DUCK    DUCK    DUCK    DUCK


    PIG     PIG     PIG     PIG     PIG


    ZEBRA   ZEBRA   ZEBRA   ZEBRA   ZEBRA



    word1   word2   word3   word4   word5
                                            173
A Game for Adults
• Instead of guessing which animal is associated
  with each word guess the corresponding POS
  tag of a word.



Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/,
will/MD join/VB the/DT board/NN as/IN
a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./.



                                                  174
POS Tags
      "CC", "CD", "DT", "EX", "FW",
      "IN", "JJ", "JJR", "JJS", "LS",
      "MD","NN", "NNS","NNP", "NNPS",
      "PDT", "POS", "PRP", "PRP$", "RB",
      "RBR", "RBS", "RP", "SYM", "TO",
      "UH", "VB", "VBD", "VBG", "VBN",
      "VBP", "VBZ", "WDT", "WP", "WP$",
      "WRB", "#", "$", ".",",",
      ":", "(", ")", "`", "``",
      "'", "''"




175
Tagging Preliminaries

• We want the best set of tags for a sequence of words
  (a sentence)
• W — a sequence of words
• T — a sequence of tags


            \hat{T} = \arg\max_{T} P(T \mid W)

                                                     176
Bayes’ Theorem (1763)


                                 posterior = likelihood × prior / marginal likelihood

                        P(T \mid W) = \frac{P(W \mid T)\, P(T)}{P(W)}

Reverend Thomas Bayes — Presbyterian minister (1702-1761)

                                                                          177
Applying Bayes’ Theorem
• How do we approach P(T|W) ?
• Use Bayes’ theorem


           \arg\max_{T} P(T \mid W) = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}

• So what? Why is it better?
• Ignore the denominator (and the question):

   \arg\max_{T} P(T \mid W) = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)} = \arg\max_{T} P(W \mid T)\, P(T)


                                                                          178
Tag Sequence Probability

         How do we get the probability P(T)
         of a specific tag sequence T?

• Count the number of times a sequence occurs
  and divide by the number of sequences of that
  length — not likely!
  – Use chain rule


 179
Chain Rule
                                                                      history
        P(T) = P(t_1, \ldots, t_n)
             = P(t_1)\, P(t_2 \mid t_1)\, P(t_3 \mid t_1 t_2) \cdots P(t_n \mid t_1, \ldots, t_{n-1})

P(T) is a product of the probability of the N-grams
  that make it up
Make a Markov assumption: the current tag
  depends on the previous one only:

                P(t_1, \ldots, t_n) = P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1})
 180
Transition Probabilities

• Use counts from a large hand-tagged corpus.
• For bi-grams, count all the t_{i-1} t_i pairs

               P(t_i \mid t_{i-1}) = \frac{c(t_{i-1} t_i)}{c(t_{i-1})}

• Some counts are zero – we’ll use smoothing to address
  this issue later.


                                                          181
What about P(W|T) ?
• First it's odd—it is asking the probability of seeing “The white
  horse” given “Det Adj Noun”!
   – Collect up all the times you see that tag sequence and see how often “The
     white horse” shows up …
• Assume each word in the sequence depends only on its
  corresponding tag:

                  P(W \mid T) = \prod_{i=1}^{n} P(w_i \mid t_i)

                                                 emission probabilities
                                                                            182
Emission Probabilities

• What proportion of times is the word w_i associated with
  the tag t_i (as opposed to another word):


                P(w_i \mid t_i) = \frac{c(w_i, t_i)}{c(t_i)}


 183
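A minimal sketch of estimating the transition and emission tables from a tagged corpus using the
count ratios above; the toy corpus and names are illustrative, and unseen pairs get probability 0
here (the slides address this with smoothing later).

from collections import Counter

def estimate_hmm(tagged_sentences):
    """tagged_sentences: list of [(word, tag), ...]. Returns (transition, emission) dicts with
    P(t_i | t_{i-1}) = c(t_{i-1} t_i) / c(t_{i-1}) and P(w | t) = c(w, t) / c(t)."""
    tag_count = Counter()
    bigram_count = Counter()
    emit_count = Counter()
    for sent in tagged_sentences:
        tags = ['^'] + [t for _, t in sent]          # '^' marks the sentence start
        for prev, cur in zip(tags, tags[1:]):
            bigram_count[(prev, cur)] += 1
        for word, tag in sent:
            emit_count[(word, tag)] += 1
        tag_count.update(tags)
    transition = {(p, c): n / tag_count[p] for (p, c), n in bigram_count.items()}
    emission = {(w, t): n / tag_count[t] for (w, t), n in emit_count.items()}
    return transition, emission

corpus = [[('moo', 'COW'), ('hello', 'COW'), ('quack', 'DUCK')],
          [('moo', 'COW'), ('quack', 'DUCK'), ('hello', 'DUCK')]]
transition, emission = estimate_hmm(corpus)
print(transition[('COW', 'DUCK')], emission[('quack', 'DUCK')])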
The “Standard” Model

      \arg\max_{T} P(T \mid W)
           = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
           = \arg\max_{T} P(W \mid T)\, P(T)
           = \arg\max_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})




184
Hidden Markov Models
• Stochastic process:
  A sequence X_1, X_2, \ldots of random variables
  based on the same sample space \Omega.

• Probabilities for the first observation:
      P(X_1 = x_j)  for each outcome x_j

• Next step given previous history:
      P(X_{t+1} = x_{i_{t+1}} \mid X_1 = x_{i_1}, \ldots, X_t = x_{i_t})
                                                         185
Markov Chain

• A Markov Chain is a stochastic process with the Markov
  property:

       P(X_{t+1} = x_{i_{t+1}} \mid X_1 = x_{i_1}, \ldots, X_t = x_{i_t}) = P(X_{t+1} = x_{i_{t+1}} \mid X_t = x_{i_t})



• Outcomes are called states.
• Probabilities for next step – weighted finite state
  automata.

                                                                                          186
State Transitions w/ Probabilities
                                0.5




                                COW
                                             0.2
                    1.0



                                                   END
            START         0.3          0.3




                                             0.2


                                DUCK




                                0.5
187
Markov Model

Markov chain                                      0.5          moo:0.9

 where each state                                                    hello:0.1

 can output signals
                                                  COW
                      ^:1.0                                    0.2               $:1.0
                                      1.0


                                                                         END
                              START         0.3          0.3



                                                               0.2

                                                  DUCK
  (like “Moore machines”):                                           hello:0.4

                                                                     quack:0.6
                                                  0.5
188
The Issue Was
• A given output symbol can potentially
  be emitted by more than one state —
  omnipresent ambiguity in natural language.




189
Markov Model
Finite set of states:
              \{s_1, \ldots, s_m\}

Signal alphabet:
              \{\sigma_1, \ldots, \sigma_k\}

Transition matrix:
         P = [p_{ij}]  where  p_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)

Emission probabilities:
         A = [a_{ij}]  where  a_{ij} = P(O_t = \sigma_j \mid X_t = s_i)

Initial probability vector:
         v = [v_1, \ldots, v_m]  where  v_j = P(X_1 = s_j)
                                                                     190
Graphical Model


      STATE       TAG           …


      OUTPUT      word




191
Hidden Markov Model

• A Markov Model for which it is not possible to observe
  the sequence of states.
• S: unknown — sequence of states            S \in \mathcal{S}^*
• O: known — sequence of observations        O \in \Sigma^*

                \arg\max_{S} P(S \mid O)
                           (tags)  (words)
                                                           192
The State Space
              moo:0.9                               hello:0.1


                                0.5                       0.5
                   COW                     COW                  COW
       1.0
                                                                        0.2
                        0.3                  0.3

      START                                                                   END

                        0.3                       0.3
      0.0                                                               0.2
                               0.5                      0.5
                   DUCK                    DUCK                 DUCK


                                                   hello:0.4              quack:0.6



                  moo                     hello                 quack


                              More on how the probabilities come about (training) later.
193
Optimal State Sequence:
            The Viterbi Algorithm
We define the joint probability of the most likely sequence from
time 1 to time t ending in state s_i and the observed sequence O_{\le t}
up to time t:



      \delta_t(i) = \max_{S_{\le t-1}} P(S_{\le t-1}, X_t = s_i ;\, O_{\le t})

                  = \max_{s_{i_1}, \ldots, s_{i_{t-1}}} P(X_1 = s_{i_1}, \ldots, X_{t-1} = s_{i_{t-1}}, X_t = s_i ;\, O_{\le t})




                                                                                          194
Key Observation
The most likely partial derivation leading to state s_i at
position t consists of:
      – the most likely partial derivation leading to some state s_{i_{t-1}}
        at the previous position t-1,
      – followed by the transition from s_{i_{t-1}} to s_i.




195
Viterbi (cont)

Note:

   \delta_1(i) = v_i\, a_{i k_1}   where v_i = P(X_1 = s_i) and a_{i k_1} = P(O_1 = \sigma_{k_1} \mid X_1 = s_i)

We will show that:

                 \delta_t(j) = \bigl[\max_i \delta_{t-1}(i)\, p_{ij}\bigr]\, a_{j k_t}



                                                                                           196
Recurrence Equation
      \delta_t(j) = \max_{S_{\le t-1}} P(S_{\le t-1}, X_t = s_j ;\, O_{\le t})

                  = \max_i \max_{S_{\le t-2}} P(S_{\le t-2}, X_{t-1} = s_i, X_t = s_j ;\, O_{\le t-1}, O_t = \sigma_{k_t})

                  = \max_i \max_{S_{\le t-2}} P(X_t = s_j ;\, O_t = \sigma_{k_t} \mid S_{\le t-2}, X_{t-1} = s_i ;\, O_{\le t-1})
                                              \cdot P(S_{\le t-2}, X_{t-1} = s_i ;\, O_{\le t-1})

                  = \max_i \max_{S_{\le t-2}} P(X_t = s_j \mid X_{t-1} = s_i)\, P(O_t = \sigma_{k_t} \mid X_t = s_j)
                                              \cdot P(S_{\le t-2}, X_{t-1} = s_i ;\, O_{\le t-1})

                  = \bigl[\max_i P(X_t = s_j \mid X_{t-1} = s_i) \max_{S_{\le t-2}} P(S_{\le t-2}, X_{t-1} = s_i ;\, O_{\le t-1})\bigr]
                    \cdot P(O_t = \sigma_{k_t} \mid X_t = s_j)

                  = \bigl[\max_i p_{ij}\, \delta_{t-1}(i)\bigr]\, a_{j k_t}
197
Back Pointers

• The predecessor of state s_i in the path corresponding to
   \delta_t(i):

               \psi_t(j) = \arg\max_{1 \le i \le m} \bigl(\delta_{t-1}(i)\, p_{ij}\bigr)



• Optimal state sequence:

             s^*_{k_n} = \arg\max_{1 \le i \le m} \delta_n(i)

             s^*_{k_t} = \psi_{t+1}(s^*_{k_{t+1}})   for t = n-1, \ldots, 1

                                                            198
The Trellis

                     moo     hello    quack       $

               t=0    t=1     t=2      t=3         t=4


       START    1      0       0        0           0


       COW      0      0.9     0.045    0           0


       DUCK     0      0       0.108    0.0324      0
                                        (the competing value 0.0081, via COW, loses)


       END      0      0       0        0           0.00648


199
Implementation (Python)
observations = ['^','moo','hello','quack','$']    # signal sequence
states       = ['start','cow','duck','end']

# Transition probabilities -   p[FromState][ToState] = probability
p = {'start': {'cow':1.0},
     'cow':   {'cow' :0.5,
               'duck':0.3,
               'end' :0.2},
     'duck': {'duck':0.5,
               'cow' :0.3,
               'end' :0.2}}

# Emission probabilities; special emission symbol '$' for 'end' state
a = {'cow' : {'moo' :0.9,'hello':0.1, 'quack':0.0, '$':0.0},
     'duck': {'moo' :0.0,'hello':0.4, 'quack':0.6, '$':0.0},
     'end' : {'moo' :0.0,'hello':0.0, 'quack':0.0, '$':1.0}}

200
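The slides continue with the decoder itself, which is not reproduced in this export. A minimal
Viterbi sketch that consumes the tables above and reproduces the trellis values computed earlier
could look as follows; this is my completion, not the original code.

def viterbi(observations, states, p, a):
    # delta[t][s]: probability of the best path ending in state s after the t-th signal;
    # back[t][s]: predecessor of s on that path.
    delta = [{'start': 1.0}]
    back = [{}]
    for t, obs in enumerate(observations[1:], start=1):   # skip the '^' start marker
        delta.append({})
        back.append({})
        for s in states[1:]:                               # every state except 'start'
            best_prev, best_prob = None, 0.0
            for prev, prev_prob in delta[t - 1].items():
                prob = prev_prob * p.get(prev, {}).get(s, 0.0) * a[s].get(obs, 0.0)
                if prob > best_prob:
                    best_prev, best_prob = prev, prob
            if best_prev is not None:
                delta[t][s] = best_prob
                back[t][s] = best_prev
    # Follow the back pointers from 'end' to recover the best state sequence.
    path = ['end']
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return delta[-1].get('end', 0.0), list(reversed(path))

prob, path = viterbi(observations, states, p, a)
print(prob, path)   # expected: 0.00648 ['start', 'cow', 'duck', 'duck', 'end']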
Data Cloud, More than a CDP by Matt Robison
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Machine Learning with Applications in Categorization, Popularity and Sequence Labeling

  • 10. Two Ladies in a Men’s Club 10
  • 11. SUBJ IOBJ We serve men We serve food to men. We serve our community. serve —IndirectObject men SUBJ DOBJ We serve men We serve organic food. We serve coffee to connoisseurs. serve —DirectObject men 11
  • 12. Coreference Audi is an automaker that makes luxury cars and SUVs. The company was born in Germany. It was established by August Horch in 1910. Horch had previously founded another company and his models were quite popular. Audi started with four cylinder models. By 1914, Horch's new cars were racing and winning. August Horch left the Audi company in 1920 to take a position as an industry representative for the German motor vehicle industry federation. Currently Audi is a subsidiary of the Volkswagen group and produces cars of outstanding quality. 12
  • 13. Parts of Objects (Meronymy) […] the interior seems upscale with leatherette upholstery that looks and feels better than the real cow hide found in more expensive vehicles, a dashboard accented by textured soft-touch materials, a woven mesh headliner, and other materials that give the New Beetle’s interior a sense of quality. […] Finally, and a big plus in my book, both front seats were height adjustable, and the steering column tilted and telescoped for optimum comfort. 13
  • 14. Sentiment Analysis Positive Negative Xbox Xbox I love pineapple nearly as much as I hate bananas. POSITIVE sentiment regarding topic pineapple. 14
  • 15. Chinese Sentiment Sentence Car aspects Sentiment categories 15
  • 16. 16
  • 17. 17
  • 18. Categorization • High-level task: – Given a restaurant what is its restaurant sub-category? • Encoding entities with features • Feature selection non-standard order • Linear models “Though this be madness, yet there is method in't.” 18
  • 19. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 19
  • 20. ENCODING OBJECTS WITH FEATURES Why?– ML algorithms are “generic”; most of them are cast as solutions around vector encodings of the domain objects. Regardless of the ML algorithm we will need to represent/encode the domain objects as feature vectors. How well we do this (the quality of features) directly impacts system performance. 20
  • 21. Flat Object Encoding Can be a set; the object can belong to several classes. The number of features can be millions. 37 1 0 0 1 1 1 0 1 … Machine learning (training) instance/example/observation. 21
  • 22. Structured Objects to Strings to Features A structured object with fields f1 … f6, where field f2:f4 contains "abcde", is turned into feature strings: uni-grams ("f2:f4>a", "f2:f4>b", "f2:f4>c", …), bi-grams ("f2:f4>a_b", "f2:f4>b_c", "f2:f4>c_d", …) and tri-grams ("f2:f4>a_b_c", "f2:f4>b_c_d", …). Read "f2:f4>a" as: field f2:f4 contains feature "a". Each feature string maps to a feature index in a table that can be quite large: *DEFAULT* → 0, f2:f4>a → 100, f2:f4>b → 101, f2:f4>c → 102, …, f2:f4>a_b → 105, f2:f4>b_c → 106, f2:f4>c_d → 107, …, f2:f4>a_b_c → 109. 22
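To make the string-to-index mapping concrete, here is a minimal Python sketch of such a feature dictionary; the helper name feature_to_index and the example strings are illustrative, not from the slides, and index 0 is reserved for the always-on default feature.
feature_index = {"*DEFAULT*": 0}            # index 0: the default feature that always triggers

def feature_to_index(feature_string):
    """Return the integer index for a feature string, adding it on first sight."""
    if feature_string not in feature_index:
        feature_index[feature_string] = len(feature_index)
    return feature_index[feature_string]

# "f2:f4>a_b" reads as: field f2:f4 contains the bi-gram "a_b".
indices = [feature_to_index(f) for f in ["f2:f4>a", "f2:f4>a_b", "f2:f4>a_b_c"]]
print(indices)   # [1, 2, 3] in this toy run; real tables can be quite large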
  • 23. Sliding Window (bi-grams) SkyCity at the Space Needle add initial “^” and final “$” tokens ^ SkyCity at the Space Needle $ sliding window ^ SkyCity at the Space Needle $ ^ SkyCity at the Space Needle $ ^ SkyCity at the Space Needle $ ^ SkyCity at the Space Needle $ 23
  • 24. Example: Feature Templates
public static List<string> NGrams( string field )   // could add the field name as an argument and prefix all features with it
{
    var features = new List<string>();
    string[] tokens = field.Split( spaceCharArr, System.StringSplitOptions.RemoveEmptyEntries );
    features.Add( string.Join( "", field.Split( SPLIT_CHARS ) ) );   // the entire field
    string unigram = string.Empty, bigram = string.Empty, previous1 = "^", previous2 = "^", trigram;
    for (int i = 0; i < tokens.Length; i++)
    {
        unigram = tokens[ i ];
        features.Add( unigram );
        bigram = previous1 + "_" + unigram;          // initial bi-gram is "^_tokens[0]"
        features.Add( bigram );
        if (i >= 1)
        {
            trigram = previous2 + "_" + bigram;      // initial tri-gram is "^_tokens[0]_tokens[1]"
            features.Add( trigram );
        }
        previous2 = previous1;
        previous1 = unigram;
    }
    features.Add( unigram + "_$" );
    features.Add( bigram + "_$" );                   // last tri-gram is "tokens[tokens.Length-2]_tokens[tokens.Length-1]_$"
    return features;
} 24
  • 25. The Art of Feature Engineering: Disjunctive Features • Useful feature = triggers often and with a particular class. • Rarely occurring (but indicative of a class) features can be combined in a disjunction. This results in: – Need for less data to achieve good performance. – Final system performance (with all available data) is higher. • How can we get insights about such features: Error analysis! Regex ITALIAN_FOOD = new Regex(@"al dente|agnello|alfredo|antipasti|antipasto|arrabbiata|bistecca|bolognese| branzino|caprese|carbonara|carpaccio|cioppino|cozze|fettuccine|filetto|focaccia|frutti di mare|funghi| gnocchi|gorgonzola|insalata|lasagna|linguine|linguini|macaroni|minestrone|mozzarella|ossobuco|panini| panino| parmigiana|pasticcio|pecorino|penne|pepperoncini|pesce|pesto|piatti|piatto|piccata|polpo|pomodori|prosciutto| radicchio|ravioli|ricotta|rigatoni|risotto|saltimbocca|scallopini|scaloppini|spaghetti|tagliatelle|tiramisu| tortellini|vitello|vongole"); if (ITALIAN_FOOD.Match(entity.description).Success) features.Add("Italian_Food_Matched_Description"); Triggering of the feature. Up to us how we call the feature. 25
  • 26. Generic Nature of ML Systems human sees Indices of (binary) features that trigger. instance( class= 7, features=[0,300857,100739,200441,...]) computer "sees" instance( class=99, features=[0,201937,196121,345758,13,...]) instance( class=42, features=[0,99173,358387,1001,1,...]) ... The number of features that trigger for individual instances is often not the same. The default feature (index 0) always triggers. 26
  • 27. Training Data Instance w/ outcome. 27
  • 28. Feature Selection • Templates: powerful way to get lots of features. • We get too many features. e.g., 20M for dependency parsing. • Danger of overfitting. Doing well on seen data but poorly on unseen data. • Feature selection: Automatic ways of finding discriminative features. – CountCutOff. – TFxIDF. – Mutual information. – Information gain. – Chi square. We will examine in detail the implementation of this. 28
  • 30. Information Gain Balances effects of feature triggering for an object with the effects of feature being absent for an object. 30
  • 31. Chi Square
float Chi2(int a, int b, int c, int d)
{
    // Note: in C# the ^ operator is XOR, so the squared term must be written as a product.
    float diff = (float)a * d - (float)b * c;
    return (a + b + c + d) * diff * diff / ((float)(a + b) * (a + c) * (c + d) * (b + d));
}
31
  • 32. Exponent(Log) Trick
While the final output may not be big, intermediate results are. Solution: compute in log space.
float Chi2(int a, int b, int c, int d)
{
    float diff = (float)a * d - (float)b * c;
    return (a + b + c + d) * diff * diff / ((float)(a + b) * (a + c) * (c + d) * (b + d));
}
float Chi2_v2(int a, int b, int c, int d)
{
    double total = a + b + c + d;
    double n = Math.Log(total);
    double num = 2.0 * Math.Log(Math.Abs((double)a * d - (double)b * c));
    double den = Math.Log(a + b) + Math.Log(a + c) + Math.Log(c + d) + Math.Log(b + d);
    return (float)Math.Exp(n + num - den);
} 32
  • 33. Chi Square: Score per Feature 33
  • 34. Chi Square Feature Selection
int[] featureCounts = new int[ numFeatures ];
int numLabels = labelIndex.Count;
int[] classTotals = new int[ numLabels ];          // instances with that label.
float[] classPriors = new float[ numLabels ];      // class priors: classTotals[label]/numInstances.
int[,] counts = new int[ numLabels, numFeatures ]; // (label,feature) co-occurrence counts.
int numInstances = instances.Count;
...  // Do a pass over the data and collect the above counts.
float[] weightedChiSquareScore = new float[ numFeatures ];
for (int f = 0; f < numFeatures; f++)              // f is a feature index
{
    float score = 0.0f;
    for (int labelIdx = 0; labelIdx < numLabels; labelIdx++)
    {
        int a = counts[ labelIdx, f ];             // label and feature co-occur
        int b = classTotals[ labelIdx ] - a;       // label without the feature
        int c = featureCounts[ f ] - a;            // feature without the label
        int d = numInstances - ( a + b + c );      // neither
        if (a >= MIN_SUPPORT && b >= MIN_SUPPORT)  // MIN_SUPPORT = 5
        {
            score += classPriors[ labelIdx ] * Chi2( a, b, c, d );
        }
    }
    weightedChiSquareScore[ f ] = score;           // weighted average across all classes
} 34
  • 35. ⇒ Summary: Encoding • Object representation is crucial. • Humans: good at suggesting features (templates). • Computers: good at filtering (feature selection). The system designer does not have to worry about which feature is more important or useful, and the job is left to the learning algorithm to assign appropriate weights to the corresponding features. The system designer’s job is to define a set of features that is large enough to represent most of the useful information, yet small enough to be manageable for the algorithms and the infrastructure. • Feature engineering: Ensuring systems use the “right” features. 35
  • 36. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 36
  • 38. Machine Learning: Representation Complex decision making: prediction (response/dependent variable). input/independent variable Can be qualitative/quantitative (classification/regression). classifier 38
  • 39. Notation 39
  • 40. Machine Learning object encoded with features Offline Online Training Model classifier System Sub-system TRAINING prediction (response/dependent variable) 40
  • 41. Classes of Learning Problems • Classification: Assign a category to each item (Chinese | French | Indian | Italian | Japanese restaurant). • Regression: Predict a real value for each item (stock/currency value, temperature). • Ranking: Order items according to some criterion (web search results relevant to a user query). • Clustering: Partition items into homogeneous groups (clustering twitter posts by topic). • Dimensionality reduction: Transform an initial representation of items into a lower-dimensional representation while preserving some properties (preprocessing of digital images). 41
  • 42. ML Terminology • Examples: Items or instances used for learning or evaluation. • Features: Set of attributes represented as a vector associated with an example. • Labels: Values or categories assigned to examples. In classification the labels are categories; in regression the labels are real numbers. • Target: The correct label for a training example. This is extra data that is needed for supervised learning. • Output: Prediction label from input set of features using a model of the machine learning algorithm. • Training sample: Examples used to train a machine learning algorithm. • Validation sample: Examples used to tune parameters of a learning algorithm. • Model: Information that the machine learning algorithm stores after training. The model is used when predicting the output labels of new, unseen examples. • Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage. • Loss function: A function that measures the difference/loss between a predicted label and a true label. We will design the learning algorithms so that they minimize the error (cumulative loss across all training examples). • Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels. The learning algorithm chooses one function among those in the hypothesis set to return after training. Usually we pick a class of functions (e.g., linear functions) parameterized by a set of free parameters (e.g., coefficients of the linear function) and pinpoint the final hypothesis by identifying the parameters that minimize the error. • Model selection: Process for selecting the free parameters of the algorithm (actually of the function in the hypothesis set). 42
  • 43. Classification Yes, this is mysterious at this point. + − + + − − + − + + − − − + − + + − − + − + − − 43 decision boundary
  • 45. One-Versus-All (OVA) For each category in turn, create a binary classifier where an instance in the data belonging to the category is considered a positive example, all other examples are considered negative examples. Given a new object, run all these binary classifiers and see which classifier has the “highest prediction”. The scores from the different classifiers need to be calibrated! 45
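A minimal Python sketch of one-versus-all prediction, assuming each binary classifier exposes a scoring function that returns a calibrated score (higher means more likely positive); the names here are illustrative, not from the slides.
def ova_predict(binary_scorers, x):
    """binary_scorers: dict mapping class label -> scoring function.
    Scores must be calibrated (e.g., probabilities) to be comparable across classifiers."""
    return max(binary_scorers, key=lambda label: binary_scorers[label](x))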
  • 46. One-Versus-One (OVO) For each pair of classes, create a binary classifier on the data labeled as either of the classes. How many such classifiers? k(k−1)/2 for k classes. Given a new instance, run all classifiers and predict the class with the maximum number of wins. 46
  • 47. Errors “Nobody is perfect, but then again, who wants to be nobody.” Average error across all instances. Goal: Minimize the Error. Beneficial to have differentiable loss function. #misclassified examples (penalty score of 1 for every misclassified example). 47
  • 48. Error: Function of the Parameters The cumulative error across all instances is a function of the parameters. 1 2 48
  • 49. Evaluation • Motivation: – Benchmark algorithms (which system is better). – Tuning parameters during training. 49
  • 50. Evaluation Measures Generalization error: the probability of misclassifying an instance selected according to the distribution of the labeled instance space. Training error: the percentage of training examples which are incorrectly classified; an optimistically biased estimate of the generalization error, especially if the inducer over-fits the (training) data. Empirical estimation of the generalization error: • Held-out method • Re-sampling: 1. Random resampling 2. Cross-validation 50
  • 51. Precision, Recall and F-measure General Setup Let's consider binary classification over the space of all instances: the system identified some instances as negative and got them correct (true negatives); some as negative but got them wrong (false negatives); some as positive and got them correct (true positives); and some as positive but got them wrong (false positives). The two sets to compare are the instances identified as positive by the system and the positive instances in reality. 51
  • 52. Definitions Accuracy, Precision, Recall, and F-measure. TN: true negatives, FP: false positives, TP: true positives, FN: false negatives. Precision: TP / (TP + FP). Recall: TP / (TP + FN). F-measure: harmonic mean of precision and recall, 2·P·R / (P + R). Accuracy: (TP + TN) / (TP + TN + FP + FN). 52
  • 53. Accuracy vs. Prec/Rec/F-meas Accuracy can be misleading for evaluating a model with an imbalanced class distribution. When there are more majority-class instances than minority-class instances, always predicting the majority class gives good accuracy. Precision and recall (together) are better indicators. As a single aggregate number, the F-measure favors the lower of the precision or recall. 53
  • 54. Extreme Cases for Precision & Recall all instances TN: true negatives TP: true positives FN: false negatives system actual all instances system FP: false positives TP: true positives Precision can be traded for recall and vice versa. 54
  • 55. Definitions Sensitivity & Specificity. TN: true negatives, FP: false positives, TP: true positives, FN: false negatives. Sensitivity: TP / (TP + FN) [same as recall; aka true positive rate]. Specificity: TN / (TN + FP) [aka true negative rate]. False positive rate: FP / (FP + TN). False negative rate: FN / (FN + TP). 55
  • 56. Venn Diagrams These visualization diagrams were introduced by John Venn: John Venn (1880) “On the Diagrammatic and Mechanical Representation of Propositions and Reasonings”, Philosophical Magazine and Journal of Science, 5:10(59). What if there are three classes? Four classes? With more classes our visual intuitions are helping less and less. A subtle point: These are just the actual/real classes without the system Six classes? classes drawn on top! 56
  • 57. Confusion Matrix Shows how the predictions of instances of an actual class are distributed across all classes. In an example confusion matrix for three classes, rows correspond to actual classes and columns to predicted classes: the cell (actual A, predicted A) holds the number of instances in the actual class A AND predicted as belonging to class A; the cell (actual A, predicted B) holds the number of instances in the actual class A BUT predicted as belonging to class B; and so on. Row totals give the total number of actual instances of each class, column totals give the total number of instances predicted as each class, and the grand total is the total number of instances. Counts on the diagonal are the true positives for each class; counts off the diagonal are errors. Confusion matrices can handle many classes. 57
  • 58. Confusion Matrix: Accuracy, Precision and Recall Given a confusion matrix, it's easy to compute accuracy, precision and recall:
                Predicted A   Predicted B   Predicted C   Total
    Actual A         50            80            70         200
    Actual B         40           140           120         300
    Actual C        120           220           160         500
    Total           210           440           350        1000
Confusion matrices can, themselves, be confusing sometimes. 58
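As a worked sketch (Python/NumPy, not from the slides), here is how accuracy and per-class precision and recall fall out of the matrix above, with rows as actual classes and columns as predicted classes.
import numpy as np

cm = np.array([[ 50,  80,  70],    # actual A
               [ 40, 140, 120],    # actual B
               [120, 220, 160]])   # actual C

accuracy  = cm.trace() / cm.sum()            # (50 + 140 + 160) / 1000 = 0.35
precision = cm.diagonal() / cm.sum(axis=0)   # per class: [50/210, 140/440, 160/350]
recall    = cm.diagonal() / cm.sum(axis=1)   # per class: [50/200, 140/300, 160/500]
print(accuracy, precision, recall)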
  • 59. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 59
  • 60. LINEAR MODELS Why?– Linear models are good way to learn about core ML concepts. 60
  • 61. Refresher: Vectors vector vector point point vector points are also vectors. Equation of the line. Can be re-written as: sum of vectors vector notation 61 transpose
  • 62. Refresher: Vectors (2) Equation of the line. Can be re-written as: Normal vector. vector notation 62
  • 63. Refresher: Dot Product
float DotProduct(float[] v1, float[] v2)
{
    float sum = 0.0f;
    for (int i = 0; i < v1.Length; i++)
        sum += v1[i] * v2[i];
    return sum;
}
63
  • 64. Refresher: Pos/Neg Classes + Normal vector. − 64
  • 65. sgn Function In mathematics: We will use: Informally drawn as: 65
  • 66. Two Linear Models The features of an object have associated weights indicating their importance. Signal: s = w·x (the weighted sum of the features). Perceptron: h(x) = sgn(w·x). Linear regression: h(x) = w·x. 66
  • 67. Why “Regression”? Why the term for quantitative output prediction is “regression”? “That same year [1875], [Francis] Galton decided to recruit some of his friends for an experiment with sweet peas. He distributed seeds among seven of them, asking them to plant the seeds and return the offspring. Galton measured the baby seeds and compared their diameters to those of their parents. He noticed a phenomenon that initially seems counter-intuitive: the large seeds tended to produce smaller offspring, and the small seeds tended to produce larger offspring. A decade later he analyzed data from his anthropometric laboratory and recognized the same pattern with human heights. After measuring 205 pairs of parents and their 928 adult children, he saw that exceptionally tall parents had kids who were generally shorter than they were, while exceptionally short parents had children who were generally taller than their parents. After reflecting upon this, we can understand why it must be the case. If very tall parents always produced even taller children, and if very short parents always produced even shorter ones, we would by now have turned into a race of giants and midgets. Yet this hasn't happened. Human populations may be getting taller as a whole – due to better nutrition and public health – but the distribution of heights within the population is still contained. Galton called this phenomenon ‘regression towards mediocrity in hereditary stature’. The concept is now more generally known as regression to the mean.” [A.Bellos pp.375] 67
  • 68. On-Line (Sequential) Learning • On-line = process one example at a time. • Attractive for large scale problems. iteration (epoch/time). Compute loss. Objective: Minimize cumulative loss: return parameters 68
  • 69. On-Line (Sequential) Learning (2) Sometimes written out more explicitly: # passes over the data. for each data item. return parameters return parameters 69
  • 70. Perceptron Linearly separable data: Non-linearly separable data: + − + + − + + − + + − − − + + + − − + + + + − + − + − − + − − + − + − + − + − − − − + − + − − − + − − + − − 70
  • 71. First: Perceptron Update Rule Simplification: Lines pass through origin. + + − + − 71
  • 73. Perceptron Learning Algorithm iteration (epoch/time). return parameters 73
  • 74. Perceptron Learning Algorithm (algorithm makes multiple passes over data.) return parameters 74
  • 75. Perceptron Learning Algorithm (PLA) while( mis-classified examples exist ): pick a misclassified example; misclassified means sgn(w·x_i) ≠ y_i with the current weights w; update the weights: w ← w + y_i·x_i. 1. A challenge: The algorithm will not terminate for non-linearly separable data (outliers, noise). 2. Unstable: it can jump from a good perceptron to a really bad one within one update. 3. Attempting to directly minimize the number of misclassified examples is NP-hard. 75
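A minimal Python/NumPy sketch of the perceptron learning algorithm, assuming data is a list of (x, y) pairs with x a NumPy feature vector and y in {-1, +1}; the fixed epoch count is an assumption, since the plain algorithm only terminates on its own for linearly separable data.
import numpy as np

def perceptron(data, num_features, epochs=10):
    w = np.zeros(num_features)
    for _ in range(epochs):                 # multiple passes over the data
        for x, y in data:
            if y * np.dot(w, x) <= 0:       # misclassified under the current weights
                w = w + y * x               # perceptron update rule
    return w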
  • 77. Looks Simple – Does It Work? Margin-based upper bound on updates. Fact: the number of updates made by the Perceptron Algorithm is at most R² / γ², where R = max_i ||x_i|| and γ is the margin of the best separating hyperplane on the (linearly separable) data. Remarkable: the bound does not depend on the dimension of the feature space! 77
  • 78. Compact Model Representation Use float instead of double: Store only non-zero weights (and indices): Store non-zero weights and diff of indices: void Save( StreamWriter w, int labelIdx, float[] weights ) { w.Write( labelIdx ); int previousIndex = 0; for (int i = 0; i < weights.Length; i++) Difference of indices. { if (weights[ i ] != 0.0f) { w.Write( " " + (i - previousIndex) + " " + weights[ i ] ); previousIndex = i; } } Remember last index where the weight was non-zero . } 78
  • 79. Linear Classification Solutions Different solutions (infinitely many) + − + + − − + − + + − + − − − + + − − + − + − − 79
  • 80. The Pocket Algorithm A better perceptron algorithm: Keep track of the error and update weights when we lower the error. Compute error. Expensive step! Access to the entire data needed! Only update the best weights if we lower the error! 80
  • 81. Voted Perceptron • Training as the usual perceptron algorithm (with some extra book-keeping). • Decision rule: iterations 81
  • 82. Dual Perceptron: Intuitions + + + + + + + separating line. + + normal vector − − − − − − − − 82
  • 83. Dual Perceptron (algorithm makes multiple passes over data.) return parameters Decision rule: 83
  • 84. Exclusive OR (XOR) Function Truth table: (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 0. Inputs in the plane and color-coding of the output. Challenge: The data is not linearly separable (no straight line can be drawn that separates the green from the blue points). 84
  • 85. Solution for the Exclusive OR (XOR) We introduce another input dimension: Now the data is linearly separable: 85
  • 86. Winnow Algorithm iteration (epoch). Normalizing constant. Multiplicative update. return parameters 86
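A minimal Winnow sketch in Python/NumPy, assuming binary feature vectors x in {0,1}^d (as NumPy arrays) and labels y in {0,1}; the promotion/demotion factor and threshold are common textbook choices rather than values from the slide, and the normalization step mentioned on the slide is omitted here.
import numpy as np

def winnow(data, num_features, alpha=2.0, epochs=10):
    w = np.ones(num_features)               # start with all weights equal to 1
    theta = num_features                    # a common threshold choice
    for _ in range(epochs):
        for x, y in data:
            y_hat = 1 if np.dot(w, x) >= theta else 0
            if y_hat != y:
                w *= alpha ** ((y - y_hat) * x)   # multiplicative update on the active features
    return w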
  • 87. Training, Test Error and Complexity Test error Training error Model complexity 87
  • 88. Logistic Regression Target: f(x) = P(y = +1 | x). The data does not give the probability explicitly. Logistic function: θ(s) = e^s / (1 + e^s) = 1 / (1 + e^{-s}). 88
  • 90. Derivative: Refresher Chain rule: Partial derivative: Gradient (derivatives with respect to each component): This is a vector and we Gradient of the error: can compute it at a point. 90
  • 91. Hypothesis Space Weight space/hyperplane. 91 [graph from T.Mitchell]
  • 92. Math Fact The gradient of the error: (a vector in weight space) specifies the direction of the argument that leads to the steepest increase for the value of the error. The negative of the gradient gives the direction of the steepest decrease. Negative gradient (see next slides). 92
  • 93. Computing the Gradient Because gradient is a linear operator. 93
  • 94. (Batch) Gradient Descent repeat: compute the gradient ∇E(w); update the weights: w ← w − η·∇E(w). Stopping criteria: max #iterations; marginal error improvement; a small value for the error. 94
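A minimal batch gradient descent sketch for logistic regression (Python/NumPy), assuming X is an (n, d) feature matrix and y a vector of labels in {0, 1}; the learning rate and stopping thresholds are assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, max_iter=1000, tol=1e-6):
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X @ w)                    # predicted probabilities
        gradient = X.T @ (p - y) / len(y)     # gradient of the average log loss
        w -= lr * gradient                    # step in the direction of steepest decrease
        if np.linalg.norm(gradient) < tol:    # marginal improvement -> stop
            break
    return w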
  • 95. Punch Line The new object is in the class if: classification rule. 95
  • 96. Newton’s Method (plot) 96
  • 98. Robust Risk Minimization Notation: input vector label training examples weight vector bias continuous linear model Prediction rule: Classification error: 98
  • 99. Robust Classification Loss Parameter estimation: Hinge loss: Robust classification loss: 99
  • 101. Confidence and Regularization Confidence Regularization: Unconstrained optimization (Lagrange multiplier): smaller λ corresponds to a larger A. 101
  • 102. Robust Risk Minimization Go over the training data. 102
  • 103. Learning Curve • Plots an evaluation metric against the fraction of the training data used (on the same test set!); e.g., an experiment with 50% of the training data yields an evaluation number of 70. • Highest performance is bounded by human inter-annotator agreement (ITA). • The leveling-off effect can guide us on how much data is needed. (x-axis: percentage of data used for each experiment; y-axis: evaluation metric, 0–100.) 103
  • 104. Summary • Examples of ML • Categorization • Object encoding • Linear models: – Perceptron – Winnow – Logistic Regression – RRM • Engineering aspects of ML systems 104
  • 106. Goal • Quantify how popular an entity is. Motivation: • Used in the new local search relevance metric. 106
  • 108. POPULARITY IN LOCAL SEARCH 108
  • 109. Popularity • Output a popularity score (regression) • Ensemble methods • Tree base procedure (non-linear) • Boosting 109
  • 110. When is a Local Entity Popular? • Definition: Visited by many people in the context of alternative choices. • Is the popularity of restaurants the same as the popularity of movies, etc.? • How to operationalize “visit”, “many”, “alternative choices”? – Initially we are using: popular means clicked more. • Going forward we will use: – “visit” = click given an impression. – “choice” = density of entities in the same primary category. – “many” = fraction of clicks from impressions. 110
  • 111. Local Entity Popularity The model then will be regression: 111
  • 112. Not all Clicks are Born the Same • Click in the context of a named query: – Can even be argued we are not satisfying the user information needs (and they have to click further to find out what they are looking for). • Click in the context of a category query: – Much more significant (especially when alternative results are present). 112
  • 113. Local Entity Popularity • Popularity & 1st page , current ranker. • Entities without URL. • Newly created entities. • Clicks vs. mouseovers. • Scenario: 50 French restaurants; best entity has 2k clicks. 2 Italian restaurants; best entity has 2k clicks. The French entity is more popular because of higher available choice. 113
  • 114. Entity Representation 9000 8000 … 4000 65 4.7 73 … 1 … Target feature values Machine learning (training) instance 114
  • 115. POISSON REGRESSION Why?– We will practice the ML machinery on a different problem, re-iterating the concepts. Poisson regression is an example of log-linear models good for modeling counts (e.g., number of visitors to a store in a certain time). 115
  • 116. Setup explanatory variables response/outcome variable These counts for our scenario are the clicks on the web page. A good way to model counts of observations is using the Poisson distribution. 116
  • 117. Poisson Distribution: Preliminaries The Poisson distribution realistically describes the pattern of requests over time in many client-server situations. Examples are: incoming customers at a bank, calls into a company’s telephone exchange, requests for storage/retrieval services from a database server, and interrupts to a central processor. It also has higher-dimensional applications, such as the spatial distribution of defects on integrated circuit wafers and the volume distribution of contaminants in well water. In such cases, the “events”, which are request arrivals or defect occurrences, are independent. Customers do not conspire to achieve some special pattern in their access to a bank teller; rather they operate as independent agents. The manufacture of hard disks or integrated circuits introduces unavoidable defects because the process pushes the limits of geometric tolerances. Therefore, a perfectly functional process will still occasionally produce a defect, such as a small area on the disk surface where the magnetic material is not spread uniformly or a shorted transistor on an integrated circuit chip. These errors are independent in the sense that a defect at one point does not influence, for better or worse, the chance of a defect at another point. Moreover, if the time interval or spatial area is small, the probability of an event is correspondingly small. This is a characterizing feature of a Poisson distribution: event probability decreases with the window of opportunity and is linear in the limit. A second characterizing feature, negligible probability of two or more events in a small interval, is also present in the mentioned examples. 117
  • 119. Poisson Distribution: Mental Steps This comes from the theory of Generalized Linear Models (GLM): log λ is a linear combination of the input features, log λ = w·x. Hence, the name log-linear model. 119
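A minimal Poisson regression sketch (log link) in Python/NumPy, assuming X is an (n, d) feature matrix and y holds non-negative counts such as clicks; it maximizes the Poisson log-likelihood by gradient ascent, whose gradient with respect to w is X^T (y − exp(Xw)). The learning rate and iteration count are assumptions.
import numpy as np

def train_poisson(X, y, lr=0.01, max_iter=1000):
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = np.exp(X @ w)                    # expected counts under the log-linear model
        w += lr * X.T @ (y - mu) / len(y)     # gradient ascent on the average log-likelihood
    return w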
  • 122. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 122
  • 123. DECISION TREES Why?– DTs are an influential development in ML. Combined in ensemble they provide very competitive performance. We will see ensemble techniques in the next part. 123
  • 124. Decision Trees Training instances. Color reflects the output variable (classification example). Binary partitioning of the data during training (navigating to a leaf node during testing). Prediction at the leaves. Training instances are more homogeneous in terms of the output variable (more pure) compared to ancestor nodes. Stop when instances are homogeneous or the number of instances is small. 124
  • 125. Decision Tree: Example (classification example with categorical features) Each internal node tests an attribute/feature/predicate; branches carry the values of the attribute; the branching factor depends on the number of possible values for the attribute (as seen in the training set); leaves hold the predicted classes. The example tree: Parents Visiting? Yes → Cinema; No → Weather? Sunny → Play tennis; Windy → Money? Rich → Shopping, Poor → Cinema; Rainy → Stay in. 125
  • 126. Entropy (needed for describing how an attribute is selected): H(S) = −Σ_i p_i log2 p_i. Example: plot of the binary entropy as the class probability goes from 0 to 1 (maximum of 1 bit at 0.5). 126
  • 127. Selecting an Attribute: Information Gain Measure of the expected reduction in entropy from splitting the instances on an attribute: Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v). See Mitchell ’97, p. 59 for an example. 127
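A small Python sketch of entropy and information gain for a categorical split; the 9-positive/5-negative example echoes the Wind attribute example in Mitchell ’97, but the helper names are ours.
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, groups):
    """groups: the label lists of the child nodes after splitting on an attribute."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

# 9 positives / 5 negatives split into two branches by some attribute:
print(information_gain(['+'] * 9 + ['-'] * 5,
                       [['+'] * 6 + ['-'] * 2, ['+'] * 3 + ['-'] * 3]))   # ~0.048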
  • 128. Splitting ‘Hairs’ If there are no instances in the current node, inherit statistics (majority class) from the parent node. If there are only a small number of instances, do not split the node further (statistics are unreliable). If there is more training data, the tree can be “grown” bigger. 128
  • 130. Alternative Attribute Selection: Gain Ratio [Quinlan 1986] GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), where SplitInformation(S, A) = −Σ_{v ∈ Values(A)} (|S_v| / |S|) log2 (|S_v| / |S|). The split information term penalizes attributes with many values, e.g., an attribute where all instances have different values. 130
  • 131. Alternative Attribute Selection: GINI Index [Corrado Gini: Italian statistician] Gini(S) = 1 − Σ_i p_i², where p_i is the proportion of class i in S. 131
  • 132. Space of Possible Decision Trees Number of possible trees: 132
  • 133. Decision Trees and Rule Systems The path from each leaf node to the root represents a conjunctive rule, e.g.: if (ParentsVisiting==No) & (Weather==Windy) & (Money==Poor) then Cinema. (Same tree as before: Parents Visiting? Yes → Cinema; No → Weather? Sunny → Play tennis; Windy → Money? Rich → Shopping, Poor → Cinema; Rainy → Stay in.) 133
  • 134. Decision Trees • Different training sample -> different resulting tree (different structure). • Learning does (conditional) feature selection. 134
  • 135. Regression Trees Like classification trees but the prediction is a number (as suggested by “regression”). 1. How do we split? 2. When to stop? predictions (constants) 135
  • 136. Regression Trees: How to Split 136
  • 137. Regression Trees: Pruning Tree operation where a pre-terminal gets its two leaves collapsed: 137
  • 138. Regression Trees: How to Stop 1. Don’t stop. 2. Build big tree. 3. Prune. 4. Evaluate sub-trees. 138
  • 139. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 139
  • 140. BOOSTING 140
  • 141. Ensemble Methods object encoded with features classifiers … ENSEMBLE … predictions (response/dependent variable) majority voting/averaging 141
  • 142. Where the Systems Come from Sequential ensemble scheme: … 142
  • 143. Contrast with Bagging Non-sequential ensemble scheme: DATA Data_i are independent of each other (likewise for System_i). 143
  • 144. Data System Base Procedure: Decision Tree Training instances. Color reflects the output variable (classification example). Binary partitioning of the data during training (navigating to a leaf node during testing). Prediction at the leaves. Training instances are more homogeneous in terms of the output variable (more pure) compared to ancestor nodes. Stop when instances are homogeneous or the number of instances is small. 144
  • 145. Ensemble Scheme TRAINING DATA base procedure base procedure Small systems. Original data Don’t need to be perfect. base procedure Weighted data base procedure Weighted data Final prediction (regression) 145
  • 146. Ada Boost (classification) Original data Weighted data Weighted data Weighted data normalizing factor. final prediction. 146
  • 147. AdaBoost Initializing weights. weight update. normalizing factor. final prediction. 147
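A compact AdaBoost sketch in Python/NumPy for labels y in {-1, +1}, assuming train_weak fits the base procedure (e.g., a small decision tree) on weighted data and returns a callable h(x) -> {-1, +1}; all names are illustrative, not from the slides.
import numpy as np

def adaboost(X, y, train_weak, rounds=50):
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)                       # initialize instance weights uniformly
    models, alphas = [], []
    for _ in range(rounds):
        h = train_weak(X, y, w)                   # base procedure on the weighted data
        pred = np.array([h(x) for x in X])
        err = w[pred != y].sum()                  # weighted training error
        if err == 0 or err >= 0.5:                # perfect or no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this weak learner
        w *= np.exp(-alpha * y * pred)            # up-weight misclassified instances
        w /= w.sum()                              # normalizing factor
        models.append(h)
        alphas.append(alpha)
    return lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, models)))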
  • 148. Binary Classifier • Constraint: – Must not have all zero clicks for current week, previous week and week before last [shopping team uses stronger constraint: only instances with non-zero clicks for current week]. • Training: – 1.5M instances. – 0.5M instances (validation). • Feature extraction: – 4.82mins (Cosmos job). • Training time: – 2hrs 20mins. • Testing: – 10k instances: 1sec. 148
  • 149. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 149
  • 150. POPULARITY EVALUATION How do we know we have a good popularity? 150
  • 151. Rank Correlation Metrics Values range from +1, when the two rankings are the same, to −1, when the two rankings are the reverse of each other. The actual input is a set of objects with two rank scores (ties are possible). 151
  • 152. Kendall’s Tau Coefficient Considers concordant/discordant pairs in two rankings (each ranking w.r.t. the other): τ = (n_c − n_d) / (n(n−1)/2), where n_c is the number of concordant pairs, n_d the number of discordant pairs, and n the number of objects. 152
  • 153. What is a concordant pair? A pair of objects is concordant when the two rankings order it in the same way, i.e., the rank differences within the pair have the same sign in both rankings; otherwise the pair is discordant. 153
  • 154. Kendall Tau: Example A C B D C A D B Pairs: (discordant pairs in red): 154 Observation: Total number of discordant pairs = 2x the discordant pairs in one ranking w.r.t. the other.
  • 155. Spearman’s Coefficient Considers ranking differences for the same object: ρ = 1 − 6·Σ_i d_i² / (n(n² − 1)), where d_i is the difference between the two ranks of object i and n is the number of objects. 155
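A small Python sketch of both coefficients for two rankings without ties, each given as a list of rank positions over the same objects (formulas as above; helper names are ours).
from itertools import combinations

def kendall_tau(r1, r2):
    n = len(r1)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (r1[i] - r1[j]) * (r2[i] - r2[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def spearman(r1, r2):
    n = len(r1)
    d_squared = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Identical rankings give +1, fully reversed rankings give -1:
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]), spearman([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 -1.0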
  • 156. Rank Intuitions: Setup 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 The sequence <3,1,4,10,5,9,2,6,8,7> is sufficient to encode the two rankings. 156
  • 157. Rank Intuitions: Pairs Rankings in complete agreement. Rankings in complete dis-agreement. 157
  • 158. Rank Intuitions: Spearman Segment lengths represent R1 rank scores. 158
  • 159. Rank Intuitions: Kendall Segment lengths represent R1 rank scores. 159
  • 160. What about ties? The position of an object within set of objects with the same scores in the rankings affects the rank correlation. 160 For example, red positioning of oj leads to lower Spearman’s coefficient; green – higher.
  • 161. Ties • Kendall: Strict discordance: • Spearman: – Can use per entity upper and lower bounds. – Do as in the Olympics: 161
  • 162. Ties: Kendall TauB τ_B = (n_c − n_d) / sqrt((n_0 − n_1)(n_0 − n_2)), where: n_c is the number of concordant pairs, n_d is the number of discordant pairs, n_0 = n(n−1)/2 with n the number of objects in the two rankings, and n_1, n_2 are the tie corrections Σ_i t_i(t_i−1)/2 over the groups of tied objects in each of the two rankings. http://en.wikipedia.org/wiki/Kendall_tau#Tau-b 162
  • 163. Uses of popularity Popularity can be used to augment the gain in NDCG by linearly scaling it: ratings 1, 2, 3, 4, 5 (poor, fair, good, excellent, perfect) map to gains 1, 3, 7, 15, 31 (i.e., 2^rating − 1). 163
  • 164. Next Steps • How to determine the popularity of new entities – Challenge: No historical data. – Usually there is an initial period of high popularity (e.g., a new restaurant is featured in the local paper, promotions, etc.). • Good abandonment (no user clicks but a good entity in terms of satisfying the user information needs, e.g., a phone number). – Use the number of impressions for named queries. 164
  • 165. References 1. Yaser S. Abu-Mostafa, Malik Magdon-Ismail & Hsuan-Tien Lin (2012) Learning From Data. AMLBook. [link] 2. Ethem Alpaydin (2009) Introduction to Machine Learning. 2nd edition. Adaptive Computation and Machine Learning series. MIT Press. [link] 3. David Barber (2012) Bayesian Reasoning and Machine Learning. Cambridge University Press. [link] 4. Ricardo Baeza-Yates & Berthier Ribeiro-Neto (2011) Modern Information Retrieval: The Concepts and Technology behind Search. 2nd Edition. ACM Press Books. [link] 5. Alex Bellos (2010) Alex's Adventures in Numberland. Bloomsbury: New York. [link] 6. Ron Bekkerman, Mikhail Bilenko & John Langford (2011) Scaling up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press. [link] 7. Christopher M. Bishop (2007) Pattern Recognition and Machine Learning. Information Science and Statistics. Springer. [link] 8. George Casella & Roger L. Berger (2001) Statistical Inference. 2nd edition. Duxbury Press. [link] 9. Anirban DasGupta (2011) Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. Springer Texts in Statistics. Springer. [link] 10. Luc Devroye, László Györfi & Gábor Lugosi (1996) A Probabilistic Theory of Pattern Recognition. Springer. [link] 11. Richard O. Duda, Peter E. Hart & David G. Stork (2000) Pattern Classification. 2nd Edition. Wiley-Interscience. [link] 12. Trevor Hastie, Robert Tibshirani & Jerome Friedman (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Edition. Springer Series in Statistics. Springer. [link] 13. James L. Johnson (2008) Probability and Statistics for Computer Science. Wiley-Interscience. [link] 14. Daphne Koller & Nir Friedman (2009) Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning series. MIT Press. [link] 15. David J. C. MacKay (2003) Information Theory, Inference and Learning Algorithms. Cambridge University Press. [link] 16. Zbigniew Michalewicz & David B. Fogel (2004) How to Solve It: Modern Heuristics. 2nd edition. Springer. [link] 17. Tom M. Mitchell (1997) Machine Learning. McGraw-Hill (Science/Engineering/Math). [link] 18. Mehryar Mohri, Afshin Rostamizadeh & Ameet Talwalkar (2012) Foundations of Machine Learning. Adaptive Computation and Machine Learning series. MIT Press. [link] 19. Lior Rokach (2010) Pattern Classification Using Ensemble Methods. World Scientific. [link] 20. Gilbert Strang (1991) Calculus. Wellesley-Cambridge Press. [link] 21. Larry Wasserman (2010) All of Statistics: A Concise Course in Statistical Inference. Springer Texts in Statistics. Springer. [link] 22. Sholom M. Weiss, Nitin Indurkhya & Tong Zhang (2010) Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer. [link] 165
  • 166. Roadmap • Examples of applications of Machine Learning • Encoding objects with features • The Machine Learning framework • Linear models – Perceptron, Winnow , Logistic Regression, Robust Risk Minimization (RRM) • Tree models (Decision Trees DTs) – Classification Decision Trees, Regression Trees • Boosting – AdaBoost • Ranking evaluation – Kendall tau and Spearman’s coefficient • Sequence labeling – Hidden Markov Models (HMMs) 166
  • 167. SEQUENCE LABELING: HIDDEN MARKOV MODELS (HMMs) 167
  • 168. Outline • The guessing game • Tagging preliminaries • Hidden Markov Models • Trellis and the Viterbi algorithm • Implementation (Python) • Complexity of decoding • Parameter estimation and smoothing • Second order models 168
  • 169. The Guessing Game • A cow and duck write an email message together. • Goal – figure out which word is written by which animal. 169 The cow/duck illustration of HMMs is due to Ralph Grishman (NYU).
  • 170. What’s the Big Deal ? • The vocabularies of the cow and the duck can overlap and it is not clear a priori who wrote a certain word! 170
  • 171. The Game (cont) ? ? ? moo hello quack COW ? DUCK moo hello quack 171
  • 172. The Game (cont) DUCK COW COW DUCK moo hello quack 172
  • 173. What about the Rest of the Animals? ANT ANT ANT ANT ANT COW COW COW COW COW DUCK DUCK DUCK DUCK DUCK PIG PIG PIG PIG PIG ZEBRA ZEBRA ZEBRA ZEBRA ZEBRA word1 word2 word3 word4 word5 173
  • 174. A Game for Adults • Instead of guessing which animal is associated with each word guess the corresponding POS tag of a word. Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD ./. 174
  • 175. POS Tags "CC", "CD", "DT", "EX", "FW", "IN", "JJ", "JJR", "JJS", "LS", "MD","NN", "NNS","NNP", "NNPS", "PDT", "POS", "PRP", "PRP$", "RB", "RBR", "RBS", "RP", "SYM", "TO", "UH", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "WDT", "WP", "WP$", "WRB", "#", "$", ".",",", ":", "(", ")", "`", "``", "'", "''" 175
  • 176. Tagging Preliminaries • We want the best set of tags for a sequence of words (a sentence) • W — a sequence of words • T — a sequence of tags. T̂ = argmax_T P(T | W) 176
  • 177. Bayes’ Theorem (1763) posterior P(T | W) = P(W | T) · P(T) / P(W), where P(W | T) is the likelihood, P(T) the prior, and P(W) the marginal likelihood. Reverend Thomas Bayes — Presbyterian minister (1702-1761) 177
  • 178. Applying Bayes’ Theorem • How do we approach P(T|W)? • Use Bayes’ theorem: argmax_T P(T | W) = argmax_T P(W | T) P(T) / P(W). • So what? Why is it better? • Ignore the denominator (and the question): argmax_T P(T | W) = argmax_T P(W | T) P(T) / P(W) = argmax_T P(W | T) P(T). 178
  • 179. Tag Sequence Probability How do we get the probability P(T) of a specific tag sequence T? • Count the number of times a sequence occurs and divide by the number of sequences of that length — not likely! – Use chain rule 179
  • 180. Chain Rule P(T) = P(t_1, …, t_n) = P(t_1) P(t_2 | t_1) P(t_3 | t_1 t_2) … P(t_n | t_1, …, t_{n−1}), each term conditioned on the history. P(T) is a product of the probability of the N-grams that make it up. Make a Markov assumption: the current tag depends on the previous one only: P(t_1, …, t_n) = P(t_1) ∏_{i=2}^{n} P(t_i | t_{i−1}). 180
  • 181. Transition Probabilities • Use counts from a large hand-tagged corpus. • For bi-grams, count all the t_{i−1} t_i pairs: P(t_i | t_{i−1}) = c(t_{i−1} t_i) / c(t_{i−1}). • Some counts are zero – we’ll use smoothing to address this issue later. 181
  • 182. What about P(W|T)? • First it's odd—it is asking the probability of seeing “The white horse” given “Det Adj Noun”! – Collect up all the times you see that tag sequence and see how often “The white horse” shows up … • Assume each word in the sequence depends only on its corresponding tag: P(W | T) = ∏_{i=1}^{n} P(w_i | t_i) (the emission probabilities). 182
  • 183. Emission Probabilities • What proportion of times is the word w_i associated with the tag t_i (as opposed to another word): P(w_i | t_i) = c(w_i, t_i) / c(t_i). 183
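A minimal Python sketch of estimating the bi-gram transition and emission probabilities by counting over a hand-tagged corpus (no smoothing); the corpus is assumed to be a list of sentences, each a list of (word, tag) pairs, and the '^' sentence-initial pseudo-tag is an assumption.
from collections import Counter

def estimate_hmm(corpus):
    trans_counts, emit_counts, tag_counts = Counter(), Counter(), Counter()
    for sentence in corpus:
        prev = '^'                                # sentence-initial pseudo-tag
        tag_counts['^'] += 1
        for word, tag in sentence:
            trans_counts[(prev, tag)] += 1        # c(t_{i-1} t_i)
            emit_counts[(tag, word)] += 1         # c(w_i, t_i)
            tag_counts[tag] += 1                  # c(t_i)
            prev = tag
    p_trans = {pair: c / tag_counts[pair[0]] for pair, c in trans_counts.items()}
    p_emit  = {pair: c / tag_counts[pair[0]] for pair, c in emit_counts.items()}
    return p_trans, p_emit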
  • 184. The “Standard” Model argmax_T P(T | W) = argmax_T P(W | T) P(T) / P(W) = argmax_T P(W | T) P(T) = argmax_T ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i−1}). 184
  • 185. Hidden Markov Models • Stochastic process: A sequence X_1, X_2, … of random variables based on the same sample space Ω. • Probabilities for the first observation: P(X_1 = x_j) for each outcome x_j. • Next step given previous history: P(X_{t+1} = x_{i_{t+1}} | X_1 = x_{i_1}, …, X_t = x_{i_t}). 185
  • 186. Markov Chain • A Markov Chain is a stochastic process with the Markov property: P(X_{t+1} = x_{i_{t+1}} | X_1 = x_{i_1}, …, X_t = x_{i_t}) = P(X_{t+1} = x_{i_{t+1}} | X_t = x_{i_t}). • Outcomes are called states. • Probabilities for the next step – weighted finite state automata. 186
  • 187. State Transitions w/ Probabilities START → COW 1.0; COW → COW 0.5, COW → DUCK 0.3, COW → END 0.2; DUCK → DUCK 0.5, DUCK → COW 0.3, DUCK → END 0.2. 187
  • 188. Markov Model A Markov chain where each state can output signals (like “Moore machines”): COW emits moo:0.9, hello:0.1; DUCK emits hello:0.4, quack:0.6; START emits ^:1.0; END emits $:1.0; transitions as before (START → COW 1.0; COW → COW 0.5, DUCK 0.3, END 0.2; DUCK → DUCK 0.5, COW 0.3, END 0.2). 188
  • 189. The Issue Was • A given output symbol can potentially be emitted by more than one state — omnipresent ambiguity in natural language. 189
  • 190. Markov Model Finite set of states: {s_1, …, s_m}. Signal alphabet: {σ_1, …, σ_k}. Transition matrix: P = [p_ij], where p_ij = P(X_{t+1} = s_j | X_t = s_i). Emission probabilities: A = [a_ij], where a_ij = P(o_t = σ_j | X_t = s_i). Initial probability vector: v = [v_1, …, v_m], where v_j = P(X_1 = s_j). 190
  • 191. Graphical Model STATE TAG … OUTPUT word 191
  • 192. Hidden Markov Model • A Markov Model for which it is not possible to observe the sequence of states. • S: unknown — the sequence of states (the tags). • O: known — the sequence of observations (the words). • Decoding: argmax_S P(S | O). 192
  • 193. The State Space moo:0.9 hello:0.1 0.5 0.5 COW COW COW 1.0 0.2 0.3 0.3 START END 0.3 0.3 0.0 0.2 0.5 0.5 DUCK DUCK DUCK hello:0.4 quack:0.6 moo hello quack More on how the probabilities come about (training) later. 193
  • 194. Optimal State Sequence: The Viterbi Algorithm We define the joint probability of the most likely state sequence from time 1 to time t ending in state s_i and the observed sequence O_{≤t} up to time t: δ_t(i) = max_{S_{≤t−1}} P(S_{≤t−1}, X_t = s_i; O_{≤t}) = max_{s_{i_1}, …, s_{i_{t−1}}} P(X_1 = s_{i_1}, …, X_{t−1} = s_{i_{t−1}}, X_t = s_i; O_{≤t}). 194
  • 195. Key Observation The most likely partial derivation leading to state s_i at position t consists of: – the most likely partial derivation leading to some state s_{i_{t−1}} at the previous position t−1, – followed by the transition from s_{i_{t−1}} to s_i. 195
  • 196. Viterbi (cont) Note: δ_1(i) = v_i · a_{i k_1}, where v_i = P(X_1 = s_i) and a_{i k_1} = P(O_1 = σ_{k_1} | X_1 = s_i). We will show that: δ_t(j) = [max_i δ_{t−1}(i) · p_ij] · a_{j k_t}. 196
  • 197. Recurrence Equation δ_t(j) = max_{S_{≤t−1}} P(S_{≤t−1}, X_t = s_j; O_{≤t}) = max_i max_{S_{≤t−2}} P(S_{≤t−2}, X_{t−1} = s_i, X_t = s_j; O_{≤t−1}, O_t = σ_{k_t}) = max_i max_{S_{≤t−2}} P(X_t = s_j, O_t = σ_{k_t} | S_{≤t−2}, X_{t−1} = s_i; O_{≤t−1}) · P(S_{≤t−2}, X_{t−1} = s_i; O_{≤t−1}) = max_i [ P(X_t = s_j | X_{t−1} = s_i) · P(O_t = σ_{k_t} | X_t = s_j) · max_{S_{≤t−2}} P(S_{≤t−2}, X_{t−1} = s_i; O_{≤t−1}) ] = [max_i p_ij · δ_{t−1}(i)] · a_{j k_t}. 197
  • 198. Back Pointers • The predecessor of state s_j in the path corresponding to δ_t(j): ψ_t(j) = argmax_{1 ≤ i ≤ m} (δ_{t−1}(i) · p_ij). • Optimal state sequence: s*_{k_n} = argmax_{1 ≤ i ≤ m} δ_n(i), and s*_{k_t} = ψ_{t+1}(s*_{k_{t+1}}) for t = n−1, …, 1. 198
  • 199. The Trellis
            ^ (t=0)   moo (t=1)   hello (t=2)   quack (t=3)   $ (t=4)
    START       1          0            0             0           0
    COW         0        0.9        0.045             0           0
    DUCK        0          0        0.108        0.0324           0
    END         0          0            0             0     0.00648
(At t=3 the discarded candidate for DUCK coming through COW is 0.0081; the surviving 0.0324 comes through DUCK.) 199
  • 200. Implementation (Python)
observations = ['^', 'moo', 'hello', 'quack', '$']   # signal sequence
states = ['start', 'cow', 'duck', 'end']

# Transition probabilities - p[FromState][ToState] = probability
p = {'start': {'cow': 1.0},
     'cow':   {'cow': 0.5, 'duck': 0.3, 'end': 0.2},
     'duck':  {'duck': 0.5, 'cow': 0.3, 'end': 0.2}}

# Emission probabilities; special emission symbol '$' for the 'end' state
a = {'cow':  {'moo': 0.9, 'hello': 0.1, 'quack': 0.0, '$': 0.0},
     'duck': {'moo': 0.0, 'hello': 0.4, 'quack': 0.6, '$': 0.0},
     'end':  {'moo': 0.0, 'hello': 0.0, 'quack': 0.0, '$': 1.0}} 200
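The deck continues past this slide; as a hedged continuation (an assumption, not the author's code), here is one way the Viterbi decoder over the model defined above might look. delta[t][s] is the probability of the best path ending in state s after observation t, and psi holds the back pointers.
def viterbi(observations, states, p, a):
    delta = [{'start': 1.0}]                      # the '^' observation is emitted by 'start'
    psi = [{}]
    for t in range(1, len(observations)):
        delta.append({})
        psi.append({})
        for s in states:
            best_prob, best_prev = max(
                (delta[t - 1].get(prev, 0.0) * p.get(prev, {}).get(s, 0.0), prev)
                for prev in states)
            delta[t][s] = best_prob * a.get(s, {}).get(observations[t], 0.0)
            psi[t][s] = best_prev
    state = max(delta[-1], key=delta[-1].get)     # best final state
    path = [state]
    for t in range(len(observations) - 1, 0, -1): # follow the back pointers
        state = psi[t][state]
        path.append(state)
    return list(reversed(path))

print(viterbi(observations, states, p, a))
# ['start', 'cow', 'duck', 'duck', 'end'] -- matching the trellis on the previous slide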

Editor's Notes

  1. Outcomes: POSITIVE, NEGATIVE, MIXED, NEUTRAL, UNKNOWN
  2. Polonius, Hamlet Act 2, scene 2, 193–206
  3. Or the Density(entity) can be normalized click through rate for entities near the given input entity.
  4. N parts whose order is then reversed: Spearman 2 parts: -0.500416782439567; 3 parts: -0.778271742; 4 parts: -0.875520978049458150597; 5 parts: -0.920533481522645; 6 parts: -0.944984717977216; 10 parts: -0.980550152820228; 15 parts: -0.991664351208669
  5. N parts whose order is then reversed: Kendall 2 parts: -0.0169491525423728; 3 parts: -0.35593220338983; 4 parts: -0.525423728813559; 5 parts: -0.627118644067797; 6 parts: -0.694915254237288; 10 parts: -0.830508474576271; 15 parts: -0.898305084745763
  6. Josiah Godfrey (for data). Patrick Haluptzok on matching.