DBM630: Data Mining and Data Warehousing
MS.IT., Rangsit University
Semester 2/2011

Lecture 6
Classification and Prediction
Decision Tree and Classification Rules

by Kritsada Sriphaew (sriphaew.k AT gmail.com)
1
Topics
• What Is Classification, What Is Prediction?
• Decision Tree
• Classification Rule: Covering Algorithm

 2                          Data Warehousing and Data Mining by Kritsada Sriphaew
What Is Classification?
• Case
  - A bank loan officer needs analysis of her data in order to learn
    which loan applicants are “safe” and which are “risky” for the bank
  - A marketing manager needs data analysis to help predict whether a
    customer with a given profile will buy a new computer or not
  - A medical researcher wants to analyze breast cancer data in order
    to predict which one of three specific treatments a patient should receive
• The data analysis task is classification, where a model or
  classifier is constructed to predict categorical labels
• The model is called a classifier


    3
                                         Data Warehousing and Data Mining by Kritsada Sriphaew
What Is Prediction?
• Suppose that the marketing manager would like to predict how
  much a given customer will spend during a sale at the shop
• This data analysis task is numeric prediction, where the model
  constructed predicts a continuous value or ordered values, as
  opposed to a categorical label
• This model is a predictor
• Regression analysis is a statistical methodology that is most often
  used for numeric prediction




    4
                                        Data Warehousing and Data Mining by Kritsada Sriphaew
How does classification work?
• Data classification is a two-step process
• The first step is the learning step or training phase
  - A model is built describing a predetermined set of data classes or concepts
  - The data tuples used to build the classification model are called the training data set
  - If the class label of each tuple is provided, this step is known as supervised learning;
    otherwise it is called unsupervised learning
  - The learned model may be represented in the form of classification rules,
    decision trees, Bayesian models, mathematical formulae, etc.




    5
                                                   Data Warehousing and Data Mining by Kritsada Sriphaew
How does classification work?
• In the second step,
  - The learned model is used for classification
  - Estimate the predictive accuracy of the model using a hold-out data set (a test set
    of class-labeled samples which are randomly selected and are independent of the
    training samples); see the evaluation sketch after this slide
  - If the accuracy of the model were estimated on the training data set, the
    model would tend to overfit the data
  - If the accuracy of the model is considered acceptable, the model can be used to
    classify future data tuples or objects for which the class label is unknown
  - In experiments, there are three kinds of data set: the training data set, the hold-out
    data set (or validation data set), and the test data set




    6
                                                 Data Warehousing and Data Mining by Kritsada Sriphaew
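
The two-step process can be made concrete with a small hold-out evaluation.
A minimal sketch, assuming scikit-learn is available; the toy X and y below
are placeholders, not data from the lecture:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Placeholder data; in practice X holds feature tuples, y class labels.
    X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 10
    y = [0, 1, 1, 0] * 10

    # Step 2's hold-out set: randomly selected, independent of training data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)  # step 1: learning

    # Accuracy measured on the training set overstates performance
    # (overfitting); the hold-out estimate is the one to report.
    print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
    print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))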
Issues Regarding Classification/Prediction
• Comparing classification methods
  - The criteria to compare and evaluate classification and prediction methods:
  - Accuracy: the ability of a given classifier to correctly predict the class label of new
    or unseen data
  - Speed: the computation costs involved in generating and using the given classifier
    or predictor
  - Robustness: the ability of the classifier or predictor to make correct predictions
    given noisy data or data with missing values
  - Scalability: the ability to construct the classifier or predictor efficiently given large
    amounts of data
  - Interpretability: the level of understanding and insight that is provided by the
    classifier or predictor; subjective and more difficult to assess



    7
                                                    Data Warehousing and Data Mining by Kritsada Sriphaew
Decision Tree
• A decision tree is a flow-chart-like tree structure, where
  - each internal node denotes a test on an attribute,
  - each branch represents an outcome of the test,
  - each leaf node represents a class,
  - the top-most node in a tree is the root node
• Instead of using the complete set of features jointly to make a decision, different
  subsets of features are used at different levels of the tree during making a decision

  The decision tree below represents the concept buys_computer:

    Age?
    ├─ <=30   → student?
    │           ├─ no  → no
    │           └─ yes → yes
    ├─ 31…40  → yes
    └─ >40    → Credit_rating?
                ├─ excellent → no
                └─ fair      → yes

    8                                                                          Classification – Decision Tree
Decision Tree Induction
• Normal procedure: a greedy algorithm that works top-down
  in recursive divide-and-conquer fashion (see the sketch after this slide)
  - First: an attribute is selected for the root node and a branch is
    created for each possible attribute value
  - Then: the instances are split into subsets (one for each
    branch extending from the node)
  - Finally: the procedure is repeated recursively for each
    branch, using only instances that reach the branch
  - The process stops if
    - All instances for a given node belong to the same class
    - No attribute remains on which the samples may be further partitioned → a majority
      vote is employed
    - No sample remains for the branch to test the attribute → a majority vote is employed


    9                                                                            Classification – Decision Tree
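
As a concrete illustration of this greedy top-down procedure, here is a
minimal ID3-style sketch in Python. The representation (instances as
(attribute-dict, label) pairs) and the helper names are this sketch's own
choices, not from the lecture:

    from collections import Counter
    from math import log2

    def entropy_of(labels):
        n = len(labels)
        return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

    def information_gain(instances, attr):
        """info. before splitting on attr minus info. after splitting."""
        labels = [c for _, c in instances]
        after = 0.0
        for value in {x[attr] for x, _ in instances}:
            subset = [c for x, c in instances if x[attr] == value]
            after += len(subset) / len(labels) * entropy_of(subset)
        return entropy_of(labels) - after

    def build_tree(instances, attributes):
        """instances: list of (dict of attribute values, class label)."""
        labels = [c for _, c in instances]
        if len(set(labels)) == 1:           # all instances in the same class
            return labels[0]
        if not attributes:                  # no attribute left: majority vote
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: information_gain(instances, a))
        remaining = [a for a in attributes if a != best]
        return (best, {v: build_tree([(x, c) for x, c in instances
                                      if x[best] == v], remaining)
                       for v in {x[best] for x, _ in instances}})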
Decision Tree Representation
(An Example)
• The decision tree (DT) of the weather example is:

  Outlook  Temp. Humid. Windy Play
  sunny    hot   high   false  N
  sunny    hot   high   true   N
  overcast hot   high   false  Y
  rainy    mild  high   false  Y
  rainy    cool  normal false  Y
  rainy    cool  normal true   N
  overcast cool  normal true   Y
  sunny    mild  high   false  N
  sunny    cool  normal false  Y
  rainy    mild  normal false  Y
  sunny    mild  normal true   Y
  overcast mild  high   true   Y
  overcast hot   normal false  Y
  rainy    mild  high   true   N

  Decision tree induced:

    outlook
    ├─ sunny    → humidity
    │             ├─ high   → no
    │             └─ normal → yes
    ├─ overcast → yes
    └─ rainy    → windy
                  ├─ false → yes
                  └─ true  → no

  10                                                  Classification – Decision Tree
An Example
(Which attribute is the best?)
• There are four possibilities for the first split, one for each attribute:
  outlook, temperature, humidity, and windy
  [Figure not reproduced: the four candidate splits and the class
  distribution in each resulting subset]

 11                                       Classification – Decision Tree
Criterions for Attribute Selection
• Which is the best attribute?
  - The one which will result in the smallest tree
  - Heuristic: choose the attribute that produces the “purest” nodes
• Popular impurity criterion: information gain
  - Information gain increases with the average purity
    of the subsets that an attribute produces
• Strategy: the attribute with the highest information gain
  is chosen as the test attribute for the current node
12                                                       Classification – Decision Tree
Computing “Information”
• Information is measured in bits
  - Given a probability distribution, the information required to predict an event is
    the distribution's entropy
  - Entropy gives the information required in bits (this can involve fractions of bits!)
  - Information gain measures the goodness of a split
• Formula for computing expected information:
  - Let S be a set consisting of s data instances, where the class label attribute has
    n distinct classes Ci (for i = 1, …, n)
  - Let si be the number of instances in class Ci
  - The expected information or entropy is

      info([s1, s2, …, sn]) = entropy(s1/s, s2/s, …, sn/s) = −Σi pi log2(pi)

    where pi is the probability that an instance belongs to class Ci, i.e., pi = si/s
• Formula for computing the information gain of an attribute A
  (see the code sketch after this slide):

      gain(A) = info. before splitting − info. after splitting

    13                                                                                   Classification – Decision Tree
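
These formulas are short enough to sketch directly. A minimal Python
rendering, assuming class counts are given as lists (the function names are
this sketch's own):

    from math import log2

    def entropy(counts):
        """info([s1, s2, ...]): entropy in bits of one node's class counts."""
        s = sum(counts)
        return -sum((si / s) * log2(si / s) for si in counts if si)

    def expected_info(splits):
        """info. after splitting: weighted entropy of the subsets produced."""
        s = sum(sum(subset) for subset in splits)
        return sum(sum(subset) / s * entropy(subset) for subset in splits)

    def gain(before, splits):
        """gain(A) = info. before splitting - info. after splitting."""
        return entropy(before) - expected_info(splits)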
Expected Information for “Outlook”
• “Outlook” = “sunny”:
    info([2,3]) = entropy(2/5, 3/5) = −(2/5)log2(2/5) − (3/5)log2(3/5) = 0.971 bits
• “Outlook” = “overcast”:
    info([4,0]) = entropy(1, 0) = −(1)log2(1) − (0)log2(0) = 0 bits
• “Outlook” = “rainy”:
    info([3,2]) = entropy(3/5, 2/5) = −(3/5)log2(3/5) − (2/5)log2(2/5) = 0.971 bits
• Expected information for attribute “Outlook”:
    info([2,3],[4,0],[3,2]) = (5/14)·info([2,3]) + (4/14)·info([4,0]) + (5/14)·info([3,2])
                            = (5/14)·0.971 + (4/14)·0 + (5/14)·0.971
                            = 0.693 bits
    14                                                                Classification – Decision Tree
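
Using the helpers sketched above, the numbers on this slide can be
reproduced:

    print(entropy([2, 3]))                          # 0.9709... ("sunny")
    print(expected_info([[2, 3], [4, 0], [3, 2]]))  # 0.6935... ("Outlook")
    print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))   # 0.2467... -> 0.247 bits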
Information Gain for “Outlook”
• Information gain:
    info. before splitting − info. after splitting
    gain(“Outlook”) = info([9,5]) − info([2,3],[4,0],[3,2])
                    = 0.940 − 0.693
                    = 0.247 bits
• Information gain for all attributes from the weather data:
    gain(“Outlook”)     = 0.247 bits
    gain(“Temperature”) = 0.029 bits
    gain(“Humidity”)    = 0.152 bits
    gain(“Windy”)       = 0.048 bits


    15                                            Classification – Decision Tree
An Example of Gain Criterion
(Which attribute is the best?)
  Gain(outlook)     = info([9,5]) − info([2,3],[4,0],[3,2]) = 0.247   ← the best
  Gain(temperature) = info([9,5]) − info([2,2],[4,2],[3,1]) = 0.029
  Gain(humidity)    = info([9,5]) − info([3,4],[6,1])       = 0.152
  Gain(windy)       = info([9,5]) − info([6,2],[3,3])       = 0.048

 16                                                      Classification – Decision Tree
Continuing to Split




  [Figure not reproduced: the expanded tree after splitting on “Outlook”]

  If “Outlook” = “sunny”:
    gain(“Temperature”) = 0.571 bits
    gain(“Humidity”)    = 0.971 bits
    gain(“Windy”)       = 0.020 bits
17                                       Classification – Decision Tree
The Final Decision Tree




  [Figure not reproduced: the final decision tree for the weather data]

• Note: not all leaves need to be pure; sometimes identical
  instances have different classes
• Splitting stops when the data can't be split any further


   18                                              Classification – Decision Tree
Properties for a Purity Measure
• Properties we require from a purity measure:
  - When a node is pure, the measure should be zero
  - When impurity is maximal (i.e. all classes equally likely),
    the measure should be maximal
  - The measure should obey the multistage property (i.e. decisions can be
    made in several stages):

      measure([2,3,4]) = measure([2,7]) + (7/9)·measure([3,4])

• Entropy is the only function that satisfies all
  three properties!
19                                                     Classification – Decision Tree
Some Properties for the Entropy
• The multistage property:
    entropy(p, q, r) = entropy(p, q+r) + [(q+r)/(p+q+r)] × entropy(q, r)
• Ex.: info([2,3,4]) can be calculated as
    = [−(2/9)log2(2/9) − (7/9)log2(7/9)] + (7/9)·[−(3/7)log2(3/7) − (4/7)log2(4/7)]
    = −(2/9)log2(2/9) − (7/9)[ log2(7/9) + (3/7)log2(3/7) + (4/7)log2(4/7) ]
    = −(2/9)log2(2/9)
      − (7/9)[ (3/7)log2(7/9) + (4/7)log2(7/9) + (3/7)log2(3/7) + (4/7)log2(4/7) ]
    = −(2/9)log2(2/9)
      − (7/9)[ (3/7)log2(7/9 × 3/7) + (4/7)log2(7/9 × 4/7) ]
    = −(2/9)log2(2/9) − (7/9)[ (3/7)log2(3/9) + (4/7)log2(4/9) ]
    = −(2/9)log2(2/9) − (3/9)log2(3/9) − (4/9)log2(4/9)



    20                                                            Classification – Decision Tree
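
The multistage property can also be checked numerically with the entropy
helper from the earlier sketch:

    print(entropy([2, 3, 4]))                          # 1.5304... bits
    print(entropy([2, 7]) + (7/9) * entropy([3, 4]))   # 1.5304... (same value)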
A Problem: Highly-Branching Attributes
• Problematic: attributes with a large number of
  values (extreme case: ID code)
• Subsets are more likely to be pure if there is a
  large number of values
• Information gain is biased towards choosing
  attributes with a large number of values
• This may result in overfitting (selection of an
  attribute that is non-optimal for prediction) and
  fragmentation

21                                    Classification – Decision Tree
Example: Highly-Branching Attributes
  ID  Outlook  Temp. Humid. Windy Play
  A   sunny    hot   high   false  N
  B   sunny    hot   high   true   N
  C   overcast hot   high   false  Y
  D   rainy    mild  high   false  Y
  E   rainy    cool  normal false  Y
  F   rainy    cool  normal true   N
  G   overcast cool  normal true   Y
  H   sunny    mild  high   false  N
  I   sunny    cool  normal false  Y
  J   rainy    mild  normal false  Y
  K   sunny    mild  normal true   Y
  L   overcast mild  high   true   Y
  M   overcast hot   normal false  Y
  N   rainy    mild  high   true   N

  Splitting on ID gives one pure single-instance branch per value (A, B, …, N).
  Entropy of the split:
    info(“ID”) = info([0,1],[0,1],[1,0],…,[0,1]) = 0 bits
    gain(“ID”) = 0.940 bits (the maximum possible)

     22                                                  Classification – Decision Tree
Modification: The Gain Ratio as a Split Criterion
• Gain ratio: a modification of the information
  gain that reduces its bias
• Gain ratio takes the number and size of branches
  into account when choosing an attribute
  - It corrects the information gain by taking the
    intrinsic information of a split into account
• Intrinsic information: entropy of the distribution of
  instances into branches
  (i.e. how much information do we need to tell which branch an
  instance belongs to)

23                                            Classification – Decision Tree
Computing the Gain Ratio
• Example: intrinsic information (split info) for the ID code:
    info([1,1,…,1]) = 14 × ( −(1/14)·log2(1/14) ) = 3.807 bits
• The value of an attribute decreases as its intrinsic
  information gets larger
• Definition of gain ratio:
    gain_ratio(“Attribute”) = gain(“Attribute”) / intrinsic_info(“Attribute”)
• Example:
    gain_ratio(“ID”) = gain(“ID”) / intrinsic_info(“ID”)
                     = 0.940 bits / 3.807 bits = 0.247

    24                                                  Classification – Decision Tree
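
In code, the split info is just the entropy of the branch sizes, so the
earlier entropy helper can be reused (a sketch; the names are this sketch's
own):

    def gain_ratio(g, branch_sizes):
        """Gain ratio = information gain / intrinsic info of the split."""
        return g / entropy(branch_sizes)    # entropy() from the earlier sketch

    print(entropy([1] * 14))            # 3.807...: split info of the ID code
    print(gain_ratio(0.940, [1] * 14))  # 0.2468... -> the 0.247 on this slide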
Gain Ratio for Weather Data

  [Table not reproduced: gain and gain ratio for the four weather
  attributes; “Outlook” scores highest among them]

25                            Classification – Decision Tree
Gain Ratio for Weather Data (Discussion)
• “Outlook” still comes out top
• However: “ID” has an even greater gain ratio
  - Standard fix: an ad hoc test to prevent splitting on
    that type of attribute
• Problem with gain ratio: it may
  overcompensate
  - It may choose an attribute just because its intrinsic
    information is very low
  - Standard fix: only consider attributes with greater
    than average information gain

26                                            Classification – Decision Tree
Avoiding Overfitting the Data
• The naïve DT algorithm grows each branch of
  the tree just deeply enough to perfectly classify
  the training examples.
• This algorithm may produce trees that overfit
  the training examples but do not work well for
  general cases.
• Reason: the training set may have some noise,
  or it is too small to produce a representative
  sample of the true target tree (function).

27                                    Classification – Decision Tree
Avoid Overfitting: Pruning
• Pruning simplifies a decision tree to prevent overfitting to noise
  in the data
• Two main pruning strategies:
  1. Prepruning: stop growing a tree when there is no statistically significant
     association between any attribute and the class at a particular node.
     Most popular test: the chi-squared test; only statistically significant
     attributes were allowed to be selected by the information gain procedure
  2. Postpruning: take a fully-grown decision tree and discard unreliable
     parts by two main pruning operations, i.e., subtree replacement and
     subtree raising, with some possible strategies, e.g., error estimation,
     significance testing, the MDL principle.
• Prepruning is preferred in practice because it stops growing early


    28                                                        Classification – Decision Tree
Subtree Replacement
• Bottom-up: a tree is considered for replacement once all its
  subtrees have been considered

  [Figure not reproduced: a subtree being replaced by a single leaf]

    29                                          Classification – Decision Tree
Subtree Raising
• Deletes a node and redistributes its instances
• Slower than subtree replacement (Worthwhile?)

  [Figure not reproduced: a subtree raised into its parent's position]


    30                                       Classification – Decision Tree
Tree to Rule vs. Rule to Tree
  Tree to rules is straightforward (one rule per leaf of the weather tree
  from slide 10):

    If outlook=sunny & humidity=high then class=no
    If outlook=sunny & humidity=normal then class=yes
    If outlook=overcast then class=yes
    If outlook=rainy & windy=false then class=yes
    If outlook=rainy & windy=true then class=no

  Rules to a tree is harder. Given the rule set:

    If outlook=sunny & humidity=high then class=no
    If humidity=normal then class=yes
    If outlook=overcast then class=yes
    If outlook=rainy & windy=true then class=no

  Question: what should the tree (“?”) predict for
    outlook=rainy & windy=true & humidity=normal → ?
    outlook=rainy & windy=false & humidity=high → ?

       31                                                                  Classification Rules
Classification Rule: Algorithms
• Two main algorithms are:
• Inferring rudimentary rules
  - 1R: 1-level decision tree
• Covering algorithms:
  - An algorithm to construct the rules
  - Pruning rules & computing significance
    - Hypergeometric distribution vs. binomial distribution
  - Incremental reduced-error pruning




    32                                                  Classification Rules
(Holte, 93)

Inferring Rudimentary Rules (1R rule)
• 1R learns a 1-level decision tree
  - Generate a set of rules that all test on one particular attribute
  - Focus on each attribute in turn
• Pseudo-code (a Python sketch follows this slide):

    For each attribute,
      For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
      Calculate the error rate of the rules
    Choose the rules with the smallest error rate

• Note: “missing” can be treated as a separate attribute value
• 1R's simple rules performed not much worse than much more complex
  decision trees.

    33                                                           Classification Rules
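
A minimal sketch of 1R in Python, using the same (attribute-dict, label)
representation as the earlier tree sketch (all names are this sketch's own):

    from collections import Counter, defaultdict

    def one_r(instances, attributes):
        """1R: one rule per attribute value; keep the attribute whose rule
        set makes the fewest errors on the training data."""
        best = None
        for attr in attributes:
            counts = defaultdict(Counter)           # value -> class frequencies
            for x, c in instances:
                counts[x[attr]][c] += 1             # how often each class appears
            rules = {v: cnt.most_common(1)[0][0]    # most frequent class per value
                     for v, cnt in counts.items()}
            errors = sum(sum(cnt.values()) - max(cnt.values())
                         for cnt in counts.values())
            if best is None or errors < best[2]:
                best = (attr, rules, errors)
        return best                                 # (attribute, rules, error count)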
An Example: Evaluating the Weather
Attributes (Nominal, Ordinal)
• The nominal weather data again (14 instances); 1R evaluates one rule set
  per attribute:

  Attribute     Rule                  Errors   Total errors
  Outlook (O)   O = sunny → no         2/5       4/14
                O = overcast → yes     0/4
                O = rainy → yes        2/5
  Temp. (T)     T = hot → no           2/4       5/14
                T = mild → yes         2/6
                T = cool → yes         1/4
  Humidity (H)  H = high → no          3/7       4/14
                H = normal → yes       1/7
  Windy (W)     W = false → yes        2/8       5/14
                W = true → no          3/6

  1R chooses the attribute that produces rules with the smallest
  number of errors, i.e., the rule set of attribute “Outlook” or
  “Humidity”

  34                                                                       Classification Rules
An Example: Evaluating the Weather
  Attributes (Numeric)
  Outlook  Temp. Humidity Windy  Play
  sunny     85     85     false   no
  sunny     80     90     true    no
  overcast  83     86     false   yes
  rainy     70     96     false   yes
  rainy     68     80     false   yes
  rainy     65     70     true    no
  overcast  64     65     true    yes
  sunny     72     95     false   no
  sunny     69     70     false   yes
  rainy     75     80     false   yes
  sunny     75     70     true    yes
  overcast  72     90     true    yes
  overcast  81     75     false   yes
  rainy     71     91     true    no

  Attribute     Rule                     Errors   Total errors
  Outlook (O)   O = sunny → no            2/5       4/14
                O = overcast → yes        0/4
                O = rainy → yes           2/5
  Temp. (T)     T <= 77.5 → yes           3/10      5/14
                T > 77.5 → no             2/4
  Humidity (H)  H <= 82.5 → yes           1/7       3/14
                82.5 < H <= 95.5 → no     2/6
                H > 95.5 → yes            0/1
  Windy (W)     W = false → yes           2/8       5/14
                W = true → no             3/6

  1R chooses the attribute that produces rules with the smallest
  number of errors, i.e., the rule set of attribute “Humidity”
    35                                                              Classification Rules
Dealing with Numeric Attributes
• Numeric attributes are discretized: the range of the
  attribute is divided into a set of intervals
  - Instances are sorted according to the attribute's values
  - Breakpoints are placed where the (majority) class changes
    (so that the total error is minimized)
• Example: Temperature from the weather data, sorted left-to-right
  (a code sketch follows this slide):

  Break wherever the class changes:
    64 | 65 | 68 69 70 | 71 72 72 | 75 75 | 80 | 81 83 | 85
    Y  | N  | Y  Y  Y  | N  N  Y  | Y  Y  | N  | Y  Y  | N

  Require at least 3 instances of the majority class per interval (min = 3):
    64 65 68 69 70 | 71 72 72 75 75 | 80 81 83 85
    Y  N  Y  Y  Y  | N  N  Y  Y  Y  | N  Y  Y  N

  Merge adjacent intervals with the same majority class:
    64 65 68 69 70 71 72 72 75 75 | 80 81 83 85
    Y  N  Y  Y  Y  N  N  Y  Y  Y  | N  Y  Y  N
     36                                                                   Classification Rules
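
A simplified sketch of this discretization in Python. The exact
bucket-growing and tie-breaking details vary between descriptions of 1R, so
treat this as one plausible rendering rather than the canonical procedure:

    from collections import Counter

    def majority(bucket):
        return Counter(c for _, c in bucket).most_common(1)[0][0]

    def discretize_1r(values, labels, min_bucket=3):
        pairs = sorted(zip(values, labels))
        buckets, current = [], []
        for i, (v, c) in enumerate(pairs):
            current.append((v, c))
            maj, count = Counter(c2 for _, c2 in current).most_common(1)[0]
            # Cut once the bucket holds min_bucket majority-class instances
            # and the next instance has a new value and a different class.
            if count >= min_bucket and i + 1 < len(pairs):
                nv, nc = pairs[i + 1]
                if nc != maj and nv != v:
                    buckets.append(current)
                    current = []
        if current:
            buckets.append(current)
        # Merge adjacent buckets that share the same majority class.
        merged = [buckets[0]]
        for b in buckets[1:]:
            if majority(b) == majority(merged[-1]):
                merged[-1].extend(b)
            else:
                merged.append(b)
        # Breakpoints halfway between neighbouring buckets.
        return [(merged[i][-1][0] + merged[i + 1][0][0]) / 2
                for i in range(len(merged) - 1)]

    temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
    play  = "n n y y y n y n y y y y y n".split()
    print(discretize_1r(temps, play))   # [77.5] for the weather data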
Covering Algorithm
• Separate-and-conquer: select the test that maximizes the number of covered
  positive examples and minimizes the number of negative examples that pass
  the test; it usually pays no attention to the examples that do not pass the test.
• Divide-and-conquer: optimize for all outcomes of the test.

• A covering algorithm is a separate-and-conquer algorithm
• Focus on each class in turn
• Seek a way of covering all instances in the class
• More rules could be added for a perfect rule set
• Comparing to a decision tree (DT):
  - A decision tree
    - uses divide-and-conquer
    - focuses on all classes at each step
    - seeks an attribute to split on that best separates the classes
  - A DT can be converted into a rule set
    - Straightforward conversion: the rule set is overly complex
    - More effective conversions are not trivial
  - In multiclass situations, a covering algorithm concentrates on
    one class at a time whereas a DT learner takes all classes into
    account

    37                                                                               Classification Rules
Constructing Classification Rule
 (An Example)

  [Figure not reproduced: an instance space of a's and b's; each panel shows
  the region covered as tests are added]

  Rules so far:
    If x <= 1.2 then class = b
    If x > 1.2 then class = b
  Rule after adding a new test:
    If x > 1.2 & y <= 2.6 then class = b

  The equivalent decision tree:
    x > 1.2?
    ├─ n → b
    └─ y → y > 2.6?
            ├─ n → b
            └─ y → ?

  More rules could be added for a “perfect” rule set
   38
A Simple Covering Algorithm
• Generates a rule by adding tests that maximize the rule's
  accuracy, even though each new test reduces the rule's coverage
• Similar to the situation in decision trees: the problem of selecting
  an attribute to split on
  - A decision tree inducer maximizes overall purity.
  - A covering algorithm maximizes rule accuracy.
• Goal: maximizing accuracy
  - t: total number of instances covered by the rule
  - p: positive examples of the class covered by the rule
  - t − p: number of errors made by the rule
• One option: select the test that maximizes the ratio p/t
• We are finished when p/t = 1 or the set of instances
  cannot be split any further.
    39                                                       Classification Rules
An Example: Contact Lenses Data
  Age             Spectacle     Astigmatism  Tear prod.  Recom.
                  prescription               rate        lenses
  young           myope         no           reduced     none
  young           myope         no           normal      soft
  young           myope         yes          reduced     none
  young           myope         yes          normal      hard
  young           hypermetrope  no           reduced     none
  young           hypermetrope  no           normal      soft
  young           hypermetrope  yes          reduced     none
  young           hypermetrope  yes          normal      hard
  pre-presbyopic  myope         no           reduced     none
  pre-presbyopic  myope         no           normal      soft
  pre-presbyopic  myope         yes          reduced     none
  pre-presbyopic  myope         yes          normal      hard
  pre-presbyopic  hypermetrope  no           reduced     none
  pre-presbyopic  hypermetrope  no           normal      soft
  pre-presbyopic  hypermetrope  yes          reduced     none
  pre-presbyopic  hypermetrope  yes          normal      none
  presbyopic      myope         no           reduced     none
  presbyopic      myope         no           normal      none
  presbyopic      myope         yes          reduced     none
  presbyopic      myope         yes          normal      hard
  presbyopic      hypermetrope  no           reduced     none
  presbyopic      hypermetrope  no           normal      soft
  presbyopic      hypermetrope  yes          reduced     none
  presbyopic      hypermetrope  yes          normal      none

  First, try to find a rule for “hard”

       40                                                                                            Classification Rules
An Example: Contact Lenses Data
(Finding a good choice)

• Rule we seek:
    If ? then recommendation = hard
• Possible tests (correct/covered for class “hard”):
    Age = young                               2/8
    Age = pre-presbyopic                      1/8
    Age = presbyopic                          1/8
    Spectacle prescription = myope            3/12
    Spectacle prescription = hypermetrope     1/12
    Astigmatism = no                          0/12
    Astigmatism = yes                         4/12
    Tear production rate = reduced            0/12
    Tear production rate = normal             4/12
  The last test ties with “Astigmatism = yes” at 4/12.

 41                                            Classification Rules
Modified Rule and Resulting Data
• Rule with the best test added:
    If astigmatism = yes then recommendation = hard

  [Table not reproduced: the contact lens data again, with the 12 rows
  matching the rule underlined]

• The underlined rows match the rule.
• However, we need to refine the rule, since not all of them are correct
  according to the rule.

 42                                                          Classification Rules
Further Refinement
• Current state:
    If astigmatism = yes and ? then recommendation = hard
• Possible tests:
    Age = young                               2/4
    Age = pre-presbyopic                      1/4
    Age = presbyopic                          1/4
    Spectacle prescription = myope            3/6
    Spectacle prescription = hypermetrope     1/6
    Tear production rate = reduced            0/6
    Tear production rate = normal             4/6




43                                                    Classification Rules
Modified Rule and Resulting Data
• Rule with the best test added:
    If astigmatism = yes and tear prod. rate = normal then
      recommendation = hard

  [Table not reproduced: the contact lens data, with the 6 rows matching the
  rule underlined]

• The underlined rows match the rule.
• However, we still need to refine the rule, since not all of them are
  correct according to the rule.

 44                                       Classification Rules: Covering Algorithm
Further Refinement
• Current state:
    If astigmatism = yes and tear prod. rate = normal and ?
      then recommendation = hard
• Possible tests:
    Age = young                               2/2
    Age = pre-presbyopic                      1/2
    Age = presbyopic                          1/2
    Spectacle prescription = myope            3/3
    Spectacle prescription = hypermetrope     1/3
• Tie between the first and the fourth test (both have p/t = 1)
  - We choose the one with greater coverage

 45                                                   Classification Rules
Modified Rule and Resulting Data
• Final rule with the best test added:
    If astigmatism = yes and tear prod. rate = normal and
      spectacle prescription = myope then recommendation = hard

  [Table not reproduced: the contact lens data, with the 3 rows matching the
  rule highlighted]

• The highlighted rows match the rule.
• All three rows are “hard”.
• No need to refine the rule further, since the rule is now perfect.

 46                                                          Classification Rules
Finding More Rules
• A second rule for recommending “hard lenses” is built from the
  instances not covered by the first rule:
    If age = young and astigmatism = yes and
       tear production rate = normal then recommendation = hard
• These two rules cover all “hard lenses”:
    (1) If astigmatism = yes & tear prod. rate = normal & spectacle prescr. = myope
        then recommendation = hard
    (2) If age = young and astigmatism = yes and tear production rate = normal
        then recommendation = hard
• The process is repeated with the other two classes, i.e., “soft
  lenses” and “none”.

    47                                                              Classification Rules
Pseudo-code for PRISM Algorithm
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that
      predicts class C
    Until R is perfect (or there are no more
      attributes to use) do
      For each attribute A not mentioned in R, and
        each value v,
        Consider adding the condition A = v to the
          left-hand side of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with
        the largest p)
      Add A = v to R
    Remove the instances covered by R from E
48                                          Classification Rules
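
A runnable sketch of this pseudo-code, reusing the (attribute-dict, label)
representation from the earlier sketches (all names are this sketch's own):

    def prism(instances, attributes, cls):
        """PRISM sketch for one class cls; returns a list of rules,
        each rule a dict {attribute: value} of conditions."""
        E = list(instances)
        rules = []
        while any(c == cls for _, c in E):
            rule, covered = {}, E
            # Grow R until it is perfect or no more attributes to use.
            while (any(c != cls for _, c in covered)
                   and len(rule) < len(attributes)):
                best, best_score = None, (-1.0, -1)  # (p/t, p): ties -> largest p
                for attr in attributes:
                    if attr in rule:
                        continue
                    for v in {x[attr] for x, _ in covered}:
                        subset = [c for x, c in covered if x[attr] == v]
                        p, t = sum(1 for c in subset if c == cls), len(subset)
                        if (p / t, p) > best_score:
                            best, best_score = (attr, v), (p / t, p)
                rule[best[0]] = best[1]
                covered = [(x, c) for x, c in covered if x[best[0]] == best[1]]
            rules.append(rule)
            # Remove the instances covered by R from E.
            E = [(x, c) for x, c in E
                 if not all(x.get(a) == v for a, v in rule.items())]
        return rules

    # Hypothetical usage on the contact lens data:
    # rules = prism(instances, ["age", "prescription", "astigmatism",
    #                           "tear_rate"], "hard")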
Order Dependency among Rules
• PRISM without the outer loop generates a decision list for
  one class
  - Subsequent rules are designed for instances that are not
    covered by previous rules
  - Here, order does not matter because all rules predict the
    same class
• The outer loop considers all classes separately
  - No order dependence is implied
• Two problems are
  - overlapping rules
  - a default rule is required

49                                                     Classification Rules
Separate-and-Conquer
• Methods like PRISM (for dealing with one class) are separate-and-
  conquer algorithms:
  - First, a rule is identified
  - Then, all instances covered by the rule are separated out
  - Finally, the remaining instances are “conquered”
• Difference to divide-and-conquer methods:
  - The subset covered by a rule doesn't need to be explored any further
• There is variety in the separate-and-conquer approach:
  - Search method (e.g. greedy, beam search, ...)
  - Test selection criteria (e.g. accuracy, ...)
  - Pruning method (e.g. MDL, hold-out set, ...)
  - Stopping criterion (e.g. minimum accuracy)
  - Post-processing step
• Also: decision list vs. one rule set for each class

    50                                                                    Classification Rules
Good Rules and Bad Rules
(overview)
• Sometimes it is better not to generate perfect rules that guarantee the
  correct classification on all training instances, in order to avoid overfitting.
• How do we decide which rules are worthwhile?
• How do we tell when it becomes counterproductive to continue adding terms to
  a rule to exclude a few pesky instances of the wrong type?
• Two main strategies for pruning rules:
  - Global pruning (post-pruning): create all perfect rules, then prune
  - Incremental pruning (pre-pruning): prune a rule while generating it
• Three pruning criteria:
  - MDL principle (Minimum Description Length): rule size + exceptions
  - Statistical significance → INDUCT
  - Error on a hold-out set (reduced-error pruning)


    51                                                                 Classification Rules
Hypergeometric Distribution
• The dataset contains T examples; the class contains P examples
• The rule selects t examples; the p examples out of the t examples
  selected by the rule are correctly covered
• For a random rule this count is hypergeometric: choose p of the P
  in-class examples and t−p of the T−P remaining ones, out of the
  C(T, t) ways of choosing t examples from T

52                                         Classification Rules
Computing Significance
• We want the probability that a random rule does
  at least as well (the statistical significance of the rule):

              min(t,P)
    m(R)  =     Σ      C(P, i) · C(T−P, t−i) / C(T, t)
               i=p

  where C(n, k) = n! / (k! · (n−k)!) is the binomial coefficient
  (a code sketch follows this slide)

    53                                                           Classification Rules
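
A direct rendering in Python using math.comb, which returns 0 when k > n,
so the edge cases take care of themselves; the printed values match the
next slide:

    from math import comb

    def significance(p, t, P, T):
        """m(R): probability that a random rule selecting t of the T
        instances covers at least p of the P in-class ones
        (hypergeometric upper tail)."""
        return sum(comb(P, i) * comb(T - P, t - i)
                   for i in range(p, min(t, P) + 1)) / comb(T, t)

    print(significance(4, 12, 4, 24))  # 0.0465... -> the 0.047 of rule 1
    print(significance(4, 6, 4, 24))   # 0.0014... -> rule 2, the best rule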
Good/Bad Rules by Statistical significance
(An Example)

1  If astigmatism = yes then recommendation = hard
   success fraction = 4/12
   no-information success fraction = 4/24
   probability of getting ≥ 4/12 given 4/24 (P = p = 4, T = 24, t = 12):
     m(R) = C(4,4)·C(20,8) / C(24,12) = (20! · 12!) / (8! · 24!) = 0.047

2  If astigmatism = yes and
      tear production rate = normal then recommendation = hard
   success fraction = 4/6
   no-information success fraction = 4/24
   probability of getting ≥ 4/6 given 4/24 = 0.0014      ← the best rule
   (reduced probability, 0.047 → 0.0014, means a better rule)

3  If astigmatism = yes and tear prod. rate = normal and age = young
      then recommendation = hard
   success fraction = 2/2
   no-information success fraction = 4/24
   probability of getting ≥ 2/2 given 4/24 = 0.022
   (increased probability, 0.0014 → 0.022, means a worse rule)

 54                                                          Classification Rules
Good/Bad Rules by Statistical significance
(Another Example)

4  If astigmatism = yes and tear production rate = normal
      then recommendation = none
   success fraction = 2/6
   no-information success fraction = 15/24
   probability of getting ≥ 2/6 given 15/24 = 0.985    ← bad rule (high probability)

5  If astigmatism = no and tear production rate = normal
      then recommendation = soft
   success fraction = 5/6
   no-information success fraction = 5/24
   probability of getting ≥ 5/6 given 5/24 = 0.0001    ← good rule (low probability)

6  If tear production rate = reduced then recommendation = none
   success fraction = 12/12
   no-information success fraction = 15/24
   probability of getting ≥ 12/12 given 15/24 = 0.0017

    55                                                        Classification Rules
The Binomial Distribution
• Approximation: we can use sampling with replacement instead of sampling
  without replacement
• The dataset contains T examples; the class contains P examples; the rule
  selects t examples, of which p are correctly covered
• Each of the t draws is then an independent success with probability P/T:

              min(t,P)
    m(R)  =     Σ      C(t, i) · (P/T)^i · (1 − P/T)^(t−i)
               i=p

     56                                                            Classification Rules
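
The binomial version drops straight into the same helper shape; the
summation limit below mirrors the slide's min(t, P):

    from math import comb

    def significance_binomial(p, t, P, T):
        """Binomial approximation of m(R): t draws with replacement, each
        an in-class success with probability P/T."""
        q = P / T
        return sum(comb(t, i) * q ** i * (1 - q) ** (t - i)
                   for i in range(p, min(t, P) + 1))

    print(significance_binomial(4, 12, 4, 24))  # 0.0888... vs the exact 0.0465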
Pruning Strategies
• For a better estimate, a rule should be evaluated on data not used for
  training.
• This requires a growing set and a pruning set
• Two options are
  - Reduced-error pruning for rules: build a full unpruned rule set and
    simplify it subsequently
  - Incremental reduced-error pruning: simplify each rule immediately after it
    has been built.




    57                                                          Classification Rules
INDUCT (Incremental Pruning Algorithm)
Initialize E to the instance set
Until E is empty do
  For each class C for which E contains an instance
    Use the basic covering algorithm to create the best perfect rule for C
    Calculate the significance m(R) for the rule and the significance
      m(R-) for the rule with its final condition omitted
    If m(R-) < m(R), prune the rule and repeat the previous step
  From the rules for the different classes, select
    the most significant one (i.e. the one with the smallest m(R))
  Print the rule
  Remove the instances covered by the rule from E
Continue

INDUCT's significance computation for a rule:
• Probability of a completely random rule with the same coverage performing
  at least as well.
• A random rule R selects t cases at random from the dataset.
• We want to know how likely it is that p of these belong to the correct class.
• This probability is given by the hypergeometric distribution.

     58                                                                    Classification Rules
Example:
The classification task is to predict whether a customer will buy a computer.

       RID   age          income   student   credit_rating   class: buys_computer
       1     youth        high     no        fair            no
       2     youth        high     no        excellent       no
       3     middle_age   high     no        fair            yes
       4     senior       medium   no        fair            yes
       5     senior       low      yes       fair            yes
       6     senior       low      yes       excellent       no
       7     middle_age   low      yes       excellent       yes
       8     youth        medium   no        fair            no
       9     youth        low      yes       fair            yes
       10    senior       medium   yes       fair            yes
       11    youth        medium   yes       excellent       yes
       12    middle_age   medium   no        excellent       yes
       13    middle_age   high     yes       fair            yes
       14    senior       medium   no        excellent       no

 59
                                            Data Warehousing and Data Mining by Kritsada Sriphaew
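
As a closing exercise, a rule for this table can be grown with the covering (PRISM-style) algorithm from earlier in the deck. The sketch below is our own minimal rendering, not code from the slides: it greedily adds the (attribute, value) test that maximizes accuracy p/t, breaking ties by larger coverage p; a full run would remove the covered instances and repeat for each class.

```python
# Minimal PRISM-style covering sketch (ours, not the deck's code),
# grown on the buys_computer table above.
DATA = [
    ("youth",      "high",   "no",  "fair",      "no"),
    ("youth",      "high",   "no",  "excellent", "no"),
    ("middle_age", "high",   "no",  "fair",      "yes"),
    ("senior",     "medium", "no",  "fair",      "yes"),
    ("senior",     "low",    "yes", "fair",      "yes"),
    ("senior",     "low",    "yes", "excellent", "no"),
    ("middle_age", "low",    "yes", "excellent", "yes"),
    ("youth",      "medium", "no",  "fair",      "no"),
    ("youth",      "low",    "yes", "fair",      "yes"),
    ("senior",     "medium", "yes", "fair",      "yes"),
    ("youth",      "medium", "yes", "excellent", "yes"),
    ("middle_age", "medium", "no",  "excellent", "yes"),
    ("middle_age", "high",   "yes", "fair",      "yes"),
    ("senior",     "medium", "no",  "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def grow_rule(instances, target):
    """Greedily add the test maximizing p/t (ties: larger p) until the
    rule is perfect for `target` or no attribute is left."""
    rule, covered = [], list(instances)
    while covered and any(row[-1] != target for row in covered):
        best = None   # (score, condition, covered subset)
        for a in range(len(ATTRS)):
            if any(cond[0] == a for cond in rule):
                continue                        # attribute already tested
            for v in {row[a] for row in covered}:
                subset = [row for row in covered if row[a] == v]
                p = sum(row[-1] == target for row in subset)
                score = (p / len(subset), p)    # accuracy first, then coverage
                if best is None or score > best[0]:
                    best = (score, (a, v), subset)
        if best is None:
            break                               # no attributes left to test
        rule.append(best[1])
        covered = best[2]
    return rule, covered

rule, covered = grow_rule(DATA, "yes")
print("if", " and ".join(f"{ATTRS[a]} = {v}" for a, v in rule),
      "then buys_computer = yes   # covers", len(covered))
```

On this data the first perfect rule found is "if age = middle_age then buys_computer = yes", which covers four instances, all of them positive.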