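Classification and Prediction

[Figure: The Course. Data sources (DS), a staging database (DP), a data warehouse (DW), OLAP, and data mining (DM) with its tasks: association, classification, and clustering.]

DS = Data source
DW = Data warehouse
DM = Data mining
DP = Staging database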
Chapter Objectives
   Learn basic techniques for data classification
    and prediction.
   Realize the difference between the following
    classifications of data:
    – supervised classification
    – prediction
    – unsupervised classification
Chapter Outline
   What is classification and prediction of data?
   How do we classify data by decision tree induction?
   What are neural networks and how can they classify?
   What is Bayesian classification?
   Are there other classification techniques?
   How do we predict continuous values?
What is Classification?
   The goal of data classification is to organize and
    categorize data in distinct classes.
    – A model is first created based on the data
      distribution.
    – The model is then used to classify new data.
    – Given the model, a class can be predicted for new
      data.

   Classification = prediction for discrete and nominal
    values
What is Prediction?
   The goal of prediction is to forecast or deduce the value of an
    attribute based on values of other attributes.
     – A model is first created based on the data distribution.
     – The model is then used to predict future or unknown values


   In Data Mining
     – If forecasting discrete value → Classification
     – If forecasting continuous value → Prediction
Supervised and Unsupervised
   Supervised Classification = Classification
    – We know the class labels and the number of
      classes


   Unsupervised Classification = Clustering
    – We do not know the class labels and may not
      know the number of classes
Preparing Data Before Classification
   Data transformation:
    – Discretization of continuous data
    – Normalization to [-1..1] or [0..1]
   Data Cleaning:
    – Smoothing to reduce noise
   Relevance Analysis:
    – Feature selection to eliminate irrelevant attributes
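
The two data transformations above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the sample income values are assumptions chosen for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant attribute: map every value to the middle of the target range.
        return [(new_min + new_max) / 2 for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def equal_width_bins(values, n_bins):
    """Discretize by equal-interval bucketing; returns one bin index per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # fall back to 1.0 for a constant attribute
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


incomes = [2000, 4000, 6000, 10000]            # hypothetical sample values
print(min_max_normalize(incomes))              # rescaled to [0, 1]
print(min_max_normalize(incomes, -1.0, 1.0))   # rescaled to [-1, 1]
print(equal_width_bins(incomes, 2))            # two equal-width income ranges
```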
Application
   Credit approval
   Target marketing
   Medical diagnosis
   Defective parts identification in manufacturing
   Crime zoning
   Treatment effectiveness analysis
   Etc
Classification is a 3-step process
   1. Model construction (Learning):
        • Each tuple is assumed to belong to a predefined class, as
          determined by one of the attributes, called the class label.
        • The set of all tuples used for construction of the model is
          called training set.


    – The model is represented in the following forms:
        • Classification rules, (IF-THEN statements),
        • Decision tree
        • Mathematical formulae
1. Classification Process (Learning)
Training Data (class attribute: Credit rating)

Name     Income   Age        Credit rating
Samir    Low      <30        bad
Ahmed    Medium   [30..40]   good
Salah    High     <30        good
Ali      Medium   >40        good
Sami     Low      [30..40]   good
Emad     Medium   <30        bad

Training Data → Classification Method → Classification Model

The Classification Model can be:
   IF Income = ‘High’ OR Age > 30 THEN Class = ‘Good’
   OR a Decision Tree
   OR a Mathematical Formula
Classification is a 3-step process
2. Model Evaluation (Accuracy):
   – Estimate accuracy rate of the model based on a test set.
   – The known label of test sample is compared with the
     classified result from the model.
   – Accuracy rate is the percentage of test set samples that are
     correctly classified by the model.
   – The test set must be independent of the training set; otherwise
     the accuracy estimate is over-optimistic and over-fitting goes
     unnoticed
2. Classification Process (Accuracy Evaluation)

Test Data → Classification Model → predicted class (column Model), compared with the known class (column Credit rating)

Name    Income   Age        Credit rating   Model
Naser   Low      <30        bad             bad
Lutfi   Medium   <30        bad             good
Adel    High     >40        good            good
Fahd    Medium   [30..40]   good            good

Accuracy = 3/4 = 75%
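
The accuracy rate in this example can be checked with a short sketch. The function name `accuracy` is an assumption made for illustration; the labels are the four test tuples from the table.

```python
def accuracy(actual, predicted):
    """Fraction of test samples whose predicted class equals the known class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Known credit ratings and model output for Naser, Lutfi, Adel, Fahd.
actual = ["bad", "bad", "good", "good"]
predicted = ["bad", "good", "good", "good"]
print(f"{accuracy(actual, predicted):.0%}")   # 75%
```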
Classification is a three-step process

3. Model Use (Classification):
   – The model is used to classify unseen objects.
      • Give a class label to a new tuple
      • Predict the value of an actual attribute
3. Classification Process (Use)

New tuple → Classification Model → Credit rating?

Name    Income   Age    Credit rating
Adham   Low      <30    ?
Classification Methods


    Decision Tree Induction
    Neural Networks
    Bayesian Classification
    Association-Based Classification
    K-Nearest Neighbour
    Case-Based Reasoning
    Genetic Algorithms
    Rough Set Theory
    Fuzzy Sets
    Etc.
Evaluating Classification Methods
   Predictive accuracy
     – Ability of the model to correctly predict the class label
   Speed
     – Time to construct the model
     – Time to use the model
   Robustness
     – Handling noise and missing values
   Scalability
     – Efficiency in large databases (not memory resident data)
   Interpretability:
     – The level of understanding and insight provided by the
       model
Decision Tree
What is a Decision Tree?
   A decision tree is a flow-chart-like tree structure.
     – Internal node denotes a test on an attribute
     – Branch represents an outcome of the test
        • All tuples in branch have the same value for the tested
          attribute.

   Leaf node represents class label or class label
    distribution
Sample Decision Tree
[Figure: plot of Age (20 to 80) against Income (2000 to 10000) showing excellent customers and fair customers.]

Income
   < 6K  → No
   >= 6K → Yes
Sample Decision Tree

[Figure: plot of Age (20 to 80) against Income (2000 to 10000).]

Income
   < 6K  → No
   >= 6K → Age
              < 50  → No
              >= 50 → Yes
Sample Decision Tree
Outlook    Temp    Humidity   Windy       Play?
sunny      hot     high       FALSE       No
sunny      hot     high       TRUE        No
overcast   hot     high       FALSE       Yes
rainy      mild    high       FALSE       Yes
rainy      cool    normal     FALSE       Yes
rainy      cool    normal     TRUE        No
overcast   cool    normal     TRUE        Yes
sunny      mild    high       FALSE       No
sunny      cool    normal     FALSE       Yes
rainy      mild    normal     FALSE       Yes
sunny      mild    normal     TRUE        Yes
overcast   mild    high       TRUE        Yes
overcast   hot     normal     FALSE       Yes
rainy      mild    high       TRUE        No



 http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html
 http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
Decision-Tree Classification Methods

   The basic top-down decision tree generation
    approach usually consists of two phases:
    1. Tree construction
       • At the start, all the training examples are at the root.
       • Examples are partitioned recursively based on selected
          attributes.

    2. Tree pruning
       • Aims to remove tree branches that may reflect noise
          in the training data and lead to errors when classifying
          test data → improves classification accuracy
How to Specify Test Condition?
   Depends on attribute types
     – Nominal
     – Ordinal
     – Continuous

   Depends on number of ways to split
     – 2-way split
     – Multi-way split
Splitting Based on Nominal Attributes

   Multi-way split: Use as many partitions as distinct
    values.

       CarType → {Family} | {Sports} | {Luxury}



   Binary split: Divides values into two subsets.
    Need to find optimal partitioning.

       CarType → {Sports, Luxury} | {Family}
       OR
       CarType → {Family, Luxury} | {Sports}
Splitting Based on Ordinal Attributes
   Multi-way split: Use as many partitions as distinct
    values.
       Size → {Small} | {Medium} | {Large}

   Binary split: Divides values into two subsets.
    Need to find optimal partitioning.
       Size → {Small, Medium} | {Large}
       OR
       Size → {Medium, Large} | {Small}


   What about this split?

       Size → {Small, Large} | {Medium}
Splitting Based on Continuous Attributes

   Different ways of handling
     – Discretization to form an ordinal categorical
       attribute
         • Static – discretize once at the beginning
         • Dynamic – ranges can be found by equal
           interval bucketing, equal frequency bucketing
           (percentiles), or clustering.

    – Binary Decision: (A < v) or (A ≥ v)
       • considers all possible splits and finds the best cut
       • can be more compute intensive
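
The binary decision (A < v) or (A ≥ v) can be sketched as an exhaustive search over candidate cut points. This is a minimal sketch under stated assumptions: candidates are taken as midpoints between consecutive distinct values, the score is the weighted entropy of the two sides, and the ages and labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits; p * log2(1/p) is the same as -p * log2(p)."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def best_binary_split(values, labels):
    """Return (v, weighted entropy) for the best test A < v, or None."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        v = (lo + hi) / 2                       # candidate cut point
        left = [c for a, c in pairs if a < v]
        right = [c for a, c in pairs if a >= v]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (v, score)
    return best


ages = [22, 25, 31, 38, 45, 52]                      # hypothetical values
labels = ["no", "no", "yes", "yes", "yes", "yes"]    # hypothetical classes
print(best_binary_split(ages, labels))               # (28.0, 0.0)
```

Every distinct value produces one candidate, and each candidate rescans the data, which is why this option is more compute intensive than a one-off discretization.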
Tree Induction
   Greedy strategy.
    – Split the records based on an attribute test that
      optimizes certain criterion.

   Issues
     – Determine how to split the records
        • How to specify the attribute test condition?
        • How to determine the best split?
     – Determine when to stop splitting
How to determine the Best Split
[Figure: a node of customers (good customers and fair customers) with two candidate splits.]

   Customers split on Income: < 10k | >= 10k
   Customers split on Age:    young | old
How to determine the Best Split
   Greedy approach:
    – Nodes with homogeneous class distribution are
      preferred

   Need a measure of node impurity:



     50% red, 50% green  → high degree of impurity
     75% red, 25% green  → low degree of impurity
     100% red, 0% green  → pure
Measures of Node Impurity

   Information gain
    – Uses Entropy

   Gain Ratio
    – Uses Information Gain and SplitInfo

   Gini Index
    – Used only for binary splits
Algorithm for Decision Tree Induction
   Basic algorithm (a greedy algorithm)
     – Tree is constructed in a top-down recursive divide-and-conquer
       manner
     – At start, all the training examples are at the root
     – Attributes are categorical (if continuous-valued, they are discretized
       in advance)
     – Examples are partitioned recursively based on selected attributes
     – Test attributes are selected on the basis of a heuristic or statistical
       measure (e.g., information gain)
   Conditions for stopping partitioning
     – All samples for a given node belong to the same class
     – There are no remaining attributes for further partitioning – majority
       voting is employed for classifying the leaf
     – There are no samples left
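
The basic algorithm can be sketched as a short recursive function. This is an illustrative sketch rather than a reference implementation: the names `build_tree`, `information_gain` and `majority` are assumptions, the tree is represented as nested dictionaries, and the training data is the 14-example weather table used in these slides.

```python
from collections import Counter
from math import log2

COLUMNS = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = """sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no"""
ROWS = [dict(zip(COLUMNS, line.split())) for line in DATA.splitlines()]


def entropy(rows, target):
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return sum((c / n) * log2(n / c) for c in counts.values())


def information_gain(rows, attr, target):
    n = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(rows, target) - remainder


def majority(rows, target):
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def build_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1:          # stop: all samples belong to the same class
        return classes.pop()
    if not attributes:             # stop: no attributes left, use majority voting
        return majority(rows, target)
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # Only values present in rows are expanded, so the "no samples left"
    # case never arises in this sketch.
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}


tree = build_tree(ROWS, COLUMNS[:-1], "play")
print(tree)
```

On this data the sketch selects outlook at the root, then humidity under sunny and windy under rainy.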
Classification Algorithms
 ID3
  – Uses information gain
 C4.5
  – Uses Gain Ratio

 CART
  – Uses Gini
Entropy: Used by ID3




             Entropy(S) = - p log2 p - q log2 q



  Entropy measures the impurity of S
  S is a set of examples
  p is the proportion of positive examples
  q is the proportion of negative examples
ID3
outlook    temperature   humidity   windy   play
sunny      hot           high       FALSE   no
sunny      hot           high       TRUE    no
overcast   hot           high       FALSE   yes
rainy      mild          high       FALSE   yes
rainy      cool          normal     FALSE   yes
rainy      cool          normal     TRUE    no
overcast   cool          normal     TRUE    yes
sunny      mild          high       FALSE   no
sunny      cool          normal     FALSE   yes
rainy      mild          normal     FALSE   yes
sunny      mild          normal     TRUE    yes
overcast   mild          high       TRUE    yes
overcast   hot           normal     FALSE   yes
rainy      mild          high       TRUE    no

p_yes = 9/14 (play)
p_no  = 5/14 (don’t play)

Impurity = - p_yes log2 p_yes - p_no log2 p_no
         = - 9/14 log2 9/14 - 5/14 log2 5/14
         = 0.94 bits
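
The impurity of the root can be reproduced with a small sketch of the two-class entropy formula. The function name `binary_entropy` is an assumption; the convention 0 · log2 0 = 0 is applied by skipping zero proportions.

```python
from math import log2


def binary_entropy(p):
    """Entropy(S) = -p log2 p - q log2 q, with q = 1 - p, in bits."""
    q = 1.0 - p
    # x * log2(1/x) equals -x * log2(x); zero proportions contribute nothing.
    return sum(x * log2(1 / x) for x in (p, q) if x > 0)


print(round(binary_entropy(9 / 14), 2))   # 0.94 (9 play, 5 don't play)
print(binary_entropy(0.5))                # 1.0  (maximally impure)
print(binary_entropy(1.0))                # 0.0  (pure)
```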
ID3
Entropy at the root: 0.94 bits

Candidate splits at the root. Entropy is the amount of information required to specify the class of an example given that it reaches the node.

Attribute     Value      play   don't play   Entropy     Weight
outlook       sunny      2      3            0.97 bits   5/14
              overcast   4      0            0.0 bits    4/14
              rainy      3      2            0.97 bits   5/14
              weighted sum = 0.69 bits, gain: 0.25 bits (maximal information gain)
humidity      high       3      4            0.98 bits   7/14
              normal     6      1            0.59 bits   7/14
              weighted sum = 0.79 bits, gain: 0.15 bits
temperature   hot        2      2            1.0 bits    4/14
              mild       4      2            0.92 bits   6/14
              cool       3      1            0.81 bits   4/14
              weighted sum = 0.91 bits, gain: 0.03 bits
windy         false      6      2            0.81 bits   8/14
              true       3      3            1.0 bits    6/14
              weighted sum = 0.89 bits, gain: 0.05 bits
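
The gains for the four candidate attributes can be recomputed directly from the (play, don't play) counts. This is a small sketch; the helper names `entropy` and `gain` and the dictionary layout are assumptions made for the example.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a class-count tuple, e.g. (9, 5)."""
    n = sum(counts)
    return sum((c / n) * log2(n / c) for c in counts if c > 0)


def gain(root_counts, branch_counts):
    """Information gain = root entropy minus the weighted branch entropies."""
    n = sum(root_counts)
    expected = sum(sum(b) / n * entropy(b) for b in branch_counts)
    return entropy(root_counts) - expected


ROOT = (9, 5)   # 9 play, 5 don't play
SPLITS = {      # (play, don't play) per attribute value
    "outlook": [(2, 3), (4, 0), (3, 2)],
    "humidity": [(3, 4), (6, 1)],
    "temperature": [(2, 2), (4, 2), (3, 1)],
    "windy": [(6, 2), (3, 3)],
}

gains = {name: gain(ROOT, branches) for name, branches in SPLITS.items()}
for name, g in gains.items():
    print(f"{name}: {g:.3f} bits")
print("split on:", max(gains, key=gains.get))
```

To three decimals the gains are 0.247, 0.152, 0.029 and 0.048 bits, which round to the figures shown on the slide.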
ID3
Root split: outlook (branches: sunny, overcast, rainy)

Examples reaching the sunny branch (0.97 bits):

outlook   temperature   humidity   windy   play
sunny     hot           high       FALSE   no
sunny     hot           high       TRUE    no
sunny     mild          high       FALSE   no
sunny     cool          normal     FALSE   yes
sunny     mild          normal     TRUE    yes

Attribute     Value    Entropy     Weight
humidity      high     0.0 bits    3/5
              normal   0.0 bits    2/5
              weighted sum = 0.0 bits, gain: 0.97 bits (maximal information gain)
temperature   hot      0.0 bits    2/5
              mild     1.0 bits    2/5
              cool     0.0 bits    1/5
              weighted sum = 0.40 bits, gain: 0.57 bits
windy         false    0.92 bits   3/5
              true     1.0 bits    2/5
              weighted sum = 0.95 bits, gain: 0.02 bits
ID3
Root split: outlook (branches: sunny, overcast, rainy); the sunny branch is split on humidity (high, normal)

Examples reaching the rainy branch (0.97 bits):

outlook   temperature   humidity   windy   play
rainy     mild          high       FALSE   yes
rainy     cool          normal     FALSE   yes
rainy     cool          normal     TRUE    no
rainy     mild          normal     FALSE   yes
rainy     mild          high       TRUE    no

Attribute     Value    Entropy     Weight
humidity      high     1.0 bits    2/5
              normal   0.92 bits   3/5
              weighted sum = 0.95 bits, gain: 0.02 bits
temperature   hot      ∅ (no examples)
              mild     0.92 bits   3/5
              cool     1.0 bits    2/5
              weighted sum = 0.95 bits, gain: 0.02 bits
windy         false    0.0 bits    3/5
              true     0.0 bits    2/5
              weighted sum = 0.0 bits, gain: 0.97 bits
ID3
Final tree for the 14 training examples (Yes = play, No = don’t play):

outlook
   sunny    → humidity
                 high   → No
                 normal → Yes
   overcast → Yes
   rainy    → windy
                 false → Yes
                 true  → No
C4.5
   Information gain measure is biased towards attributes with a large
    number of values
   C4.5 (a successor of ID3) uses gain ratio to overcome the problem
    (normalization to information gain)
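     – GainRatio(A) = Gain(A)/SplitInfo(A)

       \[ SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right) \]

   Ex. The attribute income splits the 14 tuples into partitions of 4, 6 and 4:

       \[ SplitInfo_{income}(D) = -\frac{4}{14}\log_2\left(\frac{4}{14}\right) - \frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{4}{14}\log_2\left(\frac{4}{14}\right) = 1.557 \]

     – gain_ratio(income) = 0.029/1.557 = 0.019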
   The attribute with the maximum gain ratio is selected as the
    splitting attribute
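
A short sketch of the gain ratio computation follows. It reuses the partition sizes and rounded gains of the weather attributes from the ID3 slides; the function names `split_info` and `gain_ratio` are assumptions for illustration.

```python
from math import log2


def split_info(partition_sizes):
    """SplitInfo_A(D): entropy of the partition sizes |D_j| / |D|, in bits."""
    n = sum(partition_sizes)
    return sum((s / n) * log2(n / s) for s in partition_sizes if s > 0)


def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    return gain / split_info(partition_sizes)


# attribute: (gain in bits as rounded on the ID3 slide, partition sizes)
ATTRIBUTES = {
    "outlook": (0.25, [5, 4, 5]),
    "humidity": (0.15, [7, 7]),
    "temperature": (0.03, [4, 6, 4]),
    "windy": (0.05, [8, 6]),
}

for name, (g, sizes) in ATTRIBUTES.items():
    print(f"{name}: SplitInfo = {split_info(sizes):.3f}, "
          f"GainRatio = {gain_ratio(g, sizes):.3f}")
```

A three-way split such as outlook has a SplitInfo above 1.5 bits, while a balanced two-way split has exactly 1 bit, so the many-valued attribute is penalized more.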
CART
   If a data set D contains examples from n classes, gini index,
    gini(D) is defined as

       \[ gini(D) = 1 - \sum_{j=1}^{n} p_j^2 \]

    where p_j is the relative frequency of class j in D
   If a data set D is split on A into two subsets D1 and D2, the gini
    index gini_A(D) is defined as

       \[ gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2) \]

   Reduction in impurity:

       \[ \Delta gini(A) = gini(D) - gini_A(D) \]

   The attribute that provides the smallest gini_A(D) (or the largest
    reduction in impurity) is chosen to split the node (need to
    enumerate all the possible splitting points for each attribute)
CART
   Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
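
       \[ gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459 \]

   Suppose the attribute income partitions D into 10 tuples in D1: {low,
    medium} (7 “yes”, 3 “no”) and 4 in D2: {high} (2 “yes”, 2 “no”)

       \[ gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = \frac{10}{14}(0.42) + \frac{4}{14}(0.5) = 0.443 \]

     The other binary splits give 0.458 ({low, high} and {medium}) and 0.450
     ({medium, high} and {low}), so {low, medium} and {high} is the best
     split on income since its gini index is the lowest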
   All attributes are assumed continuous-valued
   May need other tools, e.g., clustering, to get the possible split
    values
   Can be modified for categorical attributes
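
The gini definitions can be sketched as follows. To stay with data shown in these slides, the example evaluates two binary splits of the weather data (humidity and windy) using the (play, don't play) counts from the ID3 slides; the function names are assumptions.

```python
def gini(counts):
    """gini(D) = 1 - sum of squared class frequencies."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_split(partitions):
    """gini_A(D): size-weighted gini of the partitions D1, D2, ..."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)


ROOT = (9, 5)                       # 9 "yes", 5 "no"
SPLITS = {                          # (play, don't play) per partition
    "humidity": [(3, 4), (6, 1)],   # high, normal
    "windy": [(6, 2), (3, 3)],      # false, true
}

print(f"gini(D) = {gini(ROOT):.3f}")
for name, parts in SPLITS.items():
    g = gini_split(parts)
    print(f"{name}: gini_A(D) = {g:.3f}, reduction = {gini(ROOT) - g:.3f}")
```

Of these two candidates, humidity gives the smaller gini_A(D) and therefore the larger reduction in impurity.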
Comparing Attribute Selection Measures

   The three measures, in general, return good results but
     – Information gain:
         • biased towards multivalued attributes
     – Gain ratio:
         • tends to prefer unbalanced splits in which one partition is
           much smaller than the others
     – Gini index:
         • biased to multivalued attributes
         • has difficulty when # of classes is large
         • tends to favor tests that result in equal-sized partitions
           and purity in both partitions
Other Attribute Selection Measures

   CHAID: a popular decision tree algorithm, measure based on χ2 test for
    independence
   C-SEP: performs better than info. gain and gini index in certain cases
   G-statistics: has a close approximation to χ2 distribution
   MDL (Minimal Description Length) principle (i.e., the simplest solution
    is preferred):
     – The best tree is the one that requires the fewest # of bits to both
       (1) encode the tree, and (2) encode the exceptions to the tree
   Multivariate splits (partition based on multiple variable combinations)
     – CART: finds multivariate splits based on a linear comb. of attrs.
   Which attribute selection measure is the best?
     – Most give good results, none is significantly superior to the others
Underfitting and Overfitting

[Figure: plot with a region labelled “Overfitting”.]

Underfitting: when the model is too simple, both training and
test errors are large
Overfitting due to Noise




Decision boundary is distorted by noise point
Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult
to predict correctly the class labels of that region
- An insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task
Two approaches to avoid Overfitting

   Prepruning:
     – Halt tree construction early—do not split a node if this would result
       in the goodness measure falling below a threshold
     – Difficult to choose an appropriate threshold


   Postpruning:
     – Remove branches from a “fully grown” tree—get a sequence of
       progressively pruned trees
     – Use a set of data different from the training data to decide
       which is the “best pruned tree”
Scalable Decision Tree Induction Methods
   ID3, C4.5, and CART are not efficient when the training set
    doesn’t fit the available memory. Instead the following algorithms
    are used

     – SLIQ
        • Builds an index for each attribute and only class list and
          the current attribute list reside in memory
     – SPRINT
        • Constructs an attribute list data structure
     – RainForest
        • Builds an AVC-list (attribute, value, class label)
     – BOAT
        • Uses bootstrapping to create several small samples
BOAT

   BOAT (Bootstrapped Optimistic Algorithm for Tree
    Construction)
    – Uses a statistical technique called bootstrapping to create several
       smaller samples (subsets), each of which fits in memory
    – Each subset is used to create a tree, resulting in several trees
    – These trees are examined and used to construct a new tree T’
        • It turns out that T’ is very close to the tree that would be
          generated using the whole data set together
    – Advantages: requires only two scans of the database and is an
       incremental algorithm
Why decision tree induction in data mining?
    Relatively faster learning speed (than other
     classification methods)

    Convertible to simple and easy to understand
     classification rules

    Comparable classification accuracy with other
     methods
Converting Tree to Rules
Outlook
   Sunny    → Humidity
                 High   → No
                 Normal → Yes
   Overcast → Yes
   Rain     → Wind
                 Strong → No
                 Weak   → Yes

 R1: IF (Outlook=Sunny) AND (Humidity=High) THEN Play=No
 R2: IF (Outlook=Sunny) AND (Humidity=Normal) THEN Play=Yes
 R3: IF (Outlook=Overcast) THEN Play=Yes
 R4: IF (Outlook=Rain) AND (Wind=Strong) THEN Play=No
 R5: IF (Outlook=Rain) AND (Wind=Weak) THEN Play=Yes
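
Rules R1 to R5 correspond to the root-to-leaf paths of the tree. The sketch below generates them from a nested-dictionary representation; that representation and the function name `tree_to_rules` are assumptions made for illustration.

```python
def tree_to_rules(tree, target="Play", conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path of a nested-dict tree."""
    if not isinstance(tree, dict):              # leaf: emit the accumulated rule
        body = " AND ".join(f"({a}={v})" for a, v in conditions)
        yield f"IF {body} THEN {target}={tree}"
        return
    (attribute, branches), = tree.items()       # an internal node tests one attribute
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, target, conditions + ((attribute, value),))


TREE = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}

for i, rule in enumerate(tree_to_rules(TREE), start=1):
    print(f"R{i}: {rule}")
```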
Decision trees: The Weka tool

@relation weather.symbolic

@attribute   outlook {sunny, overcast, rainy}
@attribute   temperature {hot, mild, cool}
@attribute   humidity {high, normal}
@attribute   windy {TRUE, FALSE}
@attribute   play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no



http://www.cs.waikato.ac.nz/ml/weka/
Bayesian Classifier




    Thomas Bayes (1702-1761)
Basic Statistics
Assume
• D = All students
• X = ICS students
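• C = SWE students

[Figure: Venn diagram inside D. 6 students are in X only, 4 are in both X and C, 16 are in C only, and the remaining 74 are in neither.]
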
  |X| = 10           P(X) = 10/100     P(X|C) = P(X,C)/P(C) = 4/20
  |C| = 20           P(C) = 20/100     P(C|X) = P(X,C)/P(X) = 4/10
  |D| = 100          P(X,C) = 4/100


              P(X,C) = P(C|X)*P(X) = P(X|C)*P(C)
Bayesian Classifier – Basic Equation

          P(X,C) = P(C|X)*P(X) = P(X|C)*P(C)



       \[ P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)} \]

   P(C | X): class posterior probability
   P(C):     class prior probability
   P(X | C): descriptor posterior probability
   P(X):     descriptor prior probability
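
The basic equation can be checked against the student counts from the Basic Statistics slide (|D| = 100, |X| = 10, |C| = 20, 4 students in both). This is a small sketch using exact fractions; the variable names are assumptions.

```python
from fractions import Fraction

total = 100                        # |D|: all students
n_x, n_c, n_both = 10, 20, 4       # |X|, |C|, and students in both X and C

p_x = Fraction(n_x, total)         # descriptor prior P(X)
p_c = Fraction(n_c, total)         # class prior P(C)
p_x_given_c = Fraction(n_both, n_c)    # P(X|C) = P(X,C)/P(C)

# Bayes: P(C|X) = P(C) * P(X|C) / P(X)
p_c_given_x = p_c * p_x_given_c / p_x
print(p_c_given_x)                 # 2/5, i.e. 4/10 as counted directly
```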
Naive Bayesian Classifier


       \[ P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)} \]

   With X = (x_1, x_2, ..., x_n), for each class C_i, i = 1, ..., m:

       \[ P(C_i \mid X) = P(x_1 \mid C_i)\, P(x_2 \mid C_i)\, P(x_3 \mid C_i) \cdots P(x_n \mid C_i)\, \frac{P(C_i)}{P(X)} \]

      Independence assumption about descriptors
Training Data
Outlook    Temp     Humidity   Windy   Play?
sunny      hot      high       FALSE   No
sunny      hot      high       TRUE    No
overcast   hot      high       FALSE   Yes
rainy      mild     high       FALSE   Yes
rainy      cool     normal     FALSE   Yes
rainy      cool     normal     TRUE    No
overcast   cool     normal     TRUE    Yes
sunny      mild     high       FALSE   No
sunny      cool     normal     FALSE   Yes
rainy      mild     normal     FALSE   Yes
sunny      mild     normal     TRUE    Yes
overcast   mild     high       TRUE    Yes
overcast   hot      normal     FALSE   Yes
rainy      mild     high       TRUE    No



                                               P(yes) = 9/14
                                               P(no) = 5/14
Bayesian Classifier – Probabilities for the weather data

 Frequency Tables

 Outlook  | No   Yes      Temp. | No   Yes      Humidity | No   Yes      Windy | No   Yes
 ---------------------    ------------------    ---------------------    ------------------
 Sunny    | 3    2        Hot   | 2    2        High     | 4    3        False | 2    6
 Overcast | 0    4        Mild  | 2    4        Normal   | 1    6        True  | 3    3
 Rainy    | 2    3        Cool  | 1    3

 Likelihood Tables

 Outlook  | No   Yes      Temp. | No   Yes      Humidity | No   Yes      Windy | No   Yes
 ---------------------    ------------------    ---------------------    ------------------
 Sunny    | 3/5  2/9      Hot   | 2/5  2/9      High     | 4/5  3/9      False | 2/5  6/9
 Overcast | 0/5  4/9      Mild  | 2/5  4/9      Normal   | 1/5  6/9      True  | 3/5  3/9
 Rainy    | 2/5  3/9      Cool  | 1/5  3/9
Bayesian Classifier – Predicting a new day

         Outlook        Temp.        Humidity          Windy         Play
X        sunny          cool         high             true              ?              Class?

    P(yes|X) ∝ p(sunny|yes) x p(cool|yes) x p(high|yes) x p(true|yes) x p(yes)

             = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053

    P(no|X) ∝ p(sunny|no) x p(cool|no) x p(high|no) x p(true|no) x p(no)

            = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206

    Normalizing so that the two values sum to 1:

    P(yes|X) = 0.0053/(0.0053+0.0206) = 0.205
    P(no|X)  = 0.0206/(0.0053+0.0206) = 0.795      => Class = No
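
The prediction for the new day can be reproduced with a few lines of Python. This is a minimal sketch that hard-codes the likelihoods for X = (sunny, cool, high, true) from the likelihood tables; the function and variable names are illustrative only.

```python
from fractions import Fraction as F

# Likelihoods for X = (sunny, cool, high, true), read from the likelihood tables
likelihoods = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no":  [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}
priors = {"yes": F(9, 14), "no": F(5, 14)}

def unnormalized_score(label):
    """Prior times the product of the per-attribute likelihoods."""
    score = priors[label]
    for p in likelihoods[label]:
        score *= p
    return score

scores = {label: unnormalized_score(label) for label in priors}
total = sum(scores.values())                      # plays the role of P(X)
posteriors = {label: float(s / total) for label, s in scores.items()}
prediction = max(posteriors, key=posteriors.get)  # class with the larger posterior
```

Dividing by the sum of the two scores is what turns 0.0053 and 0.0206 into probabilities that add up to 1.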
Bayesian Classifier – zero frequency problem

    What if a descriptor value doesn’t occur with every class value?

                     P(outlook=overcast|No)=0

    Remedy: add 1 to the count for every descriptor-class combination
            (Laplace Estimator)



 Outlook  | No   Yes      Temp. | No   Yes      Humidity | No   Yes      Windy | No   Yes
 ---------------------    ------------------    ---------------------    ------------------
 Sunny    | 3+1  2+1      Hot   | 2+1  2+1      High     | 4+1  3+1      False | 2+1  6+1
 Overcast | 0+1  4+1      Mild  | 2+1  4+1      Normal   | 1+1  6+1      True  | 3+1  3+1
 Rainy    | 2+1  3+1      Cool  | 1+1  3+1
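
A small sketch of the Laplace estimator for the Outlook attribute follows. It assumes the usual normalization, where the denominator is the sum of the incremented counts (class total plus the number of distinct attribute values); the slide itself only shows the incremented counts.

```python
from fractions import Fraction

# Outlook counts per class, from the frequency table
outlook_counts = {
    "no":  {"sunny": 3, "overcast": 0, "rainy": 2},
    "yes": {"sunny": 2, "overcast": 4, "rainy": 3},
}

def laplace_estimate(counts, value):
    """(count + 1) / (class total + number of distinct values)."""
    total = sum(counts.values())
    return Fraction(counts[value] + 1, total + len(counts))

# No longer zero: (0 + 1) / (5 + 3)
p_overcast_no = laplace_estimate(outlook_counts["no"], "overcast")
```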
Bayesian Classifier – General Equation



                                       P ( X | Ck ) P( Ck )
                      P ( Ck | X ) =
                                              P( X )


Likelihood:                    P ( X | Ck )

Continuous variable:  $P(x \mid C) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
Bayesian Classifier – Dealing with numeric attributes
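
For a numeric attribute, the Gaussian density from the general equation can stand in for a table lookup. The sketch below assumes the class-conditional mean and standard deviation have already been estimated from the training data; the numbers used are hypothetical and are not taken from the slides.

```python
import math

def gaussian_likelihood(x, mean, std):
    """Normal density with the given mean and standard deviation."""
    variance = std ** 2
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coefficient * math.exp(-((x - mean) ** 2) / (2.0 * variance))

# Hypothetical values: a reading of 66 for a class with mean 73 and std 6.2
density = gaussian_likelihood(66, 73, 6.2)
```

The returned value is a density, not a probability, but it can be multiplied with the other likelihoods in the same way because only the relative sizes across classes matter.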
Naïve Bayesian Classifier: Comments

 Advantages
   – Easy to implement
   – Good results obtained in most of the cases
 Disadvantages
   – Assumption: class conditional independence, therefore loss of
     accuracy
   – Practically, dependencies exist among variables
      • E.g., hospital patient data: profile (age, family history,
        etc.), symptoms (fever, cough, etc.), disease (lung cancer,
        diabetes, etc.)
      • Dependencies among these cannot be modeled by a Naïve
        Bayesian Classifier
 How to deal with these dependencies?
   – Bayesian Belief Networks
Bayesian Belief Networks

   A Bayesian belief network allows a subset of the variables
    to be conditionally independent
   A graphical model of causal relationships
     – Represents dependency among the variables
     – Gives a specification of joint probability distribution

                              Nodes: random variables
                              Links: dependency
                              (Figure: directed graph with nodes X, Y, Z and P)
                              X and Y are the parents of Z, and Y is
                              the parent of P
                              No dependency between Z and P
                              Has no loops or cycles
Bayesian Belief Network: An Example

  (Figure: network with nodes FamilyHistory, Smoker, LungCancer,
   Emphysema, PositiveXRay and Dyspnea)

  The conditional probability table (CPT) for variable LungCancer:

              (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
      LC        0.8       0.5        0.7        0.1
     ~LC        0.2       0.5        0.3        0.9

  The CPT shows the conditional probability for each possible
  combination of values of its parents.

  Derivation of the probability of a particular combination of
  values of X, from the CPT:

      $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))$
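
The product formula can be illustrated on the three nodes FamilyHistory, Smoker and LungCancer. The CPT values are the ones on the slide; the prior probabilities for FamilyHistory and Smoker, and the assumption that these two nodes have no parents, are hypothetical and only serve to make the sketch runnable.

```python
# CPT for LungCancer from the slide, keyed by (family_history, smoker)
p_lc_given = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}

# Hypothetical priors for the parent nodes (not given on the slide)
p_fh = 0.1
p_smoker = 0.3

def joint(fh, smoker, lung_cancer):
    """P(FH, S, LC) as the product of each node's probability given its parents."""
    p_parents = (p_fh if fh else 1 - p_fh) * (p_smoker if smoker else 1 - p_smoker)
    p_lc = p_lc_given[(fh, smoker)]
    return p_parents * (p_lc if lung_cancer else 1 - p_lc)

# Summing the joint over all eight value combinations should give 1
total = sum(joint(f, s, l)
            for f in (True, False)
            for s in (True, False)
            for l in (True, False))
```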
Training Bayesian Networks

   Several scenarios:

    – Given both the network structure and all variables
      observable: learn only the CPTs

    – Network structure known, some hidden variables: gradient
      descent (greedy hill-climbing) method, analogous to neural
      network learning

    – Network structure unknown, all variables observable:
      search through the model space to reconstruct network
      topology

    – Unknown structure, all hidden variables: No good
      algorithms known for this purpose.
Support Vector Machines
Sabic
   Email Mohammed S. Al-Shahrani
     – shahranims@sabic.com
Support Vector Machines




   Find a linear hyperplane (decision boundary) that will separate the
    data
Support Vector Machines




   One Possible Solution
Support Vector Machines




   Another possible solution
Support Vector Machines




   Other possible solutions
Support Vector Machines




    Which one is better? B1 or B2?
    How do you define better?
Support Vector Machines




   Find a hyperplane that maximizes the margin => B1 is better than B2
Support Vectors
                  Support Vectors
Support Vector Machines




           Support Vectors
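Support Vector Machines

    Decision boundary:    $\vec{w} \cdot \vec{x} + b = 0$
    Margin hyperplanes:   $\vec{w} \cdot \vec{x} + b = +1$   and   $\vec{w} \cdot \vec{x} + b = -1$

    $f(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x} + b \le -1 \end{cases}$

    $\text{Margin} = \frac{2}{\|\vec{w}\|}$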
Finding the Decision Boundary

   Let {x1, ..., xn} be our data set and let yi ∈ {1,-1} be the class
    label of xi
   The decision boundary should classify all points correctly ⇒

        $y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1$   for all i

   The decision boundary can be found by solving the following
    constrained optimization problem

        minimize $\frac{1}{2}\|\vec{w}\|^2$   subject to   $y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1$   for all i

   Solving this constrained optimization problem is beyond the scope
    of our course
Support Vector Machines
   We want to maximize:  $\text{Margin} = \frac{2}{\|\vec{w}\|}$

    – Which is equivalent to minimizing:  $L(\vec{w}) = \frac{\|\vec{w}\|^2}{2}$

    – But subject to the following constraints:

          $f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 \end{cases}$

        • This is a constrained optimization problem
            – Numerical approaches to solve it (e.g., quadratic
              programming)
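
To make the decision function and the margin concrete, here is a minimal sketch for an already-trained linear SVM. The weight vector and bias are hypothetical, new points are classified by the sign of w · x + b (an assumption, since the slides only state the training constraints), and the margin is computed as the geometric distance 2/||w|| between the hyperplanes w · x + b = +1 and w · x + b = −1.

```python
import math

def decision_value(w, b, x):
    """w . x + b for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """+1 on the positive side of the boundary, -1 otherwise."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical separating hyperplane in two dimensions
w, b = [3.0, 4.0], -5.0
width = margin_width(w)   # ||w|| = 5, so the width is 2/5
```

Finding w and b in the first place requires a quadratic programming solver, which is outside the scope of this sketch.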
Classifying new Tuples

   The decision boundary is determined only by the support vectors

   Let tj (j=1, ..., s) be the indices of the s support vectors.


   For testing with a new data z

     – Compute                                                      and
       classify z as class 1 if the sum is positive, and class 2
       otherwise
Support Vector Machines


                     Support Vectors
Support Vector Machines
   What if the training set is not linearly separable?

   Slack variables ξi can be added to allow misclassification of
    difficult or noisy examples; the resulting margin is called a soft
    margin.

    (Figure: separating hyperplane with the slack ξi marked for two points)
Support Vector Machines
   What if the problem is not linearly separable?
    – Introduce slack variables
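        • Need to minimize:

              $L(\vec{w}) = \frac{\|\vec{w}\|^2}{2} + C \left( \sum_{i=1}^{N} \xi_i^k \right)$

        • Subject to:

              $f(\vec{x}_i) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \vec{w} \cdot \vec{x}_i + b \le -1 + \xi_i \end{cases}$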
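
The soft-margin objective can be evaluated for a given w and b as sketched below. The sketch assumes k = 1 and takes each ξi as the smallest non-negative value that satisfies its constraint; the data points are hypothetical.

```python
def slack(w, b, x, y):
    """Smallest xi >= 0 such that y * (w.x + b) >= 1 - xi."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, data, C=1.0, k=1):
    """||w||^2 / 2 + C * sum(xi^k) for the given hyperplane."""
    norm_sq = sum(wi * wi for wi in w)
    penalty = sum(slack(w, b, x, y) ** k for x, y in data)
    return norm_sq / 2.0 + C * penalty

# Hypothetical 1-D data; the last point lies on the wrong side of the boundary
data = [([2.0], 1), ([-2.0], -1), ([-0.5], 1)]
objective = soft_margin_objective([1.0], 0.0, data, C=1.0)
```

A larger C penalizes slack more heavily, so the optimizer trades a wider margin for fewer violations.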
Nonlinear Support Vector Machines

   What if decision boundary is not linear?
Non-linear SVMs
   Datasets that are linearly separable with some noise work out
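    great:

    (Figure: points of two classes along the x axis)

   But what are we going to do if the dataset is just too hard?

    (Figure: points of two classes along the x axis)

   How about… mapping data to a higher-dimensional space:

    (Figure: the same points plotted with x on the horizontal axis and x² on
     the vertical axis)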
Non-linear SVMs: Feature spaces

   General idea: the original feature space can always be mapped to
    some higher-dimensional feature space where the training set is
    separable:




                           Φ: x → φ(x)
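
As a small illustration of the mapping idea, the sketch below maps one-dimensional points to (x, x²). The data set is hypothetical: it is not separable by a single threshold on x, but it is separable by a horizontal line in the mapped space.

```python
def phi(x):
    """Map a 1-D point to the 2-D feature space (x, x^2)."""
    return (x, x * x)

# Hypothetical data: class +1 sits between the class -1 points on the line
points = [(-3, -1), (-2, -1), (-1, 1), (0, 1), (1, 1), (2, -1), (3, -1)]

def classify(x, threshold=2.5):
    """Linear rule in feature space: the line x^2 = threshold splits the classes."""
    return 1 if phi(x)[1] < threshold else -1

all_correct = all(classify(x) == label for x, label in points)
```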
prediction

Linear Regression
What Is Prediction?
   (Numerical) prediction is similar to classification

     – construct a model

     – use model to predict continuous or ordered value for a given
       input
   Prediction is different from classification

     – Classification predicts categorical class labels

     – Prediction models continuous-valued functions
   Major method for prediction: regression

     – model the relationship between one or more predictor
       variables and a response variable
Prediction


(Figure: scatter plot of the training data, with the predictor attribute (X)
 on the horizontal axis and the response attribute (Y) on the vertical axis)
Types of Correlation




Positive correlation   Negative correlation   No correlation
Regression Analysis
   Simple Linear regression
   multiple regression
   Non-linear regression
   Other regression methods:
    – generalized linear model,
    – Poisson regression,
    – log-linear models,
    – regression trees
Simple Linear Regression
describes the linear relationship between a predictor variable,
plotted on the x-axis, and a response variable, plotted on the
y-axis

Simple Linear Regression

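    $Y = \beta_0 + \beta_1 X$

    (Figure: straight line in the X-Y plane; β0 is the intercept and β1 is
     the slope, i.e. the change in Y for a change of 1.0 in X)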
Simple Linear Regression

    (Figure: scatter plot of Y against X with a fitted line; ε marks the
     deviation of a data point from the line)
Simple Linear Regression

Fitting data to a linear model


    $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

    β0 = intercept,   β1 = slope,   εi = residuals
Simple Linear Regression


How to fit data to a linear model?




     Least squares method
Least Squares Regression

    Model line: $\hat{Y} = \beta_0 + \beta_1 X$
    Residual: $\varepsilon = Y - \hat{Y}$

    Sum of squares of residuals = $\sum (Y - \hat{Y})^2$

   We must find values of β0 and β1 that minimise $\sum (Y - \hat{Y})^2$
Linear Regression

   A model line y = w0 + w1 x, fitted by the method of least
    squares to estimate the best-fitting straight line, has:

        $w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}$

        $w_0 = \bar{y} - w_1 \bar{x}$
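
The two closed-form estimates translate directly into code. This is a minimal sketch with hypothetical data that lies exactly on y = 1 + 2x, so the recovered coefficients are easy to check.

```python
def fit_line(xs, ys):
    """Least squares estimates (w0, w1) for the model y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Hypothetical data lying exactly on y = 1 + 2x
w0, w1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

The function assumes the x values are not all identical; otherwise the denominator is zero.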
Multiple Linear Regression
   Multiple linear regression: involves more than one predictor
    variable
   The linear model with a single predictor variable X can easily
    be extended to two or more predictor variables

       $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$

     – Solvable by an extension of the least squares method, or with
       software such as SAS or S-Plus
Nonlinear Regression
   Some nonlinear models can be modeled by a polynomial
    function
   A polynomial regression model can be transformed into linear
    regression model. For example,
         $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$
     convertible to linear with new variables: $x_2 = x^2$, $x_3 = x^3$
         $y = w_0 + w_1 x + w_2 x_2 + w_3 x_3$
   Other functions, such as power function, can also be
    transformed to linear model
   Some models are intractably nonlinear
     – least squares estimates may still be obtained through extensive
       calculation on more complex formulae
Artificial Neural Networks
           (ANN)
What is an ANN?
   An ANN is a data structure that supposedly simulates
    the behavior of neurons in a biological brain.
   An ANN is composed of layers of interconnected units.
   Messages are passed along the connections from
    one unit to the other.
   Messages can change based on the weight of the
    connection and the value in the node
General Structure of ANN




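(Figure: a single unit. The inputs x0, x1, ..., xn are multiplied by the
 weights w0, w1, ..., wn and summed (∑), the bias µk is subtracted, and the
 result is passed through the activation function f.)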
ANN




Output Y is 1 if at least two of the three inputs are equal to 1.
ANN




 $Y = I(0.3 X_1 + 0.3 X_2 + 0.3 X_3 - 0.4 > 0)$

 where $I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{otherwise} \end{cases}$
Artificial Neural Networks

   Model is an assembly of inter-connected nodes and weighted links

   Output node sums up each of its input values according to the
    weights of its links

   Compare output node against some threshold t

    Perceptron Model:

        $Y = I\left(\sum_i w_i X_i - t\right)$    or    $Y = \text{sign}\left(\sum_i w_i X_i - t\right)$
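
The earlier example, where the output is 1 if at least two of the three inputs are 1, can be checked by enumerating all eight input combinations. The sketch uses the weights 0.3 and the threshold 0.4 from that example; the helper names are illustrative.

```python
from itertools import product

def perceptron(x1, x2, x3):
    """Y = I(0.3*x1 + 0.3*x2 + 0.3*x3 - 0.4 > 0)."""
    return 1 if 0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0 else 0

# Truth table over all binary inputs
table = {bits: perceptron(*bits) for bits in product((0, 1), repeat=3)}

# Compare against the rule "at least two of the three inputs are 1"
matches_rule = all(out == (1 if sum(bits) >= 2 else 0)
                   for bits, out in table.items())
```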
Neural Networks
   Advantages
    – prediction accuracy is generally high.
    – robust, works when training examples contain errors.
    – output may be discrete, real-valued, or a vector of several
      discrete or real-valued attributes.
    – fast evaluation of the learned target function.
   Criticism
    – long training time.
    – difficult to understand the learned function (weights).
    – not easy to incorporate domain knowledge.
Learning Algorithms
   Back propagation for classification
   Kohonen feature maps for clustering
   Recurrent back propagation for classification
   Radial basis function for classification
   Adaptive resonance theory
   Probabilistic neural networks
Major Steps for Back Propagation Network
   Constructing a network
    – input data representation
    – selection of number of layers, number of nodes in
      each layer.
   Training the network using training data
   Pruning the network
   Interpret the results
A Multi-Layer Feed-Forward Neural Network




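    (Figure: layered network; wij is the weight on the link from unit i to
     unit j)

        $I_j = \sum_i w_{ij} O_i + \theta_j$

        $O_j = \frac{1}{1 + e^{-I_j}}$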
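
The net input and output of a single unit can be sketched as follows. The inputs, weights and bias are hypothetical values chosen for illustration.

```python
import math

def unit_output(inputs, weights, bias):
    """Sigmoid of the weighted sum of the inputs plus the bias."""
    net_input = sum(w * o for w, o in zip(weights, inputs)) + bias   # I_j
    return 1.0 / (1.0 + math.exp(-net_input))                        # O_j

# Hypothetical unit with three inputs: net input = 0.2 - 0.5 - 0.4 = -0.7
out = unit_output([1.0, 0.0, 1.0], [0.2, 0.4, -0.5], -0.4)
```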
How A Multi-Layer Neural Network Works?
   The inputs to the network correspond to the attributes measured for
    each training tuple
   Inputs are fed simultaneously into the units making up the input layer
   They are then weighted and fed simultaneously to a hidden layer
   The number of hidden layers is arbitrary, although usually only one
   The weighted outputs of the last hidden layer are input to units making
    up the output layer, which emits the network's prediction
   The network is feed-forward in that none of the weights cycles back to
    an input unit or to an output unit of a previous layer
   From a statistical point of view, networks perform nonlinear
    regression: Given enough hidden units and enough training samples,
    they can closely approximate any function
Defining a Network Topology
   First decide the network topology: # of units in the input layer,
    # of hidden layers (if > 1), # of units in each hidden layer, and #
    of units in the output layer
   Normalize the input values for each attribute measured in the
    training tuples to [0.0, 1.0]
   Use one input unit per domain value
   For classification with more than two classes, use one
    output unit per class
   If a trained network's accuracy is unacceptable, repeat the
    training process with a different network topology or a
    different set of initial weights
Backpropagation
   Iteratively process a set of training tuples & compare the network's
    prediction with the actual known target value
   For each training tuple, the weights are modified to minimize the
    mean squared error between the network's prediction and the
    actual target value
   Modifications are made in the “backwards” direction: from the
    output layer, through each hidden layer down to the first hidden
    layer, hence “backpropagation”
   Steps
     – Initialize weights (to small random #s) and biases in the network
     – Propagate the inputs forward (by applying activation function)
     – Backpropagate the error (by updating weights and biases)
     – Terminating condition (when error is very small, etc.)
Backpropagation


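    Error of an output-layer unit j (Oj = generated value, Tj = correct value):

        $Err_j = O_j (1 - O_j)(T_j - O_j)$

    Error of a hidden-layer unit j:

        $Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$

    Weight update (l is the learning rate):

        $w_{ij} = w_{ij} + (l)\, Err_j\, O_i$

    Bias update:

        $\theta_j = \theta_j + (l)\, Err_j$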
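
One backpropagation step for a single output unit and one hidden unit feeding it can be sketched with the four update rules. All numbers, including the learning rate of 0.9, are hypothetical.

```python
def output_error(o_j, t_j):
    """Err_j for an output unit: O_j (1 - O_j)(T_j - O_j)."""
    return o_j * (1 - o_j) * (t_j - o_j)

def hidden_error(o_j, downstream):
    """Err_j for a hidden unit; downstream is a list of (Err_k, w_jk) pairs."""
    return o_j * (1 - o_j) * sum(err_k * w_jk for err_k, w_jk in downstream)

def update(value, learning_rate, err_j, o_i=1.0):
    """Weight update when o_i = O_i; bias update with the default o_i = 1."""
    return value + learning_rate * err_j * o_i

# Hypothetical values: output unit produced 0.474, target is 1.0
err_out = output_error(0.474, 1.0)
# Hidden unit produced 0.332 and feeds the output unit through weight -0.3
err_hidden = hidden_error(0.332, [(err_out, -0.3)])
new_weight = update(-0.3, 0.9, err_out, o_i=0.332)
```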
Network Pruning
   Fully connected network will be hard to articulate
   n input nodes, h hidden nodes and m output nodes
    lead to h(m+n) links (weights)
   Pruning: Remove some of the links without affecting
    classification accuracy of the network.
Other Classification Methods
   Associative classification : Association rules of the form
    condSet ⇒ class
   Genetic algorithm : Initial population of encoded rules is
    changed by mutation and cross-over, based on survival of the
    accurate ones.
   K-nearest neighbor classifier : Learning by analogy.
   Case-based reasoning : Similarity with other cases.
   Rough set theory : Approximation to equivalence classes.
   Fuzzy sets: Based on fuzzy logic (truth values between 0..1).
Lazy Learners
Lazy vs. Eager Learning
   Lazy vs. eager learning
     – Lazy learning (e.g., instance-based learning): Simply
       stores training data (or only minor processing) and waits
       until it is given a test tuple
     – Eager learning (the above discussed methods): Given a
       training set, constructs a classification model
       before receiving new (e.g., test) data to classify
   Lazy: less time in training but more time in predicting
Lazy Learner: Instance-Based Methods
     Instance-based learning:
      – Store training examples and delay the processing (“lazy
        evaluation”) until a new instance must be classified
     Typical approaches
      – k-nearest neighbor approach
          • Instances represented as points in a Euclidean
            space.
      – Case-based reasoning
          • Uses symbolic representations and knowledge-
            based inference
Nearest Neighbor Classifiers
   Basic idea:
    – If it walks like a duck, quacks like a duck, then it’s
      probably a duck

    (Figure: compute the distance from the test record to the training
     records, then choose the k “nearest” records)
Instance-Based Classifiers

                  • Store the training records
                  • Use training records to
                    predict the class label of
                    unseen cases
Definition of Nearest Neighbor


 (Figure: three panels showing the neighborhood of a record X:
  (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor)

 The k-nearest neighbors of a record x are the data points
 that have the k smallest distances to x
The k-Nearest Neighbor Algorithm
   All instances correspond to points in the n-D space
   The nearest neighbors are defined in terms of Euclidean
    distance, dist(X1, X2)
   Target function could be discrete- or real-valued
   For discrete-valued, k-NN returns the most common value
    among the k training examples nearest to xq
   Voronoi diagram: the decision surface induced by 1-NN for a
    typical set of training examples

    (Figure: query point xq surrounded by training examples labeled + and −)
Nearest-Neighbor Classifiers
Requires three things
 – The set of stored records
 – Distance Metric to compute
   distance between records
 – The value of k, the number of
   nearest neighbors to retrieve


To classify an unknown record:
 – Compute distance to other training
    records
 – Identify k nearest neighbors
 – Use class labels of nearest
   neighbors to determine the class
   label of unknown record (e.g., by
   taking majority vote)
Nearest Neighbor Classification
   Compute distance between two points:
    – Euclidean distance

      $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$

   Determine the class from nearest neighbor list
    – take the majority vote of class labels among the
      k-nearest neighbors
    – Weigh the vote according to distance
        • weight factor, $w = 1/d^2$
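
A minimal k-nearest-neighbor sketch with both voting schemes follows. The training records are hypothetical, and the small constant added to d² is an assumption made here to avoid division by zero when the query coincides with a training point.

```python
import math
from collections import defaultdict

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, query, k=3, weighted=False):
    """train is a list of (point, label) pairs; returns the winning label."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, query)
        # Majority vote counts 1 per neighbor; weighted vote counts 1/d^2
        votes[label] += 1.0 / (d * d + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical training records
train = [((1, 1), "+"), ((1, 2), "+"), ((5, 5), "-"), ((6, 5), "-"), ((2, 2), "-")]

majority = knn_classify(train, (2, 2.1), k=3)                 # two "+" against one "-"
by_distance = knn_classify(train, (2, 2.1), k=3, weighted=True)  # the very close "-" dominates
```

The last two calls show how distance weighting can overturn a plain majority vote when one neighbor is much closer than the others.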
Nearest Neighbor Classification…
   Scaling issues
     – Attributes may have to be scaled to prevent
       distance measures from being dominated by one of
       the attributes
     – Example:
        • height of a person may vary from 1.5m to 1.8m
        • weight of a person may vary from 90lb to 300lb
        • income of a person may vary from $10K to $1M
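
One common way to handle the scaling issue is min-max normalization, sketched below. The slides do not name a specific method, so this choice is an assumption, and the sample values are hypothetical points within the ranges listed above.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical samples within the ranges from the example
heights_m = [1.5, 1.65, 1.8]
incomes = [10_000, 505_000, 1_000_000]

scaled_heights = min_max_scale(heights_m)
scaled_incomes = min_max_scale(incomes)
```

After scaling, a difference in income no longer outweighs a difference in height simply because of its larger units. The function assumes the values are not all equal.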
Nearest Neighbor Classification…
   Choosing the value of k:
     – If k is too small, sensitive to noise points
     – If k is too large, neighborhood may include points from other
       classes
Metrics for Performance Evaluation
   Focus on the predictive capability of a model
     – Rather than how fast it takes to classify or build models,
       scalability, etc.
   Confusion Matrix:

                             PREDICTED CLASS

                             Class=Yes   Class=No
       ACTUAL   Class=Yes        a           b
       CLASS    Class=No         c           d

       a: TP (true positive)      b: FN (false negative)
       c: FP (false positive)     d: TN (true negative)
Metrics for Performance Evaluation…

                            PREDICTED CLASS

                              Class=Yes   Class=No

       ACTUAL   Class=Yes          a            b
       CLASS                     (TP)         (FN)
                Class=No           c            d
                                 (FP)         (TN)


   Most widely-used metric:
       $\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$

       Error Rate = 1 - Accuracy
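
Accuracy and error rate follow directly from the four cells of the confusion matrix. The counts below are hypothetical.

```python
def accuracy(tp, fn, fp, tn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + fn + fp + tn)

# Hypothetical confusion-matrix counts
acc = accuracy(tp=40, fn=10, fp=5, tn=45)
error_rate = 1 - acc
```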
Limitation of Accuracy
   Consider a 2-class problem
    – Number of Class 0 examples = 9990
    – Number of Class 1 examples = 10


   If model predicts everything to be class 0, accuracy is
    9990/10000 = 99.9 %
    – Accuracy is misleading because model does not
      detect any class 1 example
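
The imbalance example can be worked through in code. The sketch treats class 1 as the positive class (an assumption) and also reports the true positive rate, tp / (tp + fn), and the true negative rate, tn / (tn + fp), which expose what accuracy hides.

```python
def rates(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)    # true positive recognition rate
    specificity = tn / (tn + fp)    # true negative recognition rate
    return accuracy, sensitivity, specificity

# The model predicts class 0 for all 10000 examples (10 of them are class 1)
accuracy, sensitivity, specificity = rates(tp=0, fn=10, fp=0, tn=9990)
```

Accuracy comes out at 99.9% while the true positive rate is 0, which is the limitation described above.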
Alternative Classifier Accuracy Measures

   accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)

     – sensitivity = tp/pos        /* true positive recognition rate */

     – specificity = tn/neg        /* true negative recognition rate */
   precision = tp/(tp + fp)
Predictor Error Measures
   Test error (generalization error): the average loss over the test set
     – Mean absolute error:  $\frac{\sum_{i=1}^{d} |y_i - y_i'|}{d}$

     – Mean squared error:  $\frac{\sum_{i=1}^{d} (y_i - y_i')^2}{d}$

     – Relative absolute error:  $\frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$

     – Relative squared error:  $\frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$

     – The mean squared error exaggerates the presence of outliers.
       The (square) root mean squared error is popularly used instead,
       and similarly the root relative squared error
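
The four error measures, plus the root mean squared error, can be computed together. The actual and predicted values below are hypothetical; `actual` plays the role of y and `predicted` the role of y'.

```python
import math

def error_measures(actual, predicted):
    d = len(actual)
    mean = sum(actual) / d
    abs_err = sum(abs(y - p) for y, p in zip(actual, predicted))
    sq_err = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return {
        "mae": abs_err / d,
        "mse": sq_err / d,
        "rmse": math.sqrt(sq_err / d),
        "rae": abs_err / sum(abs(y - mean) for y in actual),
        "rse": sq_err / sum((y - mean) ** 2 for y in actual),
    }

# Hypothetical values: only the last prediction is off, by 2
m = error_measures([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```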
Evaluating Accuracy
   Holdout method
     – Given data is randomly partitioned into two independent sets
         • Training set (e.g., 2/3) for model construction
         • Test set (e.g., 1/3) for accuracy estimation
     – Random sampling: a variation of holdout
         • Repeat holdout k times, accuracy = avg. of the
           accuracies obtained
   Cross-validation (k-fold, where k = 10 is most popular)
     – Randomly partition the data into k mutually exclusive
       subsets, each approximately equal size
     – At i-th iteration, use Di as test set and others as training set
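
A k-fold partition can be sketched with the standard library alone. The function works on record indices rather than records, and the fixed seed is an assumption made so the example is repeatable.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

splits = list(k_fold_splits(n=20, k=10))
```

Each index appears in exactly one test set, so every record is used for testing once and for training k − 1 times.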
Evaluating Accuracy
   Bootstrap
    – Works well with small data sets
    – Samples the given training tuples uniformly with replacement
   There are several bootstrap methods; a common one is the .632 bootstrap
    – Suppose we are given a data set of d tuples. The data set is sampled
      d times, with replacement, resulting in a training set of d samples. The
      data tuples that did not make it into the training set end up forming the
      test set. About 63.2% of the original data will end up in the bootstrap,
      and the remaining 36.8% will form the test set (since
      $(1 - 1/d)^d \approx e^{-1} = 0.368$)
    – Repeat the sampling procedure k times; overall accuracy of the model:

        $acc(M) = \sum_{i=1}^{k} \left( 0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set} \right)$
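
The 36.8% figure can be observed empirically by drawing one bootstrap sample. The sketch samples indices rather than tuples, uses a fixed seed for repeatability, and shows the weighted combination for a single round only; the accuracies passed to it are hypothetical.

```python
import math
import random

def bootstrap_split(d, seed=0):
    """Sample d indices with replacement; unsampled indices form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(d) for _ in range(d)]
    chosen = set(train)
    test = [i for i in range(d) if i not in chosen]
    return train, test

def acc_632(acc_test, acc_train):
    """Weighted accuracy of one bootstrap round."""
    return 0.632 * acc_test + 0.368 * acc_train

train, test = bootstrap_split(10_000)
left_out_fraction = len(test) / 10_000   # close to e^-1 for large d
```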
Ensemble Methods
   Construct a set of classifiers from the training data
   Predict class label of previously unseen records by
    aggregating predictions made by multiple classifiers
     – Use a combination of models to increase accuracy
     – Combine a series of k learned models, M1, M2, …, Mk, with the aim
       of creating an improved model M*
   Popular ensemble methods
     – Bagging
         • averaging the prediction over a collection of classifiers
     – Boosting
         • weighted vote with a collection of classifiers
General Idea
Bagging: Bootstrap Aggregation
   Analogy: Diagnosis based on multiple doctors’ majority vote
   Training
     – Given a set D of d tuples, at each iteration i, a training set Di of d
       tuples is sampled with replacement from D (i.e., bootstrap)
     – A classifier model Mi is learned for each training set Di
   Classification: classify an unknown sample X
     – Each classifier Mi returns its class prediction
     – The bagged classifier M* counts the votes and assigns the class
       with the most votes to X
   Prediction: can be applied to the prediction of continuous values
    by taking the average value of each prediction for a given test
    tuple
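
The training and voting steps can be sketched with any base learner. Here a 1-nearest-neighbor rule on one-dimensional points stands in as a hypothetical base learner, and the data set and seed are made up for illustration.

```python
import random
from collections import Counter

def train_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbor on 1-D (value, label) pairs."""
    def predict(x):
        return min(sample, key=lambda rec: abs(rec[0] - x))[1]
    return predict

def bagging(data, k, seed=0):
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in data]   # bootstrap sample D_i
        models.append(train_1nn(sample))            # classifier M_i
    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]           # class with the most votes
    return predict

data = [(0.0, "a"), (1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b"), (10.0, "b")]
bagged = bagging(data, k=25)
```

For continuous targets, the vote in `predict` would be replaced by the average of the individual predictions.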
Bagging: Bootstrap Aggregation
   Accuracy
    – Often significantly better than a single classifier derived
      from D
    – For noisy data: not considerably worse, more robust
    – Proven to improve accuracy in prediction
Boosting
   Analogy: Consult several doctors, based on a combination of
    weighted diagnoses—weight assigned based on the previous
    diagnosis accuracy
   How does boosting work?
    –   Weights are assigned to each training tuple
    –   A series of k classifiers is iteratively learned
    –   After a classifier Mi is learned, the weights are updated to
        allow the subsequent classifier, Mi+1, to pay more attention to
        the training tuples that were misclassified by Mi

    –   The final M* combines the votes of each individual classifier,
        where the weight of each classifier's vote is a function of its
        accuracy
Boosting

   The boosting algorithm can be extended for the
    prediction of continuous values

   Comparing with bagging: boosting tends to achieve
    greater accuracy, but it also risks overfitting the
    model to misclassified data
Boosting: Adaboost
   Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
   Initially, all the weights of tuples are set the same (1/d)
   Generate k classifiers in k rounds. At round i,
    –   Tuples from D are sampled (with replacement) to form a training set
        Di of the same size
    –   Each tuple’s chance of being selected is based on its weight
    –   A classification model Mi is derived from Di
    –   Its error rate is calculated using Di as a test set
    –   If a tuple is misclassified, its weight is increased, otherwise it is
        decreased
   Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier
    Mi error rate is the sum of the weights of the misclassified tuples:
           error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)
   The weight of classifier Mi’s vote is
           \log \frac{1 - error(M_i)}{error(M_i)}
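The two AdaBoost quantities above, and the weighted vote that combines the classifiers, can be sketched as follows. This is an illustration under stated assumptions: err(Xj) is taken as 1 for a misclassified tuple and 0 otherwise, the logarithm is taken as natural (the slide does not state a base), and the function names and the example misclassification pattern are hypothetical.

```python
import math

def weighted_error(weights, misclassified):
    """error(Mi): sum of the weights of the misclassified tuples."""
    return sum(w for w, wrong in zip(weights, misclassified) if wrong)

def vote_weight(error):
    """Weight of a classifier's vote: log((1 - error) / error).
    Natural log is an assumption; the base only rescales all votes equally."""
    return math.log((1 - error) / error)

def weighted_vote(predictions, vote_weights):
    """Final M*: add up the vote weights per class and return the heaviest class."""
    totals = {}
    for label, weight in zip(predictions, vote_weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

d = 10
weights = [1 / d] * d                      # initially every tuple has weight 1/d
misclassified = [False] * 8 + [True] * 2   # hypothetical outcome for classifier Mi
err = weighted_error(weights, misclassified)
alpha = vote_weight(err)
print(err, alpha)
```

A classifier with a lower error rate gets a larger vote weight, so one accurate classifier can outvote several weak ones.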
Summary
 Classification vs. prediction
 Eager learners
     –   Decision tree
     –   Bayesian
     –   Support vector Machines (SVM)
     –   Neural Networks
     –   Linear regression
   Lazy learners
     – K-Nearest Neighbor (KNN)
   Performance (Accuracy) Evaluation
     – Holdout
     – Cross validation
     – Bootstrap
   Ensemble Methods
     – Bagging
     – Boosting
END

data minig

  • 1. Classification and Prediction
  • 2. The Course [Diagram with the components DS, DP, DW, OLAP, and DM; DM branches into Association, Classification, and Clustering.] DS = Data source, DW = Data warehouse, DM = Data Mining, DP = Staging Database
  • 3. Chapter Objectives  Learn basic techniques for data classification and prediction.  Realize the difference between the following classifications of data: – supervised classification – prediction – unsupervised classification
  • 4. Chapter Outline  What is classification and prediction of data?  How do we classify data by decision tree induction?  What are neural networks and how can they classify?  What is Bayesian classification?  Are there other classification techniques?  How do we predict continuous values?
  • 5. What is Classification?  The goal of data classification is to organize and categorize data in distinct classes. – A model is first created based on the data distribution. – The model is then used to classify new data. – Given the model, a class can be predicted for new data.  Classification = prediction for discrete and nominal values
  • 6. What is Prediction?  The goal of prediction is to forecast or deduce the value of an attribute based on values of other attributes. – A model is first created based on the data distribution. – The model is then used to predict future or unknown values  In Data Mining – If forecasting discrete value  Classification – If forecasting continuous value  Prediction
  • 7. Supervised and Unsupervised  Supervised Classification = Classification – We know the class labels and the number of classes  Unsupervised Classification = Clustering – We do not know the class labels and may not know the number of classes
  • 8. Preparing Data Before Classification  Data transformation: – Discretization of continuous data – Normalization to [-1..1] or [0..1]  Data Cleaning: – Smoothing to reduce noise  Relevance Analysis: – Feature selection to eliminate irrelevant attributes
  • 9. Application  Credit approval  Target marketing  Medical diagnosis  Defective parts identification in manufacturing  Crime zoning  Treatment effectiveness analysis  Etc
  • 10. Classification is a 3-step process  1. Model construction (Learning): • Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label. • The set of all tuples used for construction of the model is called training set. – The model is represented in the following forms: • Classification rules, (IF-THEN statements), • Decision tree • Mathematical formulae
  • 11. 1. Classification Process (Learning)
      Training Data:
        Name  | Income | Age      | Credit rating (class)
        Samir | Low    | <30      | bad
        Ahmed | Medium | [30..40] | good
        Salah | High   | <30      | good
        Ali   | Medium | >40      | good
        Sami  | Low    | [30..40] | good
        Emad  | Medium | <30      | bad
      Training Data → Classification Method → Classification Model
      Classification Model: IF Income = ‘High’ OR Age > 30 THEN Class = ‘Good’ OR Decision Tree OR Mathematical For
  • 12. Classification is a 3-step process 2. Model Evaluation (Accuracy): – Estimate accuracy rate of the model based on a test set. – The known label of test sample is compared with the classified result from the model. – Accuracy rate is the percentage of test set samples that are correctly classified by the model. – Test set is independent of training set otherwise over-fitting will occur
  • 13. 2. Classification Process (Accuracy Evaluation)
      Test data, classified by the Classification Model:
        Name  | Income | Age      | Credit rating (class) | Model
        Naser | Low    | <30      | Bad                   | Bad
        Lutfi | Medium | <30      | Bad                   | good
        Adel  | High   | >40      | good                  | good
        Fahd  | Medium | [30..40] | good                  | good
      Accuracy: 75%
  • 14. Classification is a three-step process 3. Model Use (Classification): – The model is used to classify unseen objects. • Give a class label to a new tuple • Predict the value of an actual attribute
  • 15. 3. Classification Process (Use) Classification Model Name Income Age Credit rating Adham Low <30 ?
  • 16. Classification Methods Classification Method  Decision Tree Induction  Neural Networks  Bayesian Classification  Association-Based Classification  K-Nearest Neighbour  Case-Based Reasoning  Genetic Algorithms  Rough Set Theory  Fuzzy Sets  Etc.
  • 17. Evaluating Classification Methods  Predictive accuracy – Ability of the model to correctly predict the class label  Speed and scalability – Time to construct the model – Time to use the model  Robustness – Handling noise and missing values  Scalability – Efficiency in large databases (not memory resident data)  Interpretability: – The level of understanding and insight provided by the model
  • 18. Chapter Outline  What is classification and prediction of data?  How do we classify data by decision tree induction ?  What are neural networks and how can they classify?  What is Bayesian classification?  Are there other classification techniques?  How do we predict continuous values?
  • 20. What is a Decision Tree?  A decision tree is a flow-chart-like tree structure. – Internal node denotes a test on an attribute – Branch represents an outcome of the test • All tuples in branch have the same value for the tested attribute.  Leaf node represents class label or class label distribution
  • 21. Sample Decision Tree [Figure: excellent and fair customers plotted by Income (2000 to 10000) and Age (20 to 80).] Tree: Income < 6K → No; Income >= 6K → Yes
  • 22. Sample Decision Tree [Figure: same Income and Age plot.] Tree: Income < 6k → No; Income >= 6k → test Age: Age >= 50 → No; Age < 50 → Yes
  • 23. Sample Decision Tree
        Outlook  | Temp | Humidity | Windy | Play?
        sunny    | hot  | high     | FALSE | No
        sunny    | hot  | high     | TRUE  | No
        overcast | hot  | high     | FALSE | Yes
        rainy    | mild | high     | FALSE | Yes
        rainy    | cool | normal   | FALSE | Yes
        rainy    | cool | normal   | TRUE  | No
        overcast | cool | normal   | TRUE  | Yes
        sunny    | mild | high     | FALSE | No
        sunny    | cool | normal   | FALSE | Yes
        rainy    | mild | normal   | FALSE | Yes
        sunny    | mild | normal   | TRUE  | Yes
        overcast | mild | high     | TRUE  | Yes
        overcast | hot  | normal   | FALSE | Yes
        rainy    | mild | high     | TRUE  | No
      http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.html
      http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
  • 24. Decision-Tree Classification Methods  The basic top-down decision tree generation approach usually consists of two phases: 1. Tree construction • At the start, all the training examples are at the root. • Partition examples are recursively based on selected attributes. 2. Tree pruning • Aiming at removing tree branches that may reflect noise in the training data and lead to errors when classifying test data  improve classification accuracy
  • 25. How to Specify Test Condition?  Depends on attribute types – Nominal – Ordinal – Continuous  Depends on number of ways to split – 2-way split – Multi-way split
  • 26. Splitting Based on Nominal Attributes  Multi-way split: Use as many partitions as distinct values. CarType Family Luxury Sports  Binary split: Divides values into two subsets. Need to find optimal partitioning. CarType CarType {Sports, OR {Family, Luxury} {Family} Luxury} {Sports}
  • 27. Splitting Based on Ordinal Attributes  Multi-way split: Use as many partitions as distinct values. Size Small Large Medium  Binary split: Divides values into two subsets. Need to find optimal partitioning. Size Size {Medium, {Small, {Large} OR Large} {Small} Medium} Size {Small,  What about this split? Large} {Medium}
  • 28. Splitting Based on Continuous Attributes  Different ways of handling – Discretization to form an ordinal categorical attribute • Static – discretize once at the beginning • Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering. – Binary Decision: (A < v) or (A ≥ v) • consider all possible splits and finds the best cut • can be more compute intensive
  • 29. Splitting Based on Continuous Attributes
  • 30. Tree Induction  Greedy strategy. – Split the records based on an attribute test that optimizes certain criterion.  Issues – Determine how to split the records • How to specify the attribute test condition? • How to determine the best split? – Determine when to stop splitting
  • 31. How to determine the Best Split Good customers fair customers Customers Income Age <10k >=10k young old
  • 32. How to determine the Best Split  Greedy approach: – Nodes with homogeneous class distribution are preferred  Need a measure of node impurity: High degree Low degree pure of impurity of impurity 50% red 75% red 100% red 50% green 25% green 0% green
  • 33. Measures of Node Impurity  Information gain – Uses Entropy  Gain Ratio – Uses Information Gain and Splitinfo  Gini Index – Used only for binary splits
  • 34. Algorithm for Decision Tree Induction  Basic algorithm (a greedy algorithm) – Tree is constructed in a top-down recursive divide-and-conquer manner – At start, all the training examples are at the root – Attributes are categorical (if continuous-valued, they are discretized in advance) – Examples are partitioned recursively based on selected attributes – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)  Conditions for stopping partitioning – All samples for a given node belong to the same class – There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf – There are no samples left
  • 35. Classification Algorithms  ID3 – Uses information gain  C4.5 – Uses Gain Ratio  CART – Uses Gini
  • 36. Entropy: Used by ID3
          Entropy(S) = -p \log_2 p - q \log_2 q
       Entropy measures the impurity of S
       S is a set of examples
       p is the proportion of positive examples
       q is the proportion of negative examples
  • 37. ID3
        outlook  | temperature | humidity | windy | play
        sunny    | hot         | high     | FALSE | no
        sunny    | hot         | high     | TRUE  | no
        overcast | hot         | high     | FALSE | yes
        rainy    | mild        | high     | FALSE | yes
        rainy    | cool        | normal   | FALSE | yes
        rainy    | cool        | normal   | TRUE  | no
        overcast | cool        | normal   | TRUE  | yes
        sunny    | mild        | high     | FALSE | no
        sunny    | cool        | normal   | FALSE | yes
        rainy    | mild        | normal   | FALSE | yes
        sunny    | mild        | normal   | TRUE  | yes
        overcast | mild        | high     | TRUE  | yes
        overcast | hot         | normal   | FALSE | yes
        rainy    | mild        | high     | TRUE  | no
      play: pyes = 9/14; don’t play: pno = 5/14
          Impurity = -p_{yes} \log_2 p_{yes} - p_{no} \log_2 p_{no} = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.94 bits
  • 38. ID3
      Entropy at the root: 0.94 bits (amount of information required to specify the class of an example given that it reaches the node)
        attribute   | value    | play | don't play | entropy   | weight
        outlook     | sunny    | 2    | 3          | 0.97 bits | 5/14
        outlook     | overcast | 4    | 0          | 0.0 bits  | 4/14
        outlook     | rainy    | 3    | 2          | 0.97 bits | 5/14
        humidity    | high     | 3    | 4          | 0.98 bits | 7/14
        humidity    | normal   | 6    | 1          | 0.59 bits | 7/14
        temperature | hot      | 2    | 2          | 1.0 bits  | 4/14
        temperature | mild     | 4    | 2          | 0.92 bits | 6/14
        temperature | cool     | 3    | 1          | 0.81 bits | 4/14
        windy       | false    | 6    | 2          | 0.81 bits | 8/14
        windy       | true     | 3    | 3          | 1.0 bits  | 6/14
      Weighted entropy after each split, and the resulting gain:
        outlook:     0.69 bits, gain: 0.25 bits (maximal information gain)
        humidity:    0.79 bits, gain: 0.15 bits
        temperature: 0.91 bits, gain: 0.03 bits
        windy:       0.89 bits, gain: 0.05 bits
  • 39. ID3
      Branch outlook = sunny: 0.97 bits
        outlook | temperature | humidity | windy | play
        sunny   | hot         | high     | FALSE | no
        sunny   | hot         | high     | TRUE  | no
        sunny   | mild        | high     | FALSE | no
        sunny   | cool        | normal   | FALSE | yes
        sunny   | mild        | normal   | TRUE  | yes
      Candidate splits within the branch:
        humidity:    high 0.0 bits * 3/5 + normal 0.0 bits * 2/5 = 0.0 bits, gain: 0.97 bits (maximal information gain)
        temperature: hot 0.0 bits * 2/5 + mild 1.0 bits * 2/5 + cool 0.0 bits * 1/5 = 0.40 bits, gain: 0.57 bits
        windy:       false 0.92 bits * 3/5 + true 1.0 bits * 2/5 = 0.95 bits, gain: 0.02 bits
  • 40. ID3
      Tree so far: outlook; sunny → humidity (high, normal)
      Branch outlook = rainy: 0.97 bits
        outlook | temperature | humidity | windy | play
        rainy   | mild        | high     | FALSE | yes
        rainy   | cool        | normal   | FALSE | yes
        rainy   | cool        | normal   | TRUE  | no
        rainy   | mild        | normal   | FALSE | yes
        rainy   | mild        | high     | TRUE  | no
      Candidate splits within the branch:
        humidity:    high 1.0 bits * 2/5 + normal 0.92 bits * 3/5 = 0.95 bits, gain: 0.02 bits
        temperature: hot ∅, mild 0.92 bits * 3/5 + cool 1.0 bits * 2/5 = 0.95 bits, gain: 0.02 bits
        windy:       false 0.0 bits * 3/5 + true 0.0 bits * 2/5 = 0.0 bits, gain: 0.97 bits
  • 41. ID3
      Training data: the same 14 weather tuples as on slide 37 (play / don’t play).
      Final tree: outlook
        sunny → humidity: high → No; normal → Yes
        overcast → Yes
        rainy → windy: false → Yes; true → No
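The ID3 walk-through on slides 36 to 41 can be reproduced with a short Python sketch. It is an illustration only: the function names and the whitespace-separated encoding of the weather table are assumptions made here, and the code computes the split selection at one node, not the full recursive tree construction.

```python
import math
from collections import Counter, defaultdict

ATTRS = ("outlook", "temperature", "humidity", "windy")
DATA = """
sunny hot high FALSE no
sunny hot high TRUE no
overcast hot high FALSE yes
rainy mild high FALSE yes
rainy cool normal FALSE yes
rainy cool normal TRUE no
overcast cool normal TRUE yes
sunny mild high FALSE no
sunny cool normal FALSE yes
rainy mild normal FALSE yes
sunny mild normal TRUE yes
overcast mild high TRUE yes
overcast hot normal FALSE yes
rainy mild high TRUE no
"""
ROWS = [tuple(line.split()) for line in DATA.strip().splitlines()]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index):
    """Entropy of the node minus the weighted entropy of its partitions."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

root_entropy = entropy([row[-1] for row in ROWS])
gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)  # attribute ID3 would test at the root

# Repeat the selection inside the outlook = sunny branch.
sunny = [row for row in ROWS if row[0] == "sunny"]
sunny_gains = {name: information_gain(sunny, i) for i, name in enumerate(ATTRS) if name != "outlook"}

for name, gain in gains.items():
    print(f"{name}: gain {gain:.2f} bits")
```

Rounded to two decimals, the root-level gains match the figures on slide 38, and `best` is the attribute with the maximal information gain.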
  • 42. C4.5
       Information gain measure is biased towards attributes with a large number of values
       C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)
      – GainRatio(A) = Gain(A)/SplitInfo(A)
          SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)
       Ex.
          SplitInfo_A(D) = -\frac{5}{14} \log_2\frac{5}{14} - \frac{4}{14} \log_2\frac{4}{14} - \frac{5}{14} \log_2\frac{5}{14} = 1.577
      – gain_ratio(income) = 0.029/1.577 = 0.018
       The attribute with the maximum gain ratio is selected as the splitting attribute
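A small Python sketch of the gain ratio, assuming the partition sizes |Dj| are given as a list. The function names are introduced here for illustration and are not taken from any C4.5 implementation.

```python
import math

def split_info(partition_sizes):
    """SplitInfo_A(D) for partitions D1..Dv, given their sizes |Dj|."""
    total = sum(partition_sizes)
    return -sum((n / total) * math.log2(n / total) for n in partition_sizes if n > 0)

def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo(A)."""
    return gain / split_info(partition_sizes)

# A three-way split of 14 tuples into partitions of 5, 4 and 5 tuples.
info = split_info([5, 4, 5])
print(info, gain_ratio(0.029, [5, 4, 5]))
```

Because SplitInfo grows with the number of partitions, an attribute that splits the data many ways is penalized relative to its raw information gain.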
  • 43. CART
       If a data set D contains examples from n classes, the gini index, gini(D), is defined as
          gini(D) = 1 - \sum_{j=1}^{n} p_j^2
        where p_j is the relative frequency of class j in D
       If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
          gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)
       Reduction in impurity:
          \Delta gini(A) = gini(D) - gini_A(D)
       The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
  • 44. CART
       Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”
          gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459
       Suppose the attribute income partitions D into 10 in D1: {low, medium} and 4 in D2
          gini_{income \in \{low, medium\}}(D) = \frac{10}{14} gini(D_1) + \frac{4}{14} gini(D_2)
        but gini{medium,high} is 0.30 and thus the best since it is the lowest
       All attributes are assumed continuous-valued
       May need other tools, e.g., clustering, to get the possible split values
       Can be modified for categorical attributes
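The gini computations can be sketched in Python as follows. The function names are illustrative, and the per-class counts inside D1 and D2 are hypothetical values chosen for the example; the slide gives only the partition sizes (10 and 4).

```python
def gini(class_counts):
    """gini(D) = 1 - sum of squared relative class frequencies."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(partitions):
    """gini_A(D): size-weighted gini over the partitions produced by a split.
    Each partition is a list of per-class counts."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

root = gini([9, 5])              # 9 "yes" and 5 "no" tuples, as in the example
d1, d2 = [7, 3], [2, 2]          # hypothetical class counts for D1 (10) and D2 (4)
after = gini_split([d1, d2])
reduction = root - after         # reduction in impurity for this split
print(round(root, 3), round(after, 3), round(reduction, 3))
```

CART would evaluate `gini_split` for every candidate binary split and keep the one with the smallest value.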
  • 45. Comparing Attribute Selection Measures  The three measures, in general, return good results but – Information gain: • biased towards multivalued attributes – Gain ratio: • tends to prefer unbalanced splits in which one partition is much smaller than the others – Gini index: • biased to multivalued attributes • has difficulty when # of classes is large • tends to favor tests that result in equal-sized partitions and purity in both partitions
  • 46. Other Attribute Selection Measures  CHAID: a popular decision tree algorithm, measure based on χ2 test for independence  C-SEP: performs better than info. gain and gini index in certain cases  G-statistics: has a close approximation to χ2 distribution  MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): – The best tree as the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree  Multivariate splits (partition based on multiple variable combinations) – CART: finds multivariate splits based on a linear comb. of attrs.  Which attribute selection measure is the best? – Most give good results, none is significantly superior than others
  • 47. Underfitting and Overfitting Overfitting Underfitting: when model is too simple, both training and test errors are large
  • 48. Overfitting due to Noise Decision boundary is distorted by noise point
  • 49. Underfitting due to Insufficient Examples Lack of data points in the lower half of the diagram makes it difficult to predict correctly the class labels of that region - Insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
  • 50. Two approaches to avoid Overfitting  Prepruning: – Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold – Difficult to choose an appropriate threshold  Postpruning: – Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees – Use a set of data different from the training data to decide which is the “best pruned tree”
  • 51. Scalable Decision Tree Induction Methods  ID3, C4.5, and CART are not efficient when the training set doesn’t fit the available memory. Instead the following algorithms are used – SLIQ • Builds an index for each attribute and only class list and the current attribute list reside in memory – SPRINT • Constructs an attribute list data structure – RainForest • Builds an AVC-list (attribute, value, class label) – BOAT • Uses bootstrapping to create several small samples
  • 52. BOAT  BOAT (Bootstrapped Optimistic Algorithm for Tree Construction) – Use a statistical technique called bootstrapping to create several smaller samples (subsets), each fits in memory – Each subset is used to create a tree, resulting in several trees – These trees are examined and used to construct a new tree T’ • It turns out that T’ is very close to the tree that would be generated using the whole data set together – Adv: requires only two scans of DB, an incremental alg.
  • 53. Why decision tree induction in data mining?  Relatively faster learning speed (than other classification methods)  Convertible to simple and easy to understand classification rules  Comparable classification accuracy with other methods
  • 54. Converting Tree to Rules
      Tree: Outlook
        Sunny → Humidity: High → No; Normal → Yes
        Overcast → Yes
        Rain → Wind: Strong → No; Weak → Yes
      R1: IF (Outlook=Sunny) AND (Humidity=High) THEN Play=No
      R2: IF (Outlook=Sunny) AND (Humidity=Normal) THEN Play=Yes
      R3: IF (Outlook=Overcast) THEN Play=Yes
      R4: IF (Outlook=Rain) AND (Wind=Strong) THEN Play=No
      R5: IF (Outlook=Rain) AND (Wind=Weak) THEN Play=Yes
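Rules R1 to R5 translate directly into a small rule table. The sketch below is one possible encoding in Python; the `RULES` structure and the `classify` helper are names assumed for this example.

```python
# Each rule: (conditions that must all hold, predicted value of Play).
RULES = [
    ({"Outlook": "Sunny", "Humidity": "High"}, "No"),     # R1
    ({"Outlook": "Sunny", "Humidity": "Normal"}, "Yes"),  # R2
    ({"Outlook": "Overcast"}, "Yes"),                     # R3
    ({"Outlook": "Rain", "Wind": "Strong"}, "No"),        # R4
    ({"Outlook": "Rain", "Wind": "Weak"}, "Yes"),         # R5
]

def classify(example):
    """Return the prediction of the first rule whose conditions all match,
    or None if no rule covers the example."""
    for conditions, label in RULES:
        if all(example.get(attr) == value for attr, value in conditions.items()):
            return label
    return None

print(classify({"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))
```

Because the rules come from the root-to-leaf paths of one tree, they are mutually exclusive, so the order in which they are tried does not change the result.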
  • 55. Decision trees: The Weka tool
      @relation weather.symbolic
      @attribute outlook {sunny, overcast, rainy}
      @attribute temperature {hot, mild, cool}
      @attribute humidity {high, normal}
      @attribute windy {TRUE, FALSE}
      @attribute play {yes, no}
      @data
      sunny,hot,high,FALSE,no
      sunny,hot,high,TRUE,no
      overcast,hot,high,FALSE,yes
      rainy,mild,high,FALSE,yes
      rainy,cool,normal,FALSE,yes
      rainy,cool,normal,TRUE,no
      overcast,cool,normal,TRUE,yes
      sunny,mild,high,FALSE,no
      sunny,cool,normal,FALSE,yes
      rainy,mild,normal,FALSE,yes
      sunny,mild,normal,TRUE,yes
      overcast,mild,high,TRUE,yes
      overcast,hot,normal,FALSE,yes
      rainy,mild,high,TRUE,no
      http://www.cs.waikato.ac.nz/ml/weka/
  • 56. Bayesian Classifier Thomas Bayes (1702-1761)
  • 57. Basic Statistics  Assume: • D = all students, |D| = 100 • X = ICS students, |X| = 10 • C = SWE students, |C| = 20  Venn diagram counts: 6 students in X only, 4 in both X and C, 16 in C only, 74 in neither  P(X) = 10/100, P(C) = 20/100, P(X,C) = 4/100  P(X|C) = P(X,C)/P(C) = 4/20  P(C|X) = P(X,C)/P(X) = 4/10  P(X,C) = P(C|X)*P(X) = P(X|C)*P(C)
  • 58. Bayesian Classifier – Basic Equation  P(X,C) = P(C|X)*P(X) = P(X|C)*P(C)  $P(C \mid X) = \dfrac{P(C)\,P(X \mid C)}{P(X)}$  P(C): class prior probability  P(X|C): descriptor posterior probability  P(C|X): class posterior probability  P(X): descriptor prior probability
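As a quick check of the basic equation, the student counts from the Basic Statistics slide (|D| = 100, |X| = 10, |C| = 20, 4 students in both) can be plugged in with exact fractions. The variable names below are illustrative only.

```python
from fractions import Fraction

# Counts from the Basic Statistics slide.
n_D, n_X, n_C, n_XC = 100, 10, 20, 4

p_X = Fraction(n_X, n_D)    # P(X)
p_C = Fraction(n_C, n_D)    # P(C)
p_XC = Fraction(n_XC, n_D)  # P(X,C)

p_X_given_C = p_XC / p_C    # P(X|C) = 4/20
p_C_given_X = p_XC / p_X    # P(C|X) = 4/10

# Bayes' rule: P(C|X) = P(C) * P(X|C) / P(X)
bayes_posterior = p_C * p_X_given_C / p_X
print(p_X_given_C, p_C_given_X, bayes_posterior)
```

Both routes to P(C|X), the direct ratio and Bayes' rule, give 4/10.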
  • 59. Naive Bayesian Classifier  $P(C \mid X) = \dfrac{P(C)\,P(X \mid C)}{P(X)}$  Independence assumption about descriptors: for each class $C_k$, $k = 1, \dots, m$: $P(C_k \mid X) = \dfrac{P(C_k)}{P(X)}\, P(x_1 \mid C_k)\, P(x_2 \mid C_k)\, P(x_3 \mid C_k) \cdots P(x_n \mid C_k)$
  • 60. Training Data
      Outlook   Temp  Humidity  Windy  Play?
      sunny     hot   high      FALSE  No
      sunny     hot   high      TRUE   No
      overcast  hot   high      FALSE  Yes
      rainy     mild  high      FALSE  Yes
      rainy     cool  normal    FALSE  Yes
      rainy     cool  normal    TRUE   No
      overcast  cool  normal    TRUE   Yes
      sunny     mild  high      FALSE  No
      sunny     cool  normal    FALSE  Yes
      rainy     mild  normal    FALSE  Yes
      sunny     mild  normal    TRUE   Yes
      overcast  mild  high      TRUE   Yes
      overcast  hot   normal    FALSE  Yes
      rainy     mild  high      TRUE   No
      P(yes) = 9/14  P(no) = 5/14
  • 61. Bayesian Classifier – Probabilities for the weather data
      Frequency Tables
      Outlook  | No   Yes     Temp. | No   Yes     Humidity | No   Yes     Windy | No   Yes
      Sunny    | 3    2       Hot   | 2    2       High     | 4    3       False | 2    6
      Overcast | 0    4       Mild  | 2    4       Normal   | 1    6       True  | 3    3
      Rainy    | 2    3       Cool  | 1    3
      Likelihood Tables
      Outlook  | No   Yes     Temp. | No   Yes     Humidity | No   Yes     Windy | No   Yes
      Sunny    | 3/5  2/9     Hot   | 2/5  2/9     High     | 4/5  3/9     False | 2/5  6/9
      Overcast | 0/5  4/9     Mild  | 2/5  4/9     Normal   | 1/5  6/9     True  | 3/5  3/9
      Rainy    | 2/5  3/9     Cool  | 1/5  3/9
  • 62. Bayesian Classifier – Predicting a new day  New day X: Outlook = sunny, Temp. = cool, Humidity = high, Windy = true, Play = ?  Score for yes: p(sunny|yes) x p(cool|yes) x p(high|yes) x p(true|yes) x p(yes) = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053  Score for no: p(sunny|no) x p(cool|no) x p(high|no) x p(true|no) x p(no) = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206  These scores are proportional to the posteriors (the common factor P(X) is omitted), so normalize them: P(yes|X) = 0.0053/(0.0053+0.0206) = 0.205 and P(no|X) = 0.0206/(0.0053+0.0206) = 0.795  Predicted class: no
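The prediction for the new day can be reproduced directly from the 14 training tuples. The following is a minimal sketch of a naive Bayesian classifier for nominal attributes, with no smoothing; the function names `scores` and `posterior` are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction

# Weather training data: (outlook, temp, humidity, windy, play)
data = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]

class_counts = Counter(row[-1] for row in data)
# value_counts[(attribute_index, value, class)] = frequency-table entry
value_counts = Counter()
for row in data:
    for index, value in enumerate(row[:-1]):
        value_counts[(index, value, row[-1])] += 1


def scores(x):
    """Unnormalized score P(C) * prod_i P(x_i | C) for each class C."""
    result = {}
    for cls, n_cls in class_counts.items():
        score = Fraction(n_cls, len(data))  # class prior
        for index, value in enumerate(x):
            score *= Fraction(value_counts[(index, value, cls)], n_cls)
        result[cls] = score
    return result


def posterior(x):
    """Normalize the scores so that they sum to 1."""
    raw = scores(x)
    total = sum(raw.values())
    return {cls: score / total for cls, score in raw.items()}


new_day = ("sunny", "cool", "high", "TRUE")
post = posterior(new_day)
print({cls: round(float(p), 3) for cls, p in post.items()})
```

Using exact fractions avoids rounding in the intermediate products; the normalized values round to 0.205 for yes and 0.795 for no.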
  • 63. Bayesian Classifier – zero frequency problem  What if a descriptor value doesn’t occur with every class value? For example, P(outlook=overcast|No) = 0  Remedy: add 1 to the count for every descriptor-class combination (Laplace Estimator)
      Outlook  | No   Yes     Temp. | No   Yes     Humidity | No   Yes     Windy | No   Yes
      Sunny    | 3+1  2+1     Hot   | 2+1  2+1     High     | 4+1  3+1     False | 2+1  6+1
      Overcast | 0+1  4+1     Mild  | 2+1  4+1     Normal   | 1+1  6+1     True  | 3+1  3+1
      Rainy    | 2+1  3+1     Cool  | 1+1  3+1
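A short sketch of the Laplace estimator for one attribute within one class. The slide shows only the "+1" on the counts; the denominator used here (class total plus the number of distinct values, so that the estimates still sum to 1) is the usual convention and is an assumption about what the slide intends.

```python
from fractions import Fraction


def laplace_estimates(counts):
    """counts: {descriptor value: raw count within one class}.

    Adds 1 to every count; the denominator grows by the number of
    distinct values (an assumed convention, not shown on the slide).
    """
    n_values = len(counts)
    total = sum(counts.values())
    return {value: Fraction(count + 1, total + n_values)
            for value, count in counts.items()}


# Outlook counts for class "No" from the frequency table.
outlook_given_no = laplace_estimates({"sunny": 3, "overcast": 0, "rainy": 2})
print(outlook_given_no)
```

With smoothing, P(outlook=overcast|No) becomes 1/8 instead of 0, so a single unseen combination no longer forces the whole product to zero.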
  • 64. Bayesian Classifier – General Equation  $P(C_k \mid X) = \dfrac{P(X \mid C_k)\,P(C_k)}{P(X)}$  Likelihood: $P(X \mid C_k)$  Continuous variable: $P(x \mid C) = \dfrac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\dfrac{(x-\mu)^2}{2\sigma^2}\right)$
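For a continuous descriptor, the likelihood is evaluated from the normal density using the mean µ and standard deviation σ of that descriptor within the class. The sketch below implements the formula; the example values (µ = 73, σ = 6.2, x = 66) are hypothetical and are not taken from the slides.

```python
import math


def gaussian_likelihood(x, mu, sigma):
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (2 pi sigma^2)^(1/2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


# Hypothetical class statistics for a numeric attribute such as temperature.
likelihood = gaussian_likelihood(66, mu=73, sigma=6.2)
print(likelihood)
```

The returned value is a density, not a probability, but it can be multiplied into the naive Bayes product in the same way as the nominal likelihoods.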
  • 65. Bayesian Classifier – Dealing with numeric attributes
  • 66. Bayesian Classifier – Dealing with numeric attributes
  • 67. Naïve Bayesian Classifier: Comments  Advantages – Easy to implement – Good results obtained in most of the cases  Disadvantages – Assumption: class conditional independence, therefore loss of accuracy – Practically, dependencies exist among variables • E.g., hospitals: patients: Profile: age, family history, etc. Symptoms: fever, cough etc., Disease: lung cancer, diabetes, etc. • Dependencies among these cannot be modeled by Naïve Bayesian Classifier  How to deal with these dependencies? – Bayesian Belief Networks
  • 68. Bayesian Belief Networks  A Bayesian belief network allows conditional independencies to be defined between subsets of the variables  A graphical model of causal relationships – Represents dependency among the variables – Gives a specification of joint probability distribution  Nodes: random variables  Links: dependency  Example graph with nodes X, Y, Z, P: X and Y are the parents of Z, and Y is the parent of P  No dependency between Z and P  Has no loops or cycles
  • 69. Bayesian Belief Network: An Example  Network nodes: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea  The conditional probability table (CPT) for variable LungCancer:
             (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
      LC     0.8       0.5        0.7        0.1
      ~LC    0.2       0.5        0.3        0.9
       CPT shows the conditional probability for each possible combination of its parents  Derivation of the probability of a particular combination of values of X, from CPT: $P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))$
  • 70. Training Bayesian Networks  Several scenarios: – Given both the network structure and all variables observable: learn only the CPTs – Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning – Network structure unknown, all variables observable: search through the model space to reconstruct network topology – Unknown structure, all hidden variables: No good algorithms known for this purpose.
  • 72. Sabic  Email Mohammed S. Al-Shahrani – shahranims@sabic.com
  • 73. Support Vector Machines  Find a linear hyperplane (decision boundary) that will separate the data
  • 74. Support Vector Machines  One Possible Solution
  • 75. Support Vector Machines  Another possible solution
  • 76. Support Vector Machines  Other possible solutions
  • 77. Support Vector Machines  Which one is better? B1 or B2?  How do you define better?
  • 78. Support Vector Machines  Find a hyperplane that maximizes the margin => B1 is better than B2
  • 79. Support Vectors Support Vectors
  • 80. Support Vector Machines Support Vectors
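  • 81. Support Vector Machines  Decision boundary: $\mathbf{w} \cdot \mathbf{x} + b = 0$  Margin boundaries: $\mathbf{w} \cdot \mathbf{x} + b = +1$ and $\mathbf{w} \cdot \mathbf{x} + b = -1$  $f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le -1 \end{cases}$  $\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}$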
  • 82. Finding the Decision Boundary  Let {x1, ..., xn} be our data set and let yi ∈ {1,-1} be the class label of xi  The decision boundary should classify all points correctly ⇒ $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ for all i  The decision boundary can be found by solving the following constrained optimization problem: minimize $\frac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ for all i  Solving it is beyond our course
  • 83. Support Vector Machines  We want to maximize: $\text{Margin} = \dfrac{2}{\|\mathbf{w}\|}$ – Which is equivalent to minimizing: $L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2}$ – But subject to the following constraints: $f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \end{cases}$ • This is a constrained optimization problem – Numerical approaches to solve it (e.g., quadratic programming)
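The sketch below does not solve the optimization problem; it only checks a candidate hyperplane against the constraints and reports its margin width 2/‖w‖. The toy points and the candidate (w, b) are hypothetical, and the constraints are written in the equivalent compact form y·(w·x + b) ≥ 1 with y ∈ {+1, −1}.

```python
import math


def decision_value(w, b, x):
    """w . x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin(w):
    """Width of the margin, 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_constraints(w, b, points, labels):
    """True if every point lies on or outside its margin boundary."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in zip(points, labels))


# Hypothetical toy data: two classes separated along the diagonal.
points = [(0, 0), (-1, 0), (2, 2), (3, 3)]
labels = [-1, -1, +1, +1]
w, b = (0.5, 0.5), -1.0  # candidate hyperplane, chosen by hand

print(satisfies_constraints(w, b, points, labels), margin(w))
```

Here (0, 0) and (2, 2) lie exactly on the two margin boundaries, so they play the role of support vectors, and the margin equals the distance between them.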
  • 84. Classifying new Tuples  The decision boundary is determined only by the support vectors  Let tj (j=1, ..., s) be the indices of the s support vectors.  For testing with a new data z – Compute $\mathbf{w} \cdot \mathbf{z} + b$, which expands to a sum over the s support vectors, and classify z as class 1 if the sum is positive, and class 2 otherwise
  • 85. Support Vector Machines Support Vectors
  • 86. Support Vector Machines  What if the training set is not linearly separable?  Slack variables ξi can be added to allow misclassification of difficult or noisy examples; the resulting margin is called soft.
  • 87. Support Vector Machines  What if the problem is not linearly separable? – Introduce slack variables • Need to minimize: $L(\mathbf{w}) = \dfrac{\|\mathbf{w}\|^2}{2} + C\left(\sum_{i=1}^{N} \xi_i^k\right)$ • Subject to: $f(\mathbf{x}_i) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \ge 1 - \xi_i \\ -1 & \text{if } \mathbf{w} \cdot \mathbf{x}_i + b \le -1 + \xi_i \end{cases}$
  • 88. Nonlinear Support Vector Machines  What if decision boundary is not linear?
  • 89. Non-linear SVMs  Datasets that are linearly separable with some noise work out great  But what are we going to do if the dataset is just too hard?  How about… mapping data to a higher-dimensional space (figure: the one-dimensional data on the x axis is re-plotted with x² against x)
  • 90. Non-linear SVMs: Feature spaces  General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
  • 92. What Is Prediction?  (Numerical) prediction is similar to classification – construct a model – use the model to predict a continuous or ordered value for a given input  Prediction is different from classification – Classification predicts categorical class labels – Prediction models continuous-valued functions  Major method for prediction: regression – model the relationship between one or more predictor variables and a response variable
  • 93. Prediction  (Figure: scatter plot of the training data, with the predictor attribute (X) on the horizontal axis and the response attribute (Y) on the vertical axis)
  • 94. Types of Correlation Positive correlation Negative correlation No correlation
  • 95. Regression Analysis  Simple Linear regression  multiple regression  Non-linear regression  Other regression methods: – generalized linear model, – Poisson regression, – log-linear models, – regression trees
  • 96. Simple Linear Regression describes the linear relationship between a predictor variable, plotted on the x-axis, and a response variable, plotted on the y-axis Y X
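  • 97. Simple Linear Regression  $Y = \beta_0 + \beta_1 X$  (Figure: a straight line with intercept $\beta_0$; Y rises by $\beta_1$ for each increase of 1.0 in X)
  • 100. Simple Linear Regression  Fitting data to a linear model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\beta_0$ is the intercept, $\beta_1$ the slope, and $\varepsilon_i$ the residuals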
  • 101. Simple Linear Regression How to fit data to a linear model? Least Square Method
  • 102. Least Squares Regression  Model line: $\hat{Y} = \beta_0 + \beta_1 X$  Residual: $\varepsilon = Y - \hat{Y}$  Sum of squares of residuals: $\sum (Y - \hat{Y})^2$  We must find values of $\beta_0$ and $\beta_1$ that minimise $\sum (Y - \hat{Y})^2$
  • 103. Linear Regression  A model line $y = w_0 + w_1 x$, obtained by using the method of least squares to estimate the best-fitting straight line, has: $w_1 = \dfrac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}$ and $w_0 = \bar{y} - w_1 \bar{x}$
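The two least-squares formulas translate directly into code. A minimal sketch follows; the function name `fit_line` and the sample points are made up for illustration (the points lie exactly on y = 1 + 2x so the result is easy to check by hand).

```python
def fit_line(xs, ys):
    """Return (w0, w1) for the least-squares line y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
w0, w1 = fit_line(xs, ys)
print(w0, w1)
```

With noisy data the same function returns the line that minimizes the sum of squared residuals.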
  • 104. Multiple Linear Regression  Multiple linear regression: involves more than one predictor variable  The linear model with a single predictor variable X can easily be extended to two or more predictor variables: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$ – Solvable by an extension of the least squares method or with software such as SAS or S-Plus
  • 105. Nonlinear Regression  Some nonlinear models can be modeled by a polynomial function  A polynomial regression model can be transformed into a linear regression model. For example, $y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$ is convertible to linear form with the new variables $x_2 = x^2$, $x_3 = x^3$: $y = w_0 + w_1 x + w_2 x_2 + w_3 x_3$  Other functions, such as power functions, can also be transformed to a linear model  Some models are intractably nonlinear – it may still be possible to obtain least squares estimates through extensive calculation on more complex formulae
  • 107. What is an ANN?  An ANN is a data structure that supposedly simulates the behavior of neurons in a biological brain.  An ANN is composed of layers of interconnected units.  Messages are passed along the connections from one unit to the other.  Messages can change based on the weight of the connection and the value in the node
  • 108. General Structure of ANN  (Figure: inputs $x_0, x_1, \dots, x_n$ with weights $w_0, w_1, \dots, w_n$ feed a weighted sum $\sum$, offset by the bias $-\mu_k$, followed by an activation function $f$)
  • 109. ANN Output Y is 1 if at least two of the three inputs are equal to 1.
  • 110. ANN  $Y = I(0.3 X_1 + 0.3 X_2 + 0.3 X_3 - 0.4 > 0)$, where $I(z) = 1$ if z is true and 0 otherwise
  • 111. Artificial Neural Networks  Model is an assembly of inter-connected nodes and weighted links  Output node sums up each of its input values according to the weights of its links  Compare output node against some threshold t  Perceptron Model: $Y = I\left(\sum_i w_i X_i - t > 0\right)$ or $Y = \text{sign}\left(\sum_i w_i X_i - t\right)$
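The claim that this unit outputs 1 exactly when at least two of the three inputs are 1 can be checked by enumerating all eight input combinations. The sketch below is only a truth-table check of the slide's weights (0.3 each) and threshold (0.4).

```python
from itertools import product


def perceptron(x1, x2, x3):
    """Y = I(0.3*x1 + 0.3*x2 + 0.3*x3 - 0.4 > 0)"""
    return 1 if 0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0 else 0


# Truth table over all binary inputs.
table = {bits: perceptron(*bits) for bits in product((0, 1), repeat=3)}
for bits, y in table.items():
    print(bits, y)
```

One active input gives 0.3 − 0.4 < 0, while two active inputs give 0.6 − 0.4 > 0, which is why the threshold of 0.4 separates the two cases.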
  • 112. Neural Networks  Advantages – prediction accuracy is generally high. – robust, works when training examples contain errors. – output may be discrete, real-valued, or a vector of several discrete or real-valued attributes. – fast evaluation of the learned target function.  Criticism – long training time. – difficult to understand the learned function (weights). – not easy to incorporate domain knowledge.
  • 113. Learning Algorithms  Back propagation for classification  Kohonen feature maps for clustering  Recurrent back propagation for classification  Radial basis function for classification  Adaptive resonance theory  Probabilistic neural networks
  • 114. Major Steps for Back Propagation Network  Constructing a network – input data representation – selection of number of layers, number of nodes in each layer.  Training the network using training data  Pruning the network  Interpret the results
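  • 115. A Multi-Layer Feed-Forward Neural Network  Weight $w_{ij}$ connects unit i to unit j in the next layer  Net input to unit j: $I_j = \sum_i w_{ij} O_i + \theta_j$  Output of unit j: $O_j = \dfrac{1}{1 + e^{-I_j}}$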
  • 116. How A Multi-Layer Neural Network Works?  The inputs to the network correspond to the attributes measured for each training tuple  Inputs are fed simultaneously into the units making up the input layer  They are then weighted and fed simultaneously to a hidden layer  The number of hidden layers is arbitrary, although usually only one  The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction  The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer  From a statistical point of view, networks perform nonlinear regression: Given enough hidden units and enough training samples, they can closely approximate any function
  • 117. Defining a Network Topology  First decide the network topology: # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer  Normalizing the input values for each attribute measured in the training tuples to [0.0—1.0]  One input unit per domain value  Output, if for classification and more than two classes, one output unit per class is used  Once a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
  • 118. Backpropagation  Iteratively process a set of training tuples & compare the network's prediction with the actual known target value  For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value  Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down to the first hidden layer, hence “backpropagation”  Steps – Initialize weights (to small random #s) and biases in the network – Propagate the inputs forward (by applying activation function) – Backpropagate the error (by updating weights and biases) – Terminating condition (when error is very small, etc.)
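  • 119. Backpropagation  Error of an output unit j: $Err_j = O_j (1 - O_j)(T_j - O_j)$, where $O_j$ is the generated value and $T_j$ is the correct value  Error of a hidden unit j: $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$  Weight update: $w_{ij} = w_{ij} + (l)\, Err_j\, O_i$  Bias update: $\theta_j = \theta_j + (l)\, Err_j$, where l is the learning rate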
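The update equations can be traced on a very small network. The following is a sketch of a single backpropagation step for one hidden layer and one sigmoid output unit; the network size, the initial weights, the learning rate of 0.9 and all function names are hypothetical choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_h, th_h, w_o, th_o):
    """Return (hidden outputs O_j, network output) for one input tuple."""
    hidden = [sigmoid(sum(w_h[i][j] * x[i] for i in range(len(x))) + th_h[j])
              for j in range(len(th_h))]
    out = sigmoid(sum(w_o[j] * hidden[j] for j in range(len(hidden))) + th_o)
    return hidden, out


def backprop_step(x, target, w_h, th_h, w_o, th_o, lr):
    """One update for a single training tuple; returns the new parameters."""
    hidden, out = forward(x, w_h, th_h, w_o, th_o)
    # Output unit: Err = O (1 - O) (T - O)
    err_out = out * (1 - out) * (target - out)
    # Hidden units: Err_j = O_j (1 - O_j) * Err_out * w_j (uses the old weights)
    err_h = [hidden[j] * (1 - hidden[j]) * err_out * w_o[j]
             for j in range(len(hidden))]
    # w = w + l * Err_j * O_i and theta = theta + l * Err_j
    new_w_o = [w_o[j] + lr * err_out * hidden[j] for j in range(len(hidden))]
    new_th_o = th_o + lr * err_out
    new_w_h = [[w_h[i][j] + lr * err_h[j] * x[i] for j in range(len(hidden))]
               for i in range(len(x))]
    new_th_h = [th_h[j] + lr * err_h[j] for j in range(len(hidden))]
    return new_w_h, new_th_h, new_w_o, new_th_o


# Hypothetical network: 2 inputs, 2 hidden units, 1 output unit.
x, target = [1.0, 0.0], 1.0
w_h = [[0.2, -0.3], [0.4, 0.1]]  # w_h[i][j]: input i -> hidden unit j
th_h = [-0.4, 0.2]
w_o = [-0.3, -0.2]               # hidden unit j -> output unit
th_o = 0.1

before = forward(x, w_h, th_h, w_o, th_o)[1]
params = backprop_step(x, target, w_h, th_h, w_o, th_o, lr=0.9)
after = forward(x, *params)[1]
print(before, after)
```

After one step the output for the same tuple moves toward the target value of 1; repeating the step over all training tuples until the error is small gives the full training loop described on the previous slide.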
  • 120. Network Pruning  Fully connected network will be hard to articulate  n input nodes, h hidden nodes and m output nodes lead to h(m+n) links (weights)  Pruning: Remove some of the links without affecting classification accuracy of the network.
  • 121. Other Classification Methods  Associative classification: association rule based, condSet → class  Genetic algorithm: an initial population of encoded rules is changed by mutation and cross-over, based on survival of the accurate ones (survival).  K-nearest neighbor classifier: learning by analogy.  Case-based reasoning: similarity with other cases.  Rough set theory: approximation to equivalence classes.  Fuzzy sets: based on fuzzy logic (truth values between 0..1).
  • 123. Lazy vs. Eager Learning  Lazy vs. eager learning – Lazy learning (e.g., instance-based learning): simply stores training data (or does only minor processing) and waits until it is given a test tuple – Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify  Lazy: less time in training but more time in predicting
  • 124. Lazy Learner: Instance-Based Methods  Instance-based learning: – Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified  Typical approaches – k-nearest neighbor approach • Instances represented as points in a Euclidean space. – Case-based reasoning • Uses symbolic representations and knowledge- based inference
  • 125. Nearest Neighbor Classifiers  Basic idea: – If it walks like a duck, quacks like a duck, then it’s probably a duck Compute Distance Test Record Choose k of the “nearest” records Training records
  • 126. Instance-Based Classifiers • Store the training records • Use training records to predict the class label of unseen cases
  • 127. Definition of Nearest Neighbor X X X (a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor K-nearest neighbors of a record x are data points that have the k smallest distance to x
  • 128. The k-Nearest Neighbor Algorithm  All instances correspond to points in the n-D space  The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)  Target function could be discrete- or real-valued  For discrete-valued, k-NN returns the most common value among the k training examples nearest to xq  Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples  (Figure: a query point xq surrounded by training examples labeled + and −)
  • 129. Nearest-Neighbor Classifiers Requires three things – The set of stored records – Distance Metric to compute distance between records – The value of k, the number of nearest neighbors to retrieve To classify an unknown record: – Compute distance to other training records – Identify k nearest neighbors – Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
  • 130. Nearest Neighbor Classification  Compute distance between two points: – Euclidean distance $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$  Determine the class from nearest neighbor list – take the majority vote of class labels among the k-nearest neighbors – Weigh the vote according to distance • weight factor, $w = 1/d^2$
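Both voting schemes fit in a few lines. The sketch below is a minimal k-nearest-neighbor classifier with an optional 1/d² weighting; the function names and the three training points are hypothetical, chosen so that the two schemes disagree.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_classify(train, query, k, weighted=False):
    """train: list of (point, label). Majority vote, or 1/d^2-weighted vote."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, query)
        if weighted:
            if d == 0:
                return label  # exact match: its weight would be infinite
            votes[label] += 1.0 / d ** 2
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)


# Hypothetical training records: one close "+" and two distant "-".
train = [((0, 0), "+"), ((4, 0), "-"), ((0, 4), "-")]
query = (1, 1)

print(knn_classify(train, query, k=3))                 # plain majority vote
print(knn_classify(train, query, k=3, weighted=True))  # distance-weighted vote
```

With k = 3 the plain vote is won by the two distant "−" records, while the weighted vote favors the single nearby "+" record (weight 1/2 against 2 × 1/10).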
  • 131. Nearest Neighbor Classification…  Scaling issues – Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes – Example: • height of a person may vary from 1.5m to 1.8m • weight of a person may vary from 90lb to 300lb • income of a person may vary from $10K to $1M
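One common way to address the scaling issue is min-max normalization, which maps every attribute to the range [0, 1] before distances are computed. The slide does not prescribe a particular scaling method, so this choice and the sample values below are assumptions for illustration.

```python
def min_max_scale(values):
    """Map values linearly to [0, 1]; assumes at least two distinct values."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical records spanning the ranges mentioned on the slide.
heights_m = [1.5, 1.6, 1.8]
incomes_usd = [10_000, 50_000, 1_000_000]

scaled_heights = min_max_scale(heights_m)
scaled_incomes = min_max_scale(incomes_usd)
print(scaled_heights, scaled_incomes)
```

After scaling, a difference across the full income range counts the same as a difference across the full height range, instead of dominating the distance by several orders of magnitude.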
  • 132. Nearest Neighbor Classification…  Choosing the value of k: – If k is too small, sensitive to noise points – If k is too large, neighborhood may include points from other classes
  • 133. Metrics for Performance Evaluation  Focus on the predictive capability of a model – Rather than how fast it takes to classify or build models, scalability, etc.  Confusion Matrix:
                            PREDICTED Class=Yes   PREDICTED Class=No
      ACTUAL Class=Yes      a                     b
      ACTUAL Class=No       c                     d
      a: TP (true positive)  b: FN (false negative)  c: FP (false positive)  d: TN (true negative)
  • 134. Metrics for Performance Evaluation…
                            PREDICTED Class=Yes   PREDICTED Class=No
      ACTUAL Class=Yes      a (TP)                b (FN)
      ACTUAL Class=No       c (FP)                d (TN)
       Most widely-used metric: $\text{Accuracy} = \dfrac{a + d}{a + b + c + d} = \dfrac{TP + TN}{TP + TN + FP + FN}$  Error Rate = 1 - Accuracy
  • 135. Limitation of Accuracy  Consider a 2-class problem – Number of Class 0 examples = 9990 – Number of Class 1 examples = 10  If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % – Accuracy is misleading because model does not detect any class 1 example
  • 136. Alternative Classifier Accuracy Measures  accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg) – sensitivity = tp/pos /* true positive recognition rate */ – specificity = tn/neg /* true negative recognition rate */  precision = tp/(tp + fp)
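These measures can be computed straight from the four confusion-matrix cells. The sketch below applies them to the imbalanced example from the Limitation of Accuracy slide, treating class 1 as the positive class (an assumption, since the slide does not name one) and a model that predicts class 0 for everything.

```python
def evaluate(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and precision from confusion counts."""
    pos, neg = tp + fn, fp + tn
    return {
        "accuracy": (tp + tn) / (pos + neg),
        "sensitivity": tp / pos,  # true positive recognition rate
        "specificity": tn / neg,  # true negative recognition rate
        # precision is undefined when nothing is predicted positive
        "precision": tp / (tp + fp) if tp + fp > 0 else None,
    }


# 10 positive and 9990 negative examples; everything is predicted negative.
result = evaluate(tp=0, fn=10, fp=0, tn=9990)
print(result)

# accuracy = sensitivity * pos/(pos+neg) + specificity * neg/(pos+neg)
pos, neg = 10, 9990
decomposed = (result["sensitivity"] * pos / (pos + neg)
              + result["specificity"] * neg / (pos + neg))
```

The accuracy of 99.9% hides a sensitivity of 0: none of the class 1 examples are detected.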
  • 137. Predictor Error Measures  Test error (generalization error): the average loss over the test set – Mean absolute error: $\dfrac{\sum_{i=1}^{d} |y_i - y_i'|}{d}$ – Mean squared error: $\dfrac{\sum_{i=1}^{d} (y_i - y_i')^2}{d}$ – Relative absolute error: $\dfrac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$ – Relative squared error: $\dfrac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$ – The mean squared error exaggerates the presence of outliers  The (square) root of the mean squared error is popularly used; similarly, the root relative squared error
  • 138. Evaluating Accuracy  Holdout method – Given data is randomly partitioned into two independent sets • Training set (e.g., 2/3) for model construction • Test set (e.g., 1/3) for accuracy estimation – Random sampling: a variation of holdout • Repeat holdout k times, accuracy = avg. of the accuracies obtained  Cross-validation (k-fold, where k = 10 is most popular) – Randomly partition the data into k mutually exclusive subsets, each approximately equal size – At i-th iteration, use Di as test set and others as training set
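The k-fold partitioning can be sketched with index lists only. The function name `k_fold_splits`, the fixed seed and the sizes n = 14, k = 7 are illustrative assumptions (the slide notes that k = 10 is the most popular choice in practice).

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random partition
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal subsets
    for i in range(k):
        test = folds[i]                        # D_i is the test set
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_splits(n=14, k=7))
for train, test in splits:
    print(sorted(test), len(train))
```

Every tuple is used for testing exactly once and for training k − 1 times, which is what distinguishes cross-validation from repeated holdout.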
  • 139. Evaluating Accuracy  Bootstrap – Works well with small data sets – Samples the given training tuples uniformly with replacement  There are several bootstrap methods; a common one is the .632 bootstrap – Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap, and the remaining 36.8% will form the test set (since $(1 - 1/d)^d \approx e^{-1} = 0.368$) – Repeat the sampling procedure k times, overall accuracy of the model: $acc(M) = \dfrac{1}{k} \sum_{i=1}^{k} \left(0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set}\right)$
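The 63.2% / 36.8% split can be observed empirically by drawing one bootstrap sample. The sketch below uses a hypothetical data set size of 10,000 and a fixed seed; only tuple indices are sampled, no model is trained.

```python
import math
import random


def bootstrap_split(d, rng):
    """Sample d indices with replacement; unsampled indices form the test set."""
    train = [rng.randrange(d) for _ in range(d)]
    in_train = set(train)
    test = [i for i in range(d) if i not in in_train]
    return train, test


d = 10_000
train, test = bootstrap_split(d, random.Random(0))

distinct_fraction = len(set(train)) / d  # share of tuples that were sampled
limit = (1 - 1 / d) ** d                 # chance that a given tuple is never picked
print(distinct_fraction, len(test) / d, limit, math.exp(-1))
```

The training set still has d entries, but many are repeats, so roughly 63.2% of the distinct tuples appear in it and the rest are left for testing.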
  • 140. Ensemble Methods  Construct a set of classifiers from the training data  Predict class label of previously unseen records by aggregating predictions made by multiple classifiers – Use a combination of models to increase accuracy – Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*  Popular ensemble methods – Bagging • averaging the prediction over a collection of classifiers – Boosting • weighted vote with a collection of classifiers
  • 142. Bagging: Bootstrap Aggregation  Analogy: Diagnosis based on multiple doctors’ majority vote  Training – Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap) – A classifier model Mi is learned for each training set Di  Classification: classify an unknown sample X – Each classifier Mi returns its class prediction – The bagged classifier M* counts the votes and assigns the class with the most votes to X  Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
  • 143. Bagging: Bootstrap Aggregation  Accuracy – Often significantly better than a single classifier derived from D – For noisy data: not considerably worse, more robust – Proved improved accuracy in prediction
  • 144. Boosting  Analogy: Consult several doctors and combine their weighted diagnoses, with each weight assigned based on the previous diagnosis accuracy  How does boosting work? – Weights are assigned to each training tuple – A series of k classifiers is iteratively learned – After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi – The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
  • 145. Boosting  The boosting algorithm can be extended for the prediction of continuous values  Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
  • 146. Boosting: Adaboost  Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)  Initially, all the weights of tuples are set the same (1/d)  Generate k classifiers in k rounds. At round i, – Tuples from D are sampled (with replacement) to form a training set Di of the same size – Each tuple’s chance of being selected is based on its weight – A classification model Mi is derived from Di – Its error rate is calculated using Di as a test set – If a tuple is misclassified, its weight is increased, otherwise it is decreased  Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi error rate is the sum of the weights of the misclassified tuples: $error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$  The weight of classifier Mi’s vote is $\log \dfrac{1 - error(M_i)}{error(M_i)}$
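The error rate and vote weight follow directly from the two formulas. The slide only says that weights are increased or decreased; the specific update used below (multiply the weights of correctly classified tuples by error/(1 − error), then normalize so the weights sum to 1) is one common formulation and is an assumption here, as are the function names and the five-tuple example.

```python
import math


def classifier_error(weights, misclassified):
    """error(M_i): sum of the weights of the misclassified tuples."""
    return sum(w for w, wrong in zip(weights, misclassified) if wrong)


def vote_weight(error):
    """Weight of the classifier's vote: log((1 - error) / error)."""
    return math.log((1 - error) / error)


def update_weights(weights, misclassified, error):
    """Assumed update rule: shrink correct tuples' weights, then normalize."""
    factor = error / (1 - error)
    scaled = [w if wrong else w * factor
              for w, wrong in zip(weights, misclassified)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Hypothetical round: 5 tuples with equal weights, the first one misclassified.
weights = [1 / 5] * 5
misclassified = [True, False, False, False, False]

error = classifier_error(weights, misclassified)
vote = vote_weight(error)
new_weights = update_weights(weights, misclassified, error)
print(error, vote, new_weights)
```

After normalization the misclassified tuple carries half of the total weight, so it is far more likely to be sampled into the next round's training set.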
  • 147. Summary  Classification Vs prediction  Eager learners – Decision tree – Bayesian – Support vector Machines (SVM) – Neural Networks – Linear regression  Lazy learners – K-Nearest Neighbor (KNN)  Performance (Accuracy) Evaluation – Holdout – Cross validation – Bootstrap  Ensemble Methods – Bagging – Boosting
  • 148. END