Data Mining

Rajendra Akerkar
What Is Data Mining?
n    Data mining (knowledge discovery from data)
      ¨ Extraction of interesting (non-trivial, implicit,
        previously unknown and potentially useful) patterns or
        knowledge from huge amounts of data

n    Is everything “data mining”?
      ¨ (Deductive) query processing.
      ¨ Expert systems or small ML/statistical programs


Build computer programs that sift through databases
  automatically, seeking regularities or patterns
Data Mining — What’s in a Name?
                       Alternative names: Information Harvesting, Knowledge Mining,
                       Knowledge Discovery in Databases, Data Dredging, Data Archaeology,
                       Data Pattern Processing, Database Mining, Knowledge Extraction,
                       Siftware

    The process of discovering meaningful new correlations, patterns, and
    trends by sifting through large amounts of stored data, using pattern
    recognition technologies and statistical and mathematical techniques

Definition
n Several Definitions
   ¨ Non-trivial extraction of implicit, previously unknown
     and potentially useful information from data

    ¨    Exploration & analysis, by automatic or semi-automatic means, of
         large quantities of data in order to discover meaningful patterns



                                            From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

What is (not) Data Mining?
   What is not Data Mining?
     – Look up a phone number in a phone directory
     – Query a Web search engine for information about “Pune”

   What is Data Mining?
     – Certain names are more common in certain Indian states
       (Joshi, Patil, Kulkarni, … in the Pune area)
     – Group together similar documents returned by a search engine
       according to their context (e.g. Google Scholar, Amazon.com)
Origins of Data Mining
n   Draws ideas from machine learning/AI, pattern recognition,
    statistics, and database systems
n   Traditional techniques may be unsuitable due to
     ¨ enormity of data
     ¨ high dimensionality of data
     ¨ heterogeneous, distributed nature of data

    (Figure: Venn diagram placing data mining at the overlap of statistics/AI,
    machine learning/pattern recognition, and database systems)

Data Mining Tasks

n    Prediction Methods
       ¨ Use    some variables to predict unknown or
           future values of other variables.

n    Description Methods
       ¨ Find  human-interpretable patterns that
           describe the data.


Data Mining Tasks...

n Classification [Predictive] predicting an item class
n Clustering [Descriptive] finding clusters in data
n Association Rule Discovery [Descriptive] finding frequently
     co-occurring items
n    Deviation/Anomaly Detection [Predictive] finding
     changes




Classification: Definition
  n    Given a collection of records (training set )
         ¨ Each record contains a set of attributes; one of the
               attributes is the class.
  n    Find a model for class attribute as a function
       of the values of other attributes.
  n    Goal: previously unseen records should be
       assigned a class as accurately as possible.
        ¨A        test set is used to determine the accuracy of the
               model. Usually, the given data set is divided into
               training and test sets, with training set used to build
               the model and test set used to validate it.

Classification Example
     Refund and Marital Status are categorical attributes, Taxable Income is
     continuous, and Cheat is the class attribute.

     Training Set:

       Tid   Refund   Marital Status   Taxable Income   Cheat
       1     Yes      Single           125K             No
       2     No       Married          100K             No
       3     No       Single           70K              No
       4     Yes      Married          120K             No
       5     No       Divorced         95K              Yes
       6     No       Married          60K              No
       7     Yes      Divorced         220K             No
       8     No       Single           85K              Yes
       9     No       Married          75K              No
       10    No       Single           90K              Yes

     Test Set (class unknown):

       Refund   Marital Status   Taxable Income   Cheat
       No       Single           75K              ?
       Yes      Married          50K              ?
       No       Married          150K             ?
       Yes      Divorced         90K              ?
       No       Single           40K              ?
       No       Married          80K              ?

     The training set is used to learn a classifier (the model); the model is
     then applied to the test set.
Classification: Application 1
n   Direct Marketing
    ¨ Goal: Reduce cost of mailing by targeting a set of
      consumers likely to buy a new cell-phone product.
    ¨ Approach:
           n   Use the data for a similar product introduced before.
           n   We know which customers decided to buy and which decided
               otherwise. This {buy, don’t buy} decision forms the class
               attribute.
           n   Collect various demographic, lifestyle, and company-
               interaction related information about all such customers.
                 ¨   Type of business, where they stay, how much they earn, etc.
           n   Use this information as input attributes to learn a classifier
               model.                           From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2
  n    Fraud Detection
        ¨ Goal: Predict fraudulent cases in credit card
          transactions.
        ¨ Approach:
               n   Use credit card transactions and the information on its
                   account-holder as attributes.
                     ¨   When does a customer buy, what does he buy, how often he
                         pays on time, etc
               n   Label past transactions as fraud or fair transactions. This
                   forms the class attribute.
               n   Learn a model for the class of the transactions.
               n   Use this model to detect fraud by observing credit card
                   transactions on an account.

Classification: Application 3
n    Customer Attrition/Churn:
      ¨ Goal:  To predict whether a customer is likely
        to be lost to a competitor.
      ¨ Approach:
               n   Use detailed record of transactions with each of the
                   past and present customers, to find attributes.
                    ¨   How often the customer calls, where he calls, what time-
                        of-the day he calls most, his financial status, marital
                        status, etc.
               n Label the customers as loyal or disloyal.
               n Find a model for loyalty.
                                                         From [Berry & Linoff] Data Mining Techniques, 1997

Decision Tree
Introduction
n    A classification scheme which generates a tree
     and a set of rules from given data set.

n    The set of records available for developing
     classification methods is divided into two disjoint
     subsets – a training set and a test set.
n    The attributes of the records are categorised into
     two types:
     ¨   Attributes whose domain is numerical are called
         numerical attributes.
     ¨   Attributes whose domain is not numerical are called
         the categorical attributes.
Introduction

n    A decision tree is a tree with the following properties:
      ¨ An inner node represents an attribute.
      ¨ An edge represents a test on the attribute of the parent
        node.
      ¨ A leaf represents one of the classes.


n    Construction of a decision tree
      ¨ Based on the training data
      ¨ Top-Down strategy




Decision Tree
Example




n    The data set has five attributes.
n    There is a special attribute: the attribute class is the class label.
n    The attributes temp (temperature) and humidity are numerical
     attributes.
n     Other attributes are categorical, that is, they cannot be ordered.

n    Based on the training data set, we want to find a set of rules to know
     what values of outlook, temperature, humidity and wind, determine
     whether or not to play golf.

Decision Tree
    Example


n    We have five leaf nodes.
n    In a decision tree, each leaf node represents a rule.

n    We have the following rules corresponding to the tree given in
     Figure.

n    RULE 1       If it is sunny and the humidity is not above 75%, then play.
n    RULE 2       If it is sunny and the humidity is above 75%, then do not play.
n    RULE 3       If it is overcast, then play.
n    RULE 4       If it is rainy and not windy, then play.
n    RULE 5       If it is rainy and windy, then don't play.

Classification
n    The classification of an unknown input vector is done by
     traversing the tree from the root node to a leaf node.
n    A record enters the tree at the root node.
n    At the root, a test is applied to determine which child
     node the record will encounter next.
n    This process is repeated until the record arrives at a leaf
     node.
n    All the records that end up at a given leaf of the tree are
     classified in the same way.
n    There is a unique path from the root to each leaf.
n    The path is a rule which is used to classify the records.

n    In our tree, we can carry out the classification for
     an unknown record as follows.
n    Let us assume, for the record, that we know the
     values of the first four attributes (but we do not
     know the value of class attribute) as

n    outlook= rain; temp = 70; humidity = 65; and
     windy= true.

n    We start from the root node to check the value of the
     attribute associated at the root node.
n    This attribute is the splitting attribute at this node.
n    For a decision tree, at every node there is an attribute
     associated with the node called the splitting attribute.

n    In our example, outlook is the splitting attribute at root.
n    Since for the given record, outlook = rain, we move to
     the right-most child node of the root.
n    At this node, the splitting attribute is windy and we find
     that for the record we want to classify, windy = true.
n    Hence, we move to the left child node to conclude that
     the class label is "no play".


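To make the traversal concrete, here is a minimal Python sketch (not part of the original slides) of the golf tree described by Rules 1-5; the record format and attribute names are assumptions for illustration.

```python
# Illustrative sketch of the golf decision tree described by Rules 1-5.
# Inner nodes test their splitting attribute; leaves hold the class label.
def classify(record):
    """Traverse the tree from the root to a leaf and return 'play' or 'no play'."""
    if record["outlook"] == "sunny":
        return "play" if record["humidity"] <= 75 else "no play"    # Rules 1 and 2
    if record["outlook"] == "overcast":
        return "play"                                                # Rule 3
    # outlook == "rain"
    return "no play" if record["windy"] else "play"                  # Rules 4 and 5

# The unknown record from the text: outlook = rain, temp = 70, humidity = 65, windy = true
print(classify({"outlook": "rain", "temp": 70, "humidity": 65, "windy": True}))  # -> no play
```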
n    The accuracy of the classifier is determined by the percentage
      of the test data set that is correctly classified.

 n    We can see that for Rule 1 there are two records of the test
      data set satisfying outlook= sunny and humidity < 75, and
      only one of these is correctly classified as play.
 n    Thus, the accuracy of this rule is 0.5 (or 50%). Similarly, the
      accuracy of Rule 2 is also 0.5 (or 50%). The accuracy of Rule
      3 is 0.66.


RULE 1
If it is sunny and the humidity
is not above 75%, then play.




Concept of Categorical Attributes

n   Consider the following training
    data set.
n    There are three attributes,
    namely, age, pincode and
    class.
n   The attribute class is used for
    class label.

     The attribute age is a numeric attribute, whereas pincode is a categorical
     one.
     Though the domain of pincode is numeric, no ordering can be defined
     among pincode values.
        You cannot derive any useful information from the fact that one pincode
        is greater than another.
n    Figure gives a decision tree for the training data.

n    The splitting attribute at the root is pincode and the splitting
     criterion here is pincode = 500 046.
n    Similarly, for the left child node, the splitting criterion is age < 48
     (the splitting attribute is age).
n    Although the right child node has the same splitting attribute, the
     splitting criterion is different.

     At the root level, we have 9 records. The associated splitting criterion
     is pincode = 500 046. As a result, we split the records into two subsets:
     records 1, 2, 4, 8, and 9 go to the left child node and the remaining
     records to the right child node. The process is repeated at every node.

Advantages and Shortcomings of
Decision Tree Classifications
n    A decision tree construction process is concerned with
     identifying the splitting attributes and splitting criterion at every
     level of the tree.

n    Major strengths are:
      ¨   Decision trees are able to generate understandable rules.
     ¨   They are able to handle both numerical and categorical attributes.
     ¨   They provide clear indication of which fields are most important for
         prediction or classification.


n    Weaknesses are:
     ¨   The process of growing a decision tree is computationally expensive. At
         each node, each candidate splitting field is examined before its best split
         can be found.
      ¨   Some decision tree algorithms can only deal with binary-valued target classes.
Iterative Dichotomizer (ID3)
n    Quinlan (1986)
n    Each node corresponds to a splitting attribute
n    Each arc is a possible value of that attribute.

n    At each node the splitting attribute is selected to be the most
     informative among the attributes not yet considered in the
     path from the root.

n    Entropy is used to measure how informative a node is.
n    The algorithm uses the criterion of information gain to
     determine the goodness of a split.
     ¨   The attribute with the greatest information gain is
         taken as the splitting attribute, and the data set is split
         for all distinct values of the attribute.
Training Dataset
  The class label attribute, buys_computer, has two distinct values, so there
  are two distinct classes (m = 2). Class C1 corresponds to yes and class C2
  corresponds to no. There are 9 samples of class yes and 5 samples of class
  no. (This follows an example from Quinlan's ID3.)

     age      income   student   credit_rating   buys_computer
     <=30     high     no        fair            no
     <=30     high     no        excellent       no
     31…40    high     no        fair            yes
     >40      medium   no        fair            yes
     >40      low      yes       fair            yes
     >40      low      yes       excellent       no
     31…40    low      yes       excellent       yes
     <=30     medium   no        fair            no
     <=30     low      yes       fair            yes
     >40      medium   yes       fair            yes
     <=30     medium   yes       excellent       yes
     31…40    medium   no        excellent       yes
     31…40    high     yes       fair            yes
     >40      medium   no        excellent       no




Extracting Classification Rules
      from Trees
  n    Represent the knowledge in the
       form of IF-THEN rules
  n    One rule is created for each
       path from the root to a leaf
  n    Each attribute-value pair along
       a path forms a conjunction
  n    The leaf node holds the class
       prediction
  n    Rules are easier for humans to                    What are the rules?
       understand



Solution (Rules)

      IF age = “<=30” AND student = “no” THEN buys_computer = “no”


      IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”


      IF age = “31…40”               THEN buys_computer = “yes”


      IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”


      IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”




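These rules translate directly into a small classifier function. The sketch below is illustrative only; the value strings (written with ASCII dots for the age range) are assumptions taken from the table above.

```python
# Illustrative translation of the extracted IF-THEN rules into a function.
def buys_computer(age, student, credit_rating):
    if age == "<=30":
        return "yes" if student == "yes" else "no"               # rules 1 and 2
    if age == "31...40":
        return "yes"                                             # rule 3
    return "no" if credit_rating == "excellent" else "yes"       # rules 4 and 5 (age > 40)

print(buys_computer("<=30", "yes", "fair"))       # -> yes
print(buys_computer(">40", "no", "excellent"))    # -> no
```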
Algorithm for Decision Tree Induction
n    Basic algorithm (a greedy algorithm)
     ¨ Tree is constructed in a top-down recursive divide-and-conquer
       manner
     ¨ At start, all the training examples are at the root
     ¨ Attributes are categorical (if continuous-valued, they are
       discretized in advance)
     ¨ Examples are partitioned recursively based on selected attributes
     ¨ Test attributes are selected on the basis of a heuristic or
       statistical measure (e.g., information gain)
n    Conditions for stopping partitioning
     ¨ All samples for a given node belong to the same class
     ¨ There are no remaining attributes for further partitioning –
       majority voting is employed for classifying the leaf
     ¨ There are no samples left


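A rough Python sketch of this greedy, top-down procedure is given below. It is an illustration under the assumption that all attributes are categorical and that the data is a list of dictionaries; the helper names (entropy, info_gain, build_tree) are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Information needed to classify a sample, given the class labels at a node."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Information gained by splitting the rows on a categorical attribute."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

def build_tree(rows, attributes, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                        # all samples in one class
        return labels[0]
    if not attributes:                               # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    branches = {}
    for value in {r[best] for r in rows}:            # partition on the selected attribute
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, [a for a in attributes if a != best], target)
    return (best, branches)

# Hypothetical usage, e.g. with the buys_computer table as a list of dicts:
# tree = build_tree(rows, ["age", "income", "student", "credit_rating"], "buys_computer")
```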
Attribute Selection Measure: Information
   Gain (ID3/C4.5)

   n    Select the attribute with the highest information gain
   n    S contains si tuples of class Ci for i = {1, …, m}
   n    Information required to classify any arbitrary tuple
        (information is encoded in bits):

            I(s1, s2, ..., sm) = - Σ_{i=1..m} (si/s) log2(si/s)

   n    Entropy of attribute A with values {a1, a2, …, av}:

            E(A) = Σ_{j=1..v} ((s1j + ... + smj)/s) I(s1j, ..., smj)

   n    Information gained by branching on attribute A:

            Gain(A) = I(s1, s2, ..., sm) - E(A)
Entropy
n    Entropy measures the homogeneity (purity) of a set of examples.
n    It gives the information content of the set in terms of the class labels of the
     examples.
n    Consider that you have a set of examples, S with two classes, P and N. Let the
     set have p instances for the class P and n instances for the class N.
n    So the total number of instances we have is t = p + n. The view [p, n] can be
     seen as a class distribution of S.

The entropy for S is defined as
n    Entropy(S) = - (p/t).log2(p/t) - (n/t).log2(n/t)

n    Example: Let a set of examples consist of 9 instances of class positive and 5
     instances of class negative.
n    Answer: p = 9 and n = 5.
n    So Entropy(S) = - (9/14).log2(9/14) - (5/14).log2(5/14)
n                  = -(0.64286)(-0.6375) - (0.35714)(-1.48557)
n                  = (0.40982) + (0.53056)
n                  = 0.940




Entropy
The entropy for a completely pure set is 0 and is 1 for a set
  with equal occurrences for both the classes.

i.e. Entropy[14,0] = - (14/14).log2(14/14) - (0/14).log2(0/14)
                    = -(1).log2(1) - (0).log2(0)     (taking 0.log2(0) = 0)
                    = 0 - 0
                    = 0

i.e. Entropy[7,7] = - (7/14).log2(7/14) - (7/14).log2(7/14)
                   = - (0.5).log2(0.5) - (0.5).log2(0.5)
                   = - (0.5).(-1) - (0.5).(-1)
                   = 0.5 + 0.5
                   =1
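A small Python helper (illustrative, not from the slides) reproduces these entropy values:

```python
import math

def entropy(p, n):
    """Entropy of a set with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                          # 0 * log2(0) is taken as 0
            result -= (count / total) * math.log2(count / total)
    return result

print(round(entropy(9, 5), 3))   # -> 0.94
print(entropy(14, 0))            # -> 0.0  (completely pure set)
print(entropy(7, 7))             # -> 1.0  (evenly mixed set)
```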
Attribute Selection by Information Gain Computation

    g      Class P: buys_computer = “yes”
    g      Class N: buys_computer = “no”
    g      I(p, n) = I(9, 5) = 0.940
    g      Compute the entropy for age:

              age       pi    ni    I(pi, ni)
              <=30      2     3     0.971
              31…40     4     0     0
              >40       3     2     0.971

              E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

           Here (5/14) I(2,3) means “age <=30” has 5 out of 14 samples, with
           2 yes's and 3 no's. Hence

              Gain(age) = I(p, n) - E(age) = 0.246

           Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and
           Gain(credit_rating) = 0.048.

           Since age has the highest information gain among the attributes, it
           is selected as the test attribute. (The training data is the
           buys_computer table shown earlier.)
Exercise 1
n    The following table consists of training data from an employee
     database.




n    Let status be the class attribute. Use the ID3 algorithm to
     construct a decision tree from the given data.


Solution 1




Other Attribute Selection
  Measures
 n    Gini index (CART, IBM IntelligentMiner)
       ¨ All      attributes are assumed continuous-valued
       ¨ Assume         there exist several possible split values for
               each attribute
       ¨ May       need other tools, such as clustering, to get the
               possible split values
       ¨ Can        be modified for categorical attributes




Gini Index (IBM IntelligentMiner)
n    If a data set T contains examples from n classes, the gini index
     gini(T) is defined as

         gini(T) = 1 - Σ_{j=1..n} pj^2

     where pj is the relative frequency of class j in T.
n    If a data set T is split into two subsets T1 and T2 with sizes N1 and
     N2 respectively, the gini index gini_split(T) of the split data is
     defined as

         gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

n    The attribute that provides the smallest gini_split(T) is chosen to
     split the node (we need to enumerate all possible splitting points for
     each attribute).

Exercise 2




Solution 2
n    SPLIT: Age <= 50

                    High   Low   Total
      S1 (left)      8      11     19
      S2 (right)     11     10     21

n    For S1: P(high) = 8/19 = 0.42 and P(low) = 11/19 = 0.58
n    For S2: P(high) = 11/21 = 0.52 and P(low) = 10/21 = 0.48
n    Gini(S1) = 1 - [0.42x0.42 + 0.58x0.58] = 1 - [0.18 + 0.34] = 1 - 0.52 = 0.48
n    Gini(S2) = 1 - [0.52x0.52 + 0.48x0.48] = 1 - [0.27 + 0.23] = 1 - 0.5 = 0.5
n    Gini-Split(Age <= 50) = 19/40 x 0.48 + 21/40 x 0.5 = 0.23 + 0.26 = 0.49

n    SPLIT: Salary <= 65K

                    High   Low   Total
      S1 (top)       18     5      23
      S2 (bottom)    1      16     17

n    For S1: P(high) = 18/23 = 0.78 and P(low) = 5/23 = 0.22
n    For S2: P(high) = 1/17 = 0.06 and P(low) = 16/17 = 0.94
n    Gini(S1) = 1 - [0.78x0.78 + 0.22x0.22] = 1 - [0.61 + 0.05] = 1 - 0.66 = 0.34
n    Gini(S2) = 1 - [0.06x0.06 + 0.94x0.94] = 1 - [0.004 + 0.884] = 1 - 0.89 = 0.11
n    Gini-Split(Salary <= 65K) = 23/40 x 0.34 + 17/40 x 0.11 = 0.20 + 0.05 = 0.25




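For reference, a small Python sketch (illustrative only) reproduces these Gini-split values from the class counts in the tables above; the exact figures are 0.493 and 0.243, with the 0.49 and 0.25 above coming from rounding the intermediate terms.

```python
def gini(counts):
    """Gini index of a node, given its class counts, e.g. (high, low)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index of a split; `partitions` is a list of class-count tuples."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Class counts (High, Low) taken from the tables above.
print(round(gini_split([(8, 11), (11, 10)]), 3))   # Age <= 50     -> 0.493
print(round(gini_split([(18, 5), (1, 16)]), 3))    # Salary <= 65K -> 0.243
```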
Exercise 3
n    In previous exercise, which is a better split
     of the data among the two split points?
     Why?




Solution 3
n    Intuitively Salary <= 65K is a better split point since it produces
     relatively ``pure'' partitions as opposed to Age <= 50, which results
     in more mixed partitions (i.e., just look at the distribution of Highs
     and Lows in S1 and S2).

n    More formally, let us consider the properties of the Gini index.
     If a partition is totally pure, i.e., has all elements from the same
     class, then gini(S) = 1-[1x1+0x0] = 1-1 = 0 (for two classes).

    On the other hand if the classes are totally mixed, i.e., both classes
    have equal probability then
    gini(S) = 1 - [0.5x0.5 + 0.5x0.5] = 1-[0.25+0.25] = 0.5.

     In other words the closer the gini value is to 0, the better the partition
     is. Since Salary has lower gini it is a better split.


Clustering
Clustering: Definition
n    Given a set of data points, each having a set of
     attributes, and a similarity measure among them,
     find clusters such that
      ¨ Data points in one cluster are more similar to one
        another.
      ¨ Data points in separate clusters are less similar to
        one another.
n    Similarity Measures:
      ¨ Euclidean Distance if attributes are continuous.
      ¨ Other Problem-specific Measures.



Clustering: Illustration
   Euclidean-distance-based clustering in 3-D space: intracluster distances
   are minimized while intercluster distances are maximized.

Clustering: Application 1

 n    Market Segmentation:
       ¨ Goal:        subdivide a market into distinct subsets of
               customers where any subset may conceivably be
               selected as a market target to be reached with a
               distinct marketing mix.

       ¨ Approach:
                n   Collect different attributes of customers based on their
                    geographical and lifestyle related information.
                n   Find clusters of similar customers.
                n   Measure the clustering quality by observing buying patterns
                    of customers in same cluster vs. those from different
                    clusters.

Clustering: Application 2
  n    Document Clustering:
        ¨ Goal:  To find groups of documents that are similar to
          each other based on the important terms appearing
          in them.
        ¨ Approach: To identify frequently occurring terms in
          each document. Form a similarity measure based on
          the frequencies of different terms. Use it to cluster.
        ¨ Gain: Information Retrieval can utilize the clusters to
          relate a new document or search term to clustered
          documents.


k-Means Clustering

n Clustering is the process of grouping data
  into clusters so that objects within a cluster
  have similarity in comparison to one
  another, but are very dissimilar to objects
  in other clusters.
n The similarities are assessed based on the
  attributes values describing these objects.

The K-Means Clustering Method
  n    Given k, the k-means algorithm is
       implemented in four steps:
       ¨       Partition objects into k nonempty subsets
       ¨       Compute seed points as the centroids of the
               clusters of the current partition (the centroid is the
               center, i.e., mean point, of the cluster)
       ¨       Assign each object to the cluster with the nearest
               seed point
        ¨       Go back to Step 2; stop when no new assignments
                are made
The K-Means Clustering Method
             n    Example
   (Figure: with K = 2, arbitrarily choose K objects as the initial cluster
   centers; assign each object to the most similar center; update the cluster
   means; then reassign the objects and update the means again until the
   assignments no longer change.)
K-Means Clustering
n    K-means is a partition based clustering
     algorithm.
n    K-means’ goal: Partition database D into K parts,
     where there is little similarity across groups, but
     great similarity within a group. More specifically,
     K-means aims to minimize the mean square
     error of each point in a cluster, with respect to its
     cluster centroid.


K-Means Example

n    Consider the following one-dimensional database with attribute A1:

         A1:  2, 4, 10, 12, 3, 20, 30, 11, 25

n    Let us use the k-means algorithm to partition this database into k = 2
     clusters. We begin by choosing two random starting points, which will
     serve as the centroids of the two clusters:

         µ C1 = 2
         µ C2 = 4

n    To form clusters, we assign each point in the database to the nearest
     centroid. For instance, 10 is closer to c2 than to c1.
n    If a point is the same distance from two centroids, such as point 3 in
     our example, we make an arbitrary assignment.

         Point:        2    4    10   12   3    20   30   11   25
         Assignment:   C1   C2   C2   C2   C1   C2   C2   C2   C2




n    Once all points have been assigned, we
     recompute the means of the clusters.

                           µ C1 = (2 + 3)/2 = 2.5
                           µ C2 = (4 + 10 + 12 + 20 + 30 + 11 + 25)/7 = 112/7 = 16




n    We then reassign each point to the two clusters based on the new means.

n    Remark: point 4 now belongs to cluster C1.

n    The steps are repeated until the means converge to their optimal values.
     In each iteration, the means are re-computed and all points are
     reassigned.

         Point:     2    4    10   12   3    20   30   11   25
         Cluster:   C1   C1   C2   C2   C1   C2   C2   C2   C2




n    In this example, only one more iteration is
     needed before the means converge. We
     compute the new means:
                       µ C1 = (2 + 3 + 4)/3 = 3
                       µ C2 = (10 + 12 + 20 + 30 + 11 + 25)/6 = 108/6 = 18



       Now if we reassign the points there is no change in the clusters.
       Hence the means have converged to their optimal values and
       the algorithm terminates.


Visualization of k-means
algorithm




Exercise
n Apply the K-means algorithm for the
  following 1-dimensional points (for k=2): 1;
  2; 3; 4; 6; 7; 8; 9.
n Use 1 and 2 as the starting centroids.




Solution
Iteration #1 (centroids 1 and 2)
  cluster of 1: {1}                  mean = 1
  cluster of 2: {2,3,4,6,7,8,9}      mean = 5.57

Iteration #2 (centroids 1 and 5.57)
  cluster of 1: {1,2,3}              mean = 2
  cluster of 5.57: {4,6,7,8,9}       mean = 6.8

Iteration #3 (centroids 2 and 6.8)
  cluster of 2: {1,2,3,4}            mean = 2.5
  cluster of 6.8: {6,7,8,9}          mean = 7.5

Iteration #4 (centroids 2.5 and 7.5)
  cluster of 2.5: {1,2,3,4}          mean = 2.5
  cluster of 7.5: {6,7,8,9}          mean = 7.5

The means haven't changed, so we stop iterating.
The final clusters are {1,2,3,4} and {6,7,8,9}.

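The same result can be reproduced with a minimal k-means sketch in Python (an illustration, not part of the slides). It follows the assign/update steps described earlier and assumes no cluster ever becomes empty, which holds for this exercise.

```python
def kmeans_1d(points, centroids):
    """k-means on 1-D points, starting from the given centroids."""
    while True:
        # Assignment step: each point goes to the cluster with the nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [sum(c) / len(c) for c in clusters]
        if new_centroids == centroids:           # means unchanged -> converged
            return centroids, clusters
        centroids = new_centroids

print(kmeans_1d([1, 2, 3, 4, 6, 7, 8, 9], [1.0, 2.0]))
# -> ([2.5, 7.5], [[1, 2, 3, 4], [6, 7, 8, 9]])
```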
K – Mean for 2-dimensional
database
n    Let us consider {x1, x2, x3, x4, x5} with following coordinates as
     two-dimensional sample for clustering:

n    x1 = (0, 2), x2 = (0, 0), x3 = (1.5,0), x4 = (5,0), x5 = (5, 2)

n    Suppose that required number of clusters is 2.
n    Initially, clusters are formed from random distribution of
     samples:
n    C1 = {x1, x2, x4} and C2 = {x3, x5}.



Centroid Calculation
n    Suppose that the given set of N samples in an n-dimensional
     space has somehow be partitioned into K clusters {C1, C2, …,
     Ck}
n    Each Ck has nk samples and each sample is exactly in one
     cluster.
n    Therefore, Σ nk = N, where k = 1, …, K.
n    The mean vector Mk of cluster Ck is defined as the centroid of the
     cluster:

         Mk = (1/nk) Σ_{i=1..nk} xik

     where xik is the ith sample belonging to cluster Ck.

n    In our example, the centroids for these two clusters are
n    M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66}
n    M2 = {( 1.5 + 5)/2, (0 +2)/2} = {3.25, 1.00}

The Square-error of the cluster

n    The square-error for cluster Ck is the sum of squared
     Euclidean distances between each sample in Ck and its
     centroid.
n    This error is called the within-cluster variation.
                         ek^2 = Σ_{i=1..nk} (xik – Mk)^2

n    Within-cluster variations, after the initial random distribution of
     samples, are
n    e1^2 = [(0 – 1.66)^2 + (2 – 0.66)^2] + [(0 – 1.66)^2 + (0 – 0.66)^2]
            + [(5 – 1.66)^2 + (0 – 0.66)^2] = 19.36
n    e2^2 = [(1.5 – 3.25)^2 + (0 – 1)^2] + [(5 – 3.25)^2 + (2 – 1)^2] = 8.12
Total Square-error
n    The square error for the entire clustering space containing K clusters
     is the sum of the within-cluster variations:

         E^2 = Σ_{k=1..K} ek^2

n    The total square error is
     E^2 = e1^2 + e2^2 = 19.36 + 8.12 = 27.48

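For reference, a small Python sketch (illustrative only) recomputes the centroids and square errors for this initial partition; the small differences from the slide's figures come from rounding the centroid coordinates in the slides.

```python
# Clusters from the initial random distribution of {x1, ..., x5} above.
C1 = [(0, 2), (0, 0), (5, 0)]     # x1, x2, x4
C2 = [(1.5, 0), (5, 2)]           # x3, x5

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(2))

def within_cluster_variation(cluster):
    """Sum of squared Euclidean distances between each sample and the centroid."""
    mx, my = centroid(cluster)
    return sum((x - mx) ** 2 + (y - my) ** 2 for x, y in cluster)

e1, e2 = within_cluster_variation(C1), within_cluster_variation(C2)
print(centroid(C1), centroid(C2))   # ~(1.67, 0.67) and (3.25, 1.0)
print(e1, e2, e1 + e2)              # ~19.33, 8.13 and 27.46; the slide reports
                                    # 19.36, 8.12 and 27.48 after rounding
```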
n    When we reassign all samples, depending on a minimum
     distance from centroids M1 and M2, the new redistribution of
     samples inside clusters will be,
n    d(M1, x1) = (1.66^2 + 1.34^2)^1/2 = 2.14   and   d(M2, x1) = 3.40   ⇒ x1 ∈ C1
n    d(M1, x2) = 1.79                           and   d(M2, x2) = 3.40   ⇒ x2 ∈ C1
n    d(M1, x3) = 0.83                           and   d(M2, x3) = 2.01   ⇒ x3 ∈ C1
n    d(M1, x4) = 3.41                           and   d(M2, x4) = 2.01   ⇒ x4 ∈ C2
n    d(M1, x5) = 3.60                           and   d(M2, x5) = 2.01   ⇒ x5 ∈ C2

      The above calculation is based on the Euclidean distance formula

         d(xi, xj) = [ Σ_{k=1..m} (xik – xjk)^2 ]^1/2


n    New Clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new
     centroids
n    M1 = {0.5, 0.67}
n    M2 = {5.0, 1.0}

n The corresponding within-cluster variations and the total
  square error are
n e1^2 = 4.17
n e2^2 = 2.00

n E^2 = 6.17


The cluster membership
stabilizes…
n    After the first iteration, the total square error is
     significantly reduced (from 27.48 to 6.17).

n    In this example, if we analyse the distances between the new
     centroids and the samples, we find that in the second iteration the
     samples are assigned to the same clusters.

n    Thus there is no further reassignment and the algorithm halts.
Variations of the K-Means Method
n   A few variants of the k-means which differ in

    ¨   Selection of the initial k means

    ¨   Strategies to calculate cluster means

n   Handling categorical data: k-modes (Huang’98)

    ¨   Replacing means of clusters with modes

    ¨   Using new dissimilarity measures to deal with categorical objects

    ¨   Using a frequency-based method to update modes of clusters

    ¨   A mixture of categorical and numerical data: k-prototype method



What is the problem of k-Means
    Method?
n   The k-means algorithm is sensitive to outliers !

     ¨   Since an object with an extremely large value may substantially
         distort the distribution of the data.

n   K-Medoids: Instead of taking the mean value of the objects in a cluster
    as a reference point, a medoid can be used, which is the most
    centrally located object in a cluster.
Exercise 2
n    Let the set X consist of the following sample points in 2 dimensional
     space:

n    X = {(1, 2), (1.5, 2.2), (3, 2.3), (2.5,-1), (0, 1.6), (-1,1.5)}

n    Let c1 = (1.5, 2.5) and c2 = (3, 1) be initial estimates of centroids for X.
n    What are the revised values of c1 and c2 after 1 iteration of k-means
     clustering (k = 2)?




Solution 2
n    For each data point, calculate the distance to
     each centroid:

                 x      y      d(xi, c1)      d(xi, c2)
   x1            1      2      0.707107       2.236068
   x2            1.5    2.2    0.3            1.920937
   x3            3      2.3    1.513275       1.3
   x4            2.5    -1     3.640055       2.061553
   x5            0      1.6    1.749286       3.059412
   x6            -1     1.5    2.692582       4.031129


n    It follows that x1, x2, x5 and x6 are closer to c1 and the
     other points are closer to c2. Hence replace c1 with the
     average of x1, x2, x5 and x6 and replace c2 with the
     average of x3 and x4. This gives:

n    c1’ = (0.375, 1.825)
n    c2’ = (2.75, 0.65)




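The same one-iteration update can be reproduced with a few lines of Python (illustrative only):

```python
# One assignment/update iteration of k-means for the exercise above.
X = [(1, 2), (1.5, 2.2), (3, 2.3), (2.5, -1), (0, 1.6), (-1, 1.5)]
c1, c2 = (1.5, 2.5), (3, 1)

def dist2(a, b):
    """Squared Euclidean distance (sufficient for nearest-centroid comparisons)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

cluster1 = [x for x in X if dist2(x, c1) <= dist2(x, c2)]
cluster2 = [x for x in X if dist2(x, c1) > dist2(x, c2)]

def mean(points):
    return tuple(round(sum(p[d] for p in points) / len(points), 3) for d in range(2))

print(mean(cluster1))   # -> (0.375, 1.825)   revised c1
print(mean(cluster2))   # -> (2.75, 0.65)     revised c2
```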
Association Rule Discovery




Market-basket problem.

n    We are given a set of items and a large collection of
     transactions, which are subsets (baskets) of these items.

n    Task: To find relationships between the presence of
     various items within these baskets.

n    Example: To analyze customers' buying habits by finding
     associations between the different items that customers
     place in their shopping baskets.


Associations discovery

n    Finding frequent patterns, associations, correlations, or
     causal structures among sets of items or objects in
     transaction databases, relational databases, and other
     information repositories

      ¨   Associations discovery uncovers affinities amongst collection of
          items
      ¨   Affinities are represented by association rules
      ¨   Associations discovery is an unsupervised approach to data
          mining.



Association Rule : Application 2

  n    Supermarket shelf management.
        ¨ Goal:  To identify items that are bought together by
          sufficiently many customers.
        ¨ Approach: Process the point-of-sale data collected
          with barcode scanners to find dependencies among
          items.
        ¨A         classic rule --
               n   If a customer buys diaper and milk, then he is very likely to
                   buy beer.
               n   So, don’t be surprised if you find six-packs stacked next to
                   diapers!


Association Rule : Application 3

n    Inventory Management:
      ¨ Goal: A consumer appliance repair company
        wants to anticipate the nature of repairs on its
        consumer products and keep the service
        vehicles equipped with the right parts to reduce
        the number of visits to consumer households.
      ¨ Approach: Process the data on tools and
        parts required in previous repairs at different
        consumer locations and discover the co-
        occurrence patterns.

What is a rule?

n         The rule in a rule induction system comes in
          the form “If this and this and this then this”
n         For a rule to be useful two pieces of
          information are needed
     1.        Accuracy (The lower the accuracy the closer the rule comes to
               random guessing)
     2.        Coverage (how often you can use a useful rule)
n         A Rule consists of two parts
     1.        The antecedent or the LHS
     2.        The consequent or the RHS.

An example
Rule                                     Accuracy   Coverage
If breakfast cereal purchased            85%        20%
then milk will be purchased

If bread purchased, then swiss           15%        6%
cheese will be purchased

If 42 years old and purchased            95%        0.01%
dry roasted peanuts, then beer
will be purchased



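As an illustration of how accuracy and coverage can be computed, here is a small Python sketch; the basket data is hypothetical, and coverage is interpreted as the fraction of baskets matching the antecedent (one common reading of the term).

```python
# Hypothetical transactions; each is the set of items in one shopping basket.
baskets = [
    {"cereal", "milk", "bread"},
    {"cereal", "milk"},
    {"bread", "butter"},
    {"cereal", "juice"},
    {"milk", "bread"},
]

def rule_stats(antecedent, consequent, baskets):
    """Coverage = fraction of baskets matching the antecedent (the LHS);
    accuracy = fraction of those baskets that also contain the consequent (the RHS)."""
    matches = [b for b in baskets if antecedent <= b]
    coverage = len(matches) / len(baskets)
    accuracy = sum(consequent <= b for b in matches) / len(matches) if matches else 0.0
    return accuracy, coverage

acc, cov = rule_stats({"cereal"}, {"milk"}, baskets)
print(f"accuracy={acc:.2f}, coverage={cov:.2f}")   # -> accuracy=0.67, coverage=0.60
```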
What is association rule mining?




Frequent Itemset




Support and Confidence




What to do with a rule?

n       Target the antecedent
n       Target the consequent
n       Target based on accuracy
n       Target based on coverage
n       Target based on “interestingness”

n       Antecedent can be one or more conditions all of which must be true in
        order for the consequent to be true at the given accuracy.
n       Generally the consequent is just a simple condition (e.g. purchasing one
        grocery store item) rather than multiple items.


•         All rules that have a certain value for the antecedent are gathered
          and presented to the user.
      •       For example, the grocery store may request all rules that have
              nails, bolts, or screws in the antecedent and try to conclude
              whether discontinuing sales of these lower-priced items will
              have any effect on the higher-margin items like hammers.


•         All rules that have a certain value for the consequent are
          gathered. Can be used to understand what affects the
          consequent.
      •       For instance, it might be useful to know which rules have “coffee”
              in their RHS. A store owner might want to put coffee close to
              other items in order to increase sales of both items, or a
              manufacturer may decide in which magazine to place its next
              coupons.




•         Sometimes accuracy is most important. Rules that are accurate 80
          or 90% of the time imply strong relationships even if the coverage
          is very low.
      •        For example, let's say a rule can only be applied one time out of 1000, but if
               the rule is very profitable that one time then it can be worthwhile. This is how
               most successful data mining applications work in the financial markets, looking
               for that limited window of time in which a very confident prediction can be made.


•         Sometimes users want to know the rules that are most widely
          applicable. By looking at rules ranked by coverage they can get a
          high-level view of what is happening in the database most of the
          time.

•         Rules are interesting when they have high coverage and high
          accuracy but deviate from the norm. Ultimately, a tradeoff between
          coverage and accuracy can be made using a measure of
          interestingness.


July 7, 2009                            Data Mining: R. Akerkar                             85
Evaluating and using rules

n    Look at simple statistics.
n    Using conjunctions and disjunctions
n    Defining “interestingness”
n    Other Heuristics




July 7, 2009          Data Mining: R. Akerkar   86
Using conjunctions and
disjunctions

n    This dramatically increases or decreases the
     coverage. For example:
      ¨ "If diet soda or regular soda or beer, then potato chips"
          covers a lot more shopping baskets than any one of
          the constraints by itself (see the sketch below).




July 7, 2009                Data Mining: R. Akerkar            87
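A quick way to see the effect is to count baskets directly. The following minimal sketch compares the coverage of a single-item antecedent with that of the disjunction; the baskets and item names are invented for illustration and are not taken from the slides.

# Toy shopping baskets, invented for illustration.
baskets = [
    {"diet soda", "potato chips"},
    {"regular soda", "pretzels"},
    {"beer", "potato chips"},
    {"milk", "bread"},
    {"beer", "diet soda", "potato chips"},
]

def coverage(baskets, antecedent_items):
    """Fraction of baskets containing at least one of the antecedent items,
    i.e. the coverage of a disjunctive antecedent."""
    hits = sum(1 for basket in baskets if basket & antecedent_items)
    return hits / len(baskets)

print(coverage(baskets, {"diet soda"}))                          # 0.4
print(coverage(baskets, {"diet soda", "regular soda", "beer"}))  # 0.8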
Defining “interestingness”
n      Interestingness must have 4 basic behaviors:
     1.        Interestingness = 0 when the rule's accuracy equals the
               background rate (the a priori probability of the RHS); such
               rules are discarded.
     2.        Interestingness increases as accuracy increases, if
               coverage is fixed.
     3.        Interestingness increases or decreases with
               coverage, if accuracy stays fixed.
     4.        Interestingness decreases with coverage for a fixed
               number of correct responses.



July 7, 2009                   Data Mining: R. Akerkar           88
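The slides do not commit to one particular formula, but a classical measure with these behaviors is Piatetsky-Shapiro's leverage; in the coverage/accuracy vocabulary it can be written as

\[
\mathrm{interest}(A \Rightarrow B)
\;=\; \mathrm{coverage}(A)\,\bigl(\mathrm{accuracy}(A \Rightarrow B) - P(B)\bigr)
\;=\; P(A \wedge B) - P(A)\,P(B)
\]

It is zero when the rule's accuracy equals the background probability of the consequent, grows with accuracy when coverage is fixed, and decreases with coverage when the number of correct responses is held fixed.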
Other Heuristics
   n    Look at the actual number of records covered,
        not just a probability or a percentage.
   n    Compare a given pattern to random chance; this
        gives an “out of the ordinary” measure.
   n    Keep it simple.




July 7, 2009            Data Mining: R. Akerkar          89
Example




       Here t supports items C, DM, and CO. The item DM is supported by 4
           out of 6 transactions in T. Thus, the support of DM is 66.6%.

July 7, 2009                    Data Mining: R. Akerkar                     90
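As a quick check, the computation can be reproduced on any transaction set with the same counts. In the sketch below the six transactions are hypothetical; only the 4-out-of-6 occurrence of DM is taken from the example above.

# Hypothetical transaction set T; only the fact that DM occurs in 4 of the
# 6 transactions is taken from the example, the rest is made up.
T = [
    {"C", "DM", "CO"},   # plays the role of transaction t above
    {"DM", "CO"},
    {"C", "DM"},
    {"DM"},
    {"C"},
    {"CO"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    count = sum(1 for trans in transactions if itemset <= trans)
    return count / len(transactions)

print(round(100 * support({"DM"}, T), 1))   # 66.7, i.e. 4 out of 6 (66.6% on the slide)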
Definition




July 7, 2009   Data Mining: R. Akerkar   91
Association Rules
n    Algorithms that obtain association rules
     from data usually divide the task into two
     parts:
      ¨ findthe frequent itemsets and
      ¨ form the rules from them.




July 7, 2009           Data Mining: R. Akerkar    92
Association Rules
n    The problem of mining association rules
     can be divided into two subproblems:
      ¨ finding all itemsets whose support is at least a user-specified
        minimum support threshold (the frequent itemsets), and
      ¨ generating, from every frequent itemset, the rules whose
        confidence is at least a user-specified minimum confidence threshold.
July 7, 2009        Data Mining: R. Akerkar    93
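The second subproblem is mechanical once the frequent itemsets and their supports are known. The following is a minimal sketch of that step, not the book's pseudocode; the function name and the toy numbers are assumptions.

from itertools import combinations

def generate_rules(frequent_supports, min_conf):
    """frequent_supports maps frozenset itemsets to their support.
    By downward closure, every subset of a frequent itemset is also a key.
    Returns (antecedent, consequent, confidence) triples with confidence >= min_conf."""
    rules = []
    for itemset, supp in frequent_supports.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = supp / frequent_supports[antecedent]
                if confidence >= min_conf:
                    rules.append((set(antecedent), set(consequent), confidence))
    return rules

# Toy supports, invented for illustration.
supports = {
    frozenset({"cereal"}): 0.25,
    frozenset({"milk"}): 0.60,
    frozenset({"cereal", "milk"}): 0.20,
}
print(generate_rules(supports, min_conf=0.7))
# -> [({'cereal'}, {'milk'}, 0.8)] : "if cereal then milk" with 80% confidence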
Definitions




July 7, 2009   Data Mining: R. Akerkar   94
a priori algorithm
n    Agrawal and Srikant in 1994.
n    It is also called the level-wise algorithm.
      ¨ It is the most accepted algorithm for finding all the
        frequent sets.
      ¨ It makes use of the downward closure property.
      ¨ The algorithm is a bottom-up search, progressing
        upward level-wise in the lattice.
n    The interesting fact –
      ¨    before reading the database at every level, it prunes
          many of the candidate sets that cannot possibly be frequent
          (because one of their subsets is not frequent).
July 7, 2009                Data Mining: R. Akerkar             95
a priori algorithm




July 7, 2009   Data Mining: R. Akerkar   96
a priori candidate-generation method




July 7, 2009   Data Mining: R. Akerkar   97
Pruning algorithm




July 7, 2009   Data Mining: R. Akerkar   98
a priori Algorithm




July 7, 2009   Data Mining: R. Akerkar   99
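A compact sketch of the level-wise search, written from the description above rather than from any published pseudocode; the helper names are mine and the code favours clarity over efficiency (it rescans the transactions once per candidate). The join and prune steps inside apriori_gen are exactly what Exercise 3 below asks you to carry out by hand.

from itertools import combinations

def apriori_gen(prev_frequent):
    """Candidate generation: join L(k-1) with itself (members that agree on all
    but the last item), then prune candidates having a (k-1)-subset that is not
    frequent (the downward closure property)."""
    prev_sorted = sorted(sorted(itemset) for itemset in prev_frequent)
    prev_set = {frozenset(itemset) for itemset in prev_frequent}
    candidates = set()
    for a, b in combinations(prev_sorted, 2):
        if a[:-1] == b[:-1]:                                    # join step
            candidate = frozenset(a) | frozenset(b)
            subsets = combinations(candidate, len(candidate) - 1)
            if all(frozenset(s) in prev_set for s in subsets):  # prune step
                candidates.add(candidate)
    return candidates

def apriori(transactions, min_support):
    """transactions: a list of sets of items.
    Returns every frequent itemset with support >= min_support (a fraction)."""
    n = len(transactions)
    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n
    level = {frozenset([item]) for t in transactions for item in t}
    level = {c for c in level if supp(c) >= min_support}
    frequent = {}
    while level:
        frequent.update({c: supp(c) for c in level})
        level = {c for c in apriori_gen(level) if supp(c) >= min_support}
    return frequent

Applied to the L3 listed in Exercise 3 below, apriori_gen returns exactly the pruned C4 of Solution 3, {{a,b,c,d}, {p,q,r,s}}.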
Exercise 3
Suppose that L3 is the list
      {{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w}, {b,c,x},
        {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s}, {q,r,s}}
Which itemsets are placed in C4 by the join
 step of the Apriori algorithm? Which are
 then removed by the prune step?



July 7, 2009              Data Mining: R. Akerkar          100
Solution3
n    At the join step of Apriori Algorithm, each
     member (set) is compared with every other
     member.
n    If all the elements of the two members are
     identical except the rightmost ones, the union of
     the two sets is placed into C4.
n    For the members of L3 given the following sets
     of four elements are placed into C4:
      {a,b,c,d}, {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,s}, {p,q,r,t}
        and {p,q,s,t}.


July 7, 2009                 Data Mining: R. Akerkar                101
Solution3 (continued)
n    At the prune step of the algorithm, each member of C4 is
     checked to see whether all its subsets of 3 elements are
     members of L3.
n    The result in this case is as follows:
      {a,b,c,d}: all of {a,b,c}, {a,b,d}, {a,c,d} and {b,c,d} are in L3, so it is retained.
      {b,c,d,w}: {b,d,w} and {c,d,w} are not in L3, so it is removed.
      {b,c,d,x}: {b,d,x} and {c,d,x} are not in L3, so it is removed.
      {b,c,w,x}: {b,w,x} and {c,w,x} are not in L3, so it is removed.
      {p,q,r,s}: all of {p,q,r}, {p,q,s}, {p,r,s} and {q,r,s} are in L3, so it is retained.
      {p,q,r,t}: {p,r,t} and {q,r,t} are not in L3, so it is removed.
      {p,q,s,t}: {p,s,t} and {q,s,t} are not in L3, so it is removed.
July 7, 2009             Data Mining: R. Akerkar          102
Solution3 (continued)
n    Therefore,
      {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,t} and
        {p,q,s,t}
      are removed by the prune step.


n    Leaving C4 as,
      {{a,b,c,d}, {p,q,r,s}}



July 7, 2009              Data Mining: R. Akerkar      103
Exercise 4
n    Given a dataset with four attributes w, x, y
     and z, each with three values, how many
     rules can be generated with one term on
     the right-hand side?




July 7, 2009         Data Mining: R. Akerkar    104
Solution 4
 n    Let us assume that the attribute w has 3 values w1, w2,
      and w3, and similarly for x, y, and z.

 n    If we select arbitrarily attribute w to be on the right-hand
      side of each rule, there are 3 possible types of rule:
       ¨ IF…THEN w=w1
       ¨ IF…THEN w=w2
       ¨ IF…THEN w=w3


 n    Now choose one of these rules, say the first, and
      calculate how many possible left hand sides there are
      for such rules.

July 7, 2009                Data Mining: R. Akerkar             105
Solution 4 (continued)
n    The number of “attribute=value” terms on
     the LHS can be 1, 2, or 3.

n    Case I: One term on the LHS
      ¨ There   are 3 possible terms: x, y, and z. Each
          has 3 possible values, so there are 3x3=9
          possible LHS, e.g. IF x=x1.



July 7, 2009             Data Mining: R. Akerkar      106
Solution 4 (continued)
n    Case II: 2 terms on LHS
      ¨ There are 3 ways in which a combination of 2
        attributes may appear on the LHS: x and y, y
        and z, and x and z.
      ¨ Each attribute has 3 values, so for each pair
        there are 3x3=9 possible LHS, e.g. IF x=x1
        AND y=y1.
      ¨ There are 3 possible pairs of attributes, so the
        total number of possible LHS is 3x9=27.

July 7, 2009            Data Mining: R. Akerkar       107
Solution 4 (continued)
n    Case III: 3 terms on LHS
      ¨   All 3 attributes x, y and z must be on LHS.
      ¨   Each has 3 values, so 3x3x3=27 possible LHS, e.g.
           IF x=x1 AND y=y1 AND z=z1.

      ¨   Thus for each of the 3 possible “w=value” terms on the RHS, the
          total number of LHS with 1,2 or 3 terms is 9+27+27=63.

      ¨   So there are 3x63 = 189 possible rules with attribute w on the
          RHS.

      ¨   The attribute on the RHS could be any of four possibilities (not
          just w). Therefore total possible number of rules is 4x189=756.


July 7, 2009                    Data Mining: R. Akerkar                    108
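The count can also be double-checked by brute-force enumeration. A small sketch, with arbitrary attribute and value names:

from itertools import combinations, product

# Four attributes, each with three (arbitrarily named) values.
attributes = {a: [f"{a}{i}" for i in (1, 2, 3)] for a in "wxyz"}

count = 0
for rhs_attr in attributes:                       # 4 choices of RHS attribute
    for _rhs_value in attributes[rhs_attr]:       # 3 choices of RHS value
        lhs_attrs = [a for a in attributes if a != rhs_attr]
        for k in (1, 2, 3):                       # number of terms on the LHS
            for chosen in combinations(lhs_attrs, k):
                for _values in product(*(attributes[a] for a in chosen)):
                    count += 1

print(count)   # 756 = 4 x 3 x (9 + 27 + 27)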
References
n   R. Akerkar and P. Lingras. Building an Intelligent Web: Theory &
    Practice, Jones & Bartlett, 2008 (In India: Narosa Publishing House,
    2009)
n   U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy.
    Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press,
    1996
n   U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data
    Mining and Knowledge Discovery, Morgan Kaufmann, 2001
n   J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
    Kaufmann, 2001
n   D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT
    Press, 2001




July 7, 2009                             Data Mining: R. Akerkar                109

Más contenido relacionado

La actualidad más candente

Predictive Analytics - An Overview
Predictive Analytics - An OverviewPredictive Analytics - An Overview
Predictive Analytics - An OverviewMachinePulse
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
Classification
ClassificationClassification
ClassificationCloudxLab
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data AnalyticsUtkarsh Sharma
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)Learnbay Datascience
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based ClusteringSSA KPI
 
A review of machine learning based anomaly detection
A review of machine learning based anomaly detectionA review of machine learning based anomaly detection
A review of machine learning based anomaly detectionMohamed Elfadly
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsMd. Main Uddin Rony
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-Ihktripathy
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Seerat Malik
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmPınar Yahşi
 
Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Mustafa Sherazi
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data miningHadi Fadlallah
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methodsKrish_ver2
 

La actualidad más candente (20)

Predictive Analytics - An Overview
Predictive Analytics - An OverviewPredictive Analytics - An Overview
Predictive Analytics - An Overview
 
Dbscan algorithom
Dbscan algorithomDbscan algorithom
Dbscan algorithom
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
Classification
ClassificationClassification
Classification
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Classification and regression trees (cart)
Classification and regression trees (cart)Classification and regression trees (cart)
Classification and regression trees (cart)
 
Density Based Clustering
Density Based ClusteringDensity Based Clustering
Density Based Clustering
 
A review of machine learning based anomaly detection
A review of machine learning based anomaly detectionA review of machine learning based anomaly detection
A review of machine learning based anomaly detection
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 
Data Mining
Data MiningData Mining
Data Mining
 
Missing data handling
Missing data handlingMissing data handling
Missing data handling
 
Lect4 principal component analysis-I
Lect4 principal component analysis-ILect4 principal component analysis-I
Lect4 principal component analysis-I
 
Data mining
Data mining Data mining
Data mining
 
Data Mining: Data Preprocessing
Data Mining: Data PreprocessingData Mining: Data Preprocessing
Data Mining: Data Preprocessing
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
 
DBSCAN : A Clustering Algorithm
DBSCAN : A Clustering AlgorithmDBSCAN : A Clustering Algorithm
DBSCAN : A Clustering Algorithm
 
Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)Clustering in data Mining (Data Mining)
Clustering in data Mining (Data Mining)
 
Reports vs analysis
Reports vs analysisReports vs analysis
Reports vs analysis
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
 

Destacado

Data mining slides
Data mining slidesData mining slides
Data mining slidessmj
 
Why Beyoncé Is More Popular Than Me – Fairness, Diversity and Other Measures
Why Beyoncé Is More Popular Than Me – Fairness, Diversity and Other MeasuresWhy Beyoncé Is More Popular Than Me – Fairness, Diversity and Other Measures
Why Beyoncé Is More Popular Than Me – Fairness, Diversity and Other MeasuresJérôme KUNEGIS
 
A review on data mining
A  review on data miningA  review on data mining
A review on data miningEr. Nancy
 
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision Making
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision MakingFast Data Mining: Real Time Knowledge Discovery for Predictive Decision Making
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision MakingCodemotion
 
Semantic Markup
Semantic Markup Semantic Markup
Semantic Markup R A Akerkar
 
What is Big Data ?
What is Big Data ?What is Big Data ?
What is Big Data ?R A Akerkar
 
Linked open data
Linked open dataLinked open data
Linked open dataR A Akerkar
 
Description logics
Description logicsDescription logics
Description logicsR A Akerkar
 
Big data in Business Innovation
Big data in Business Innovation   Big data in Business Innovation
Big data in Business Innovation R A Akerkar
 
Knowledge Organization Systems
Knowledge Organization SystemsKnowledge Organization Systems
Knowledge Organization SystemsR A Akerkar
 
Statistical Preliminaries
Statistical PreliminariesStatistical Preliminaries
Statistical PreliminariesR A Akerkar
 
Intelligent natural language system
Intelligent natural language systemIntelligent natural language system
Intelligent natural language systemR A Akerkar
 
Big data: analyzing large data sets
Big data: analyzing large data setsBig data: analyzing large data sets
Big data: analyzing large data setsR A Akerkar
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaR A Akerkar
 
Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?R A Akerkar
 
Chapter 5 decision tree induction using frequency tables for attribute selection
Chapter 5 decision tree induction using frequency tables for attribute selectionChapter 5 decision tree induction using frequency tables for attribute selection
Chapter 5 decision tree induction using frequency tables for attribute selectionKy Hong Le
 
Your amazing brain assembly
Your amazing brain assemblyYour amazing brain assembly
Your amazing brain assemblyHighbankPrimary
 

Destacado (20)

Data mining slides
Data mining slidesData mining slides
Data mining slides
 
Data mining
Data miningData mining
Data mining
 
Data mining
Data miningData mining
Data mining
 
Big Data v Data Mining
Big Data v Data MiningBig Data v Data Mining
Big Data v Data Mining
 
Why Beyoncé Is More Popular Than Me – Fairness, Diversity and Other Measures
Why Beyoncé Is More Popular Than Me – Fairness, Diversity and Other MeasuresWhy Beyoncé Is More Popular Than Me – Fairness, Diversity and Other Measures
Why Beyoncé Is More Popular Than Me – Fairness, Diversity and Other Measures
 
A review on data mining
A  review on data miningA  review on data mining
A review on data mining
 
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision Making
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision MakingFast Data Mining: Real Time Knowledge Discovery for Predictive Decision Making
Fast Data Mining: Real Time Knowledge Discovery for Predictive Decision Making
 
Semantic Markup
Semantic Markup Semantic Markup
Semantic Markup
 
What is Big Data ?
What is Big Data ?What is Big Data ?
What is Big Data ?
 
Linked open data
Linked open dataLinked open data
Linked open data
 
Description logics
Description logicsDescription logics
Description logics
 
Big data in Business Innovation
Big data in Business Innovation   Big data in Business Innovation
Big data in Business Innovation
 
Knowledge Organization Systems
Knowledge Organization SystemsKnowledge Organization Systems
Knowledge Organization Systems
 
Statistical Preliminaries
Statistical PreliminariesStatistical Preliminaries
Statistical Preliminaries
 
Intelligent natural language system
Intelligent natural language systemIntelligent natural language system
Intelligent natural language system
 
Big data: analyzing large data sets
Big data: analyzing large data setsBig data: analyzing large data sets
Big data: analyzing large data sets
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social Media
 
Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?Can You Really Make Best Use of Big Data?
Can You Really Make Best Use of Big Data?
 
Chapter 5 decision tree induction using frequency tables for attribute selection
Chapter 5 decision tree induction using frequency tables for attribute selectionChapter 5 decision tree induction using frequency tables for attribute selection
Chapter 5 decision tree induction using frequency tables for attribute selection
 
Your amazing brain assembly
Your amazing brain assemblyYour amazing brain assembly
Your amazing brain assembly
 

Similar a Data Mining

Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an IntroductionAli Abbasi
 
`Data mining
`Data mining`Data mining
`Data miningJebin R
 
chap4_basic_classification(2).ppt
chap4_basic_classification(2).pptchap4_basic_classification(2).ppt
chap4_basic_classification(2).pptssuserfdf196
 
chap4_basic_classification.ppt
chap4_basic_classification.pptchap4_basic_classification.ppt
chap4_basic_classification.pptBantiParsaniya
 
chap1_introT Data Mining
chap1_introT Data Miningchap1_introT Data Mining
chap1_introT Data Miningssuserfbb330
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onwordSulman Ahmed
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the studyanjanishah774
 
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptlect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptDEEPAK948083
 

Similar a Data Mining (14)

Chap1 intro
Chap1 introChap1 intro
Chap1 intro
 
Data mining
Data miningData mining
Data mining
 
chap1_intro.ppt
chap1_intro.pptchap1_intro.ppt
chap1_intro.ppt
 
Data Mining: an Introduction
Data Mining: an IntroductionData Mining: an Introduction
Data Mining: an Introduction
 
`Data mining
`Data mining`Data mining
`Data mining
 
chap4_basic_classification(2).ppt
chap4_basic_classification(2).pptchap4_basic_classification(2).ppt
chap4_basic_classification(2).ppt
 
chap4_basic_classification.ppt
chap4_basic_classification.pptchap4_basic_classification.ppt
chap4_basic_classification.ppt
 
chap1_introT Data Mining
chap1_introT Data Miningchap1_introT Data Mining
chap1_introT Data Mining
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
data mining presentation power point for the study
data mining presentation power point for the studydata mining presentation power point for the study
data mining presentation power point for the study
 
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.pptlect1lect1lect1lect1lect1lect1lect1lect1.ppt
lect1lect1lect1lect1lect1lect1lect1lect1.ppt
 
lect1.ppt
lect1.pptlect1.ppt
lect1.ppt
 
Lecture2.ppt
Lecture2.pptLecture2.ppt
Lecture2.ppt
 
Data mining applications
Data mining applicationsData mining applications
Data mining applications
 

Más de R A Akerkar

Rajendraakerkar lemoproject
Rajendraakerkar lemoprojectRajendraakerkar lemoproject
Rajendraakerkar lemoprojectR A Akerkar
 
Connecting and Exploiting Big Data
Connecting and Exploiting Big DataConnecting and Exploiting Big Data
Connecting and Exploiting Big DataR A Akerkar
 
Semi structure data extraction
Semi structure data extractionSemi structure data extraction
Semi structure data extractionR A Akerkar
 
artificial intelligence
artificial intelligenceartificial intelligence
artificial intelligenceR A Akerkar
 
Case Based Reasoning
Case Based ReasoningCase Based Reasoning
Case Based ReasoningR A Akerkar
 
Rational Unified Process for User Interface Design
Rational Unified Process for User Interface DesignRational Unified Process for User Interface Design
Rational Unified Process for User Interface DesignR A Akerkar
 
Unified Modelling Language
Unified Modelling LanguageUnified Modelling Language
Unified Modelling LanguageR A Akerkar
 
Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data MiningR A Akerkar
 
Software project management
Software project managementSoftware project management
Software project managementR A Akerkar
 
Personalisation and Fuzzy Bayesian Nets
Personalisation and Fuzzy Bayesian NetsPersonalisation and Fuzzy Bayesian Nets
Personalisation and Fuzzy Bayesian NetsR A Akerkar
 
Multi-agent systems
Multi-agent systemsMulti-agent systems
Multi-agent systemsR A Akerkar
 
Human machine interface
Human machine interfaceHuman machine interface
Human machine interfaceR A Akerkar
 
Reasoning in Description Logics
Reasoning in Description Logics  Reasoning in Description Logics
Reasoning in Description Logics R A Akerkar
 
Building an Intelligent Web: Theory & Practice
Building an Intelligent Web: Theory & PracticeBuilding an Intelligent Web: Theory & Practice
Building an Intelligent Web: Theory & PracticeR A Akerkar
 
Relationship between the Semantic Web and NLP
Relationship between the Semantic Web and NLPRelationship between the Semantic Web and NLP
Relationship between the Semantic Web and NLPR A Akerkar
 

Más de R A Akerkar (18)

Rajendraakerkar lemoproject
Rajendraakerkar lemoprojectRajendraakerkar lemoproject
Rajendraakerkar lemoproject
 
Connecting and Exploiting Big Data
Connecting and Exploiting Big DataConnecting and Exploiting Big Data
Connecting and Exploiting Big Data
 
Semi structure data extraction
Semi structure data extractionSemi structure data extraction
Semi structure data extraction
 
Link analysis
Link analysisLink analysis
Link analysis
 
artificial intelligence
artificial intelligenceartificial intelligence
artificial intelligence
 
Case Based Reasoning
Case Based ReasoningCase Based Reasoning
Case Based Reasoning
 
Rational Unified Process for User Interface Design
Rational Unified Process for User Interface DesignRational Unified Process for User Interface Design
Rational Unified Process for User Interface Design
 
Unified Modelling Language
Unified Modelling LanguageUnified Modelling Language
Unified Modelling Language
 
Statistics and Data Mining
Statistics and  Data MiningStatistics and  Data Mining
Statistics and Data Mining
 
Software project management
Software project managementSoftware project management
Software project management
 
Personalisation and Fuzzy Bayesian Nets
Personalisation and Fuzzy Bayesian NetsPersonalisation and Fuzzy Bayesian Nets
Personalisation and Fuzzy Bayesian Nets
 
Neural Networks
Neural NetworksNeural Networks
Neural Networks
 
Multi-agent systems
Multi-agent systemsMulti-agent systems
Multi-agent systems
 
Human machine interface
Human machine interfaceHuman machine interface
Human machine interface
 
Reasoning in Description Logics
Reasoning in Description Logics  Reasoning in Description Logics
Reasoning in Description Logics
 
Decision tree
Decision treeDecision tree
Decision tree
 
Building an Intelligent Web: Theory & Practice
Building an Intelligent Web: Theory & PracticeBuilding an Intelligent Web: Theory & Practice
Building an Intelligent Web: Theory & Practice
 
Relationship between the Semantic Web and NLP
Relationship between the Semantic Web and NLPRelationship between the Semantic Web and NLP
Relationship between the Semantic Web and NLP
 

Último

Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxChelloAnnAsuncion2
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxDr.Ibrahim Hassaan
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTiammrhaywood
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Celine George
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 

Último (20)

Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptxGrade 9 Q4-MELC1-Active and Passive Voice.pptx
Grade 9 Q4-MELC1-Active and Passive Voice.pptx
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Gas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptxGas measurement O2,Co2,& ph) 04/2024.pptx
Gas measurement O2,Co2,& ph) 04/2024.pptx
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17Computed Fields and api Depends in the Odoo 17
Computed Fields and api Depends in the Odoo 17
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptx
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 

Data Mining

  • 2. What Is Data Mining? n Data mining (knowledge discovery from data) ¨ Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data n Is everything “data mining”? ¨ (Deductive) query processing. ¨ Expert systems or small ML/statistical programs Build computer programs that sift through databases automatically, seeking regularities or patterns July 7, 2009 Data Mining: R. Akerkar 2
  • 3. Data Mining — What’s in a Name? Information Harvesting Knowledge Mining Data Mining Knowledge Discovery in Databases Data Dredging Data Archaeology Data Pattern Processing Database Mining Knowledge Extraction Siftware The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of stored data, using pattern recognition technologies and statistical and mathematical techniques July 7, 2009 Data Mining: R. Akerkar 3
  • 4. Definition n Several Definitions ¨ Non-trivial extraction of implicit, previously unknown and potentially useful information from data ¨ Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996 July 7, 2009 Data Mining: R. Akerkar 4
  • 5. What is (not) Data Mining? lWhat is not Data l What is Data Mining? Mining? – Certain names are more – Look up phone common in certain Indian number in phone states (Joshi, Patil, directory Kulkarni… in Pune area). – Group together similar – Query a Web documents returned by search engine for search engine according to information about their context (e.g. Google “Pune” Scholar, Amazon.com,). July 7, 2009 Data Mining: R. Akerkar 5
  • 6. Origins of Data Mining n Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems n Traditional Techniques may be unsuitable due to Statistics/ Machine Learning/ ¨ Enormity of data AI Pattern ¨ High dimensionality Recognition of data Data Mining ¨ Heterogeneous, distributed nature of data Database systems July 7, 2009 Data Mining: R. Akerkar 6
  • 7. Data Mining Tasks n Prediction Methods ¨ Use some variables to predict unknown or future values of other variables. n Description Methods ¨ Find human-interpretable patterns that describe the data. July 7, 2009 Data Mining: R. Akerkar 7
  • 8. Data Mining Tasks... n Classification [Predictive] predicting an item class n Clustering [Descriptive] finding clusters in data n Association Rule Discovery [Descriptive] frequent occurring events n Deviation/Anomaly Detection [Predictive] finding changes July 7, 2009 Data Mining: R. Akerkar 8
  • 9. Classification: Definition n Given a collection of records (training set ) ¨ Each record contains a set of attributes, one of the attributes is the class. n Find a model for class attribute as a function of the values of other attributes. n Goal: previously unseen records should be assigned a class as accurately as possible. ¨A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. July 7, 2009 Data Mining: R. Akerkar 9
  • 10. Classification Example l l s rica rica ou go go inu te te nt s ca ca co clas Tid Refund Marital Taxable Refund Marital Taxable Status Income Cheat Status Income Cheat 1 Yes Single 125K No No Single 75K ? 2 No Married 100K No Yes Married 50K ? 3 No Single 70K No No Married 150K ? 4 Yes Married 120K No Yes Divorced 90K ? 5 No Divorced 95K Yes No Single 40K ? 6 No Married 60K No No Married 80K ? Test No 0 1 Set 7 Yes Divorced 220K 8 No Single 85K Yes 9 No Married 75K No Learn Training 10 No Single 90K Yes Classifier Model 10 Set July 7, 2009 Data Mining: R. Akerkar 10
  • 11. Classification: Application 1 n Direct Marketing ¨ Goal:Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. ¨ Approach: n Use the data for a similar product introduced before. n We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute. n Collect various demographic, lifestyle, and company- interaction related information about all such customers. ¨ Type of business, where they stay, how much they earn, etc. n Use this information as input attributes to learn a classifier model. From [Berry & Linoff] Data Mining Techniques, 1997 July 7, 2009 Data Mining: R. Akerkar 11
  • 12. Classification: Application 2 n Fraud Detection ¨ Goal: Predict fraudulent cases in credit card transactions. ¨ Approach: n Use credit card transactions and the information on its account-holder as attributes. ¨ When does a customer buy, what does he buy, how often he pays on time, etc n Label past transactions as fraud or fair transactions. This forms the class attribute. n Learn a model for the class of the transactions. n Use this model to detect fraud by observing credit card transactions on an account. July 7, 2009 Data Mining: R. Akerkar 12
  • 13. Classification: Application 3 n Customer Attrition/Churn: ¨ Goal: To predict whether a customer is likely to be lost to a competitor. ¨ Approach: n Use detailed record of transactions with each of the past and present customers, to find attributes. ¨ How often the customer calls, where he calls, what time- of-the day he calls most, his financial status, marital status, etc. n Label the customers as loyal or disloyal. n Find a model for loyalty. From [Berry & Linoff] Data Mining Techniques, 1997 July 7, 2009 Data Mining: R. Akerkar 13
  • 15. Introduction n A classification scheme which generates a tree and a set of rules from given data set. n The set of records available for developing classification methods is divided into two disjoint subsets – a training set and a test set. n The attributes of the records are categorise into two types: ¨ Attributes whose domain is numerical are called numerical attributes. ¨ Attributes whose domain is not numerical are called the categorical attributes. July 7, 2009 Data Mining: R. Akerkar 15
  • 16. Introduction n A decision tree is a tree with the following properties: ¨ An inner node represents an attribute. ¨ An edge represents a test on the attribute of the father node. ¨ A leaf represents one of the classes. n Construction of a decision tree ¨ Based on the training data ¨ Top-Down strategy July 7, 2009 Data Mining: R. Akerkar 16
  • 17. Decision Tree Example n The data set has five attributes. n There is a special attribute: the attribute class is the class label. n The attributes, temp (temperature) and humidity are numerical attributes n Other attributes are categorical, that is, they cannot be ordered. n Based on the training data set, we want to find a set of rules to know what values of outlook, temperature, humidity and wind, determine whether or not to play golf. July 7, 2009 Data Mining: R. Akerkar 17
  • 18. Decision Tree Example n We have five leaf nodes. n In a decision tree, each leaf node represents a rule. n We have the following rules corresponding to the tree given in Figure. n RULE 1 If it is sunny and the humidity is not above 75%, then play. n RULE 2 If it is sunny and the humidity is above 75%, then do not play. n RULE 3 If it is overcast, then play. n RULE 4 If it is rainy and not windy, then play. n RULE 5 If it is rainy and windy, then don't play. July 7, 2009 Data Mining: R. Akerkar 18
  • 19. Classification n The classification of an unknown input vector is done by traversing the tree from the root node to a leaf node. n A record enters the tree at the root node. n At the root, a test is applied to determine which child node the record will encounter next. n This process is repeated until the record arrives at a leaf node. n All the records that end up at a given leaf of the tree are classified in the same way. n There is a unique path from the root to each leaf. n The path is a rule which is used to classify the records. July 7, 2009 Data Mining: R. Akerkar 19
  • 20. n In our tree, we can carry out the classification for an unknown record as follows. n Let us assume, for the record, that we know the values of the first four attributes (but we do not know the value of class attribute) as n outlook= rain; temp = 70; humidity = 65; and windy= true. July 7, 2009 Data Mining: R. Akerkar 20
  • 21. n We start from the root node to check the value of the attribute associated at the root node. n This attribute is the splitting attribute at this node. n For a decision tree, at every node there is an attribute associated with the node called the splitting attribute. n In our example, outlook is the splitting attribute at root. n Since for the given record, outlook = rain, we move to the right-most child node of the root. n At this node, the splitting attribute is windy and we find that for the record we want classify, windy = true. n Hence, we move to the left child node to conclude that the class label Is "no play". July 7, 2009 Data Mining: R. Akerkar 21
  • 22. n The accuracy of the classifier is determined by the percentage of the test data set that is correctly classified. n We can see that for Rule 1 there are two records of the test data set satisfying outlook= sunny and humidity < 75, and only one of these is correctly classified as play. n Thus, the accuracy of this rule is 0.5 (or 50%). Similarly, the accuracy of Rule 2 is also 0.5 (or 50%). The accuracy of Rule 3 is 0.66. RULE 1 If it is sunny and the humidity is not above 75%, then play. July 7, 2009 Data Mining: R. Akerkar 22
  • 23. Concept of Categorical Attributes n Consider the following training data set. n There are three attributes, namely, age, pincode and class. n The attribute class is used for class label. The attribute age is a numeric attribute, whereas pincode is a categorical one. Though the domain of pincode is numeric, no ordering can be defined among pincode values. You cannot derive any useful information if one pin-code is greater than another pincode. July 7, 2009 Data Mining: R. Akerkar 23
  • 24. n Figure gives a decision tree for the training data. n The splitting attribute at the root is pincode and the splitting criterion here is pincode = 500 046. n Similarly, for the left child node, the splitting criterion is age < 48 At root level, we have 9 records. (the splitting attribute is age). The associated splitting criterion is pincode = 500 046. n Although the right child node As a result, we split the records has the same attribute as the into two subsets. Records 1, 2, 4, 8, splitting attribute, the splitting and 9 are to the left child note and criterion is different. remaining to the right node. The process is repeated at every node. July 7, 2009 Data Mining: R. Akerkar 24
  • 25. Advantages and Shortcomings of Decision Tree Classifications n A decision tree construction process is concerned with identifying the splitting attributes and splitting criterion at every level of the tree. n Major strengths are: ¨ Decision tree able to generate understandable rules. ¨ They are able to handle both numerical and categorical attributes. ¨ They provide clear indication of which fields are most important for prediction or classification. n Weaknesses are: ¨ The process of growing a decision tree is computationally expensive. At each node, each candidate splitting field is examined before its best split can be found. ¨ Some decision tree can only deal with binary-valued target classes. July 7, 2009 Data Mining: R. Akerkar 25
  • 26. Iterative Dichotomizer (ID3) n Quinlan (1986) n Each node corresponds to a splitting attribute n Each arc is a possible value of that attribute. n At each node the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root. n Entropy is used to measure how informative is a node. n The algorithm uses the criterion of information gain to determine the goodness of a split. ¨ The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute. July 7, 2009 Data Mining: R. Akerkar 26
  • 27. Training Dataset The class label attribute, This follows an example from Quinlan’s ID3 buys_computer, has two distinct values. age income student credit_rating buys_computer Thus there are two distinct <=30 high no fair no classes. (m =2) <=30 high no excellent no 31…40 high no fair yes Class C1 corresponds to yes >40 medium no fair yes and class C2 corresponds to no. >40 low yes fair yes >40 low yes excellent no There are 9 samples of class yes 31…40 low yes excellent yes and 5 samples of class no. <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no July 7, 2009 Data Mining: R. Akerkar 27
  • 28. Extracting Classification Rules from Trees n Represent the knowledge in the form of IF-THEN rules n One rule is created for each path from the root to a leaf n Each attribute-value pair along a path forms a conjunction n The leaf node holds the class prediction n Rules are easier for humans to What are the rules? understand July 7, 2009 Data Mining: R. Akerkar 28
  • 29. Solution (Rules) IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes” IF age = “<=30” AND credit_rating = “fair” THEN buys_computer = “no” July 7, 2009 Data Mining: R. Akerkar 29
  • 30. Algorithm for Decision Tree Induction n Basic algorithm (a greedy algorithm) ¨ Tree is constructed in a top-down recursive divide-and-conquer manner ¨ At start, all the training examples are at the root ¨ Attributes are categorical (if continuous-valued, they are discretized in advance) ¨ Examples are partitioned recursively based on selected attributes ¨ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) n Conditions for stopping partitioning ¨ All samples for a given node belong to the same class ¨ There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf ¨ There are no samples left July 7, 2009 Data Mining: R. Akerkar 30
  • 31. Attribute Selection Measure: Information Gain (ID3/C4.5) n Select the attribute with the highest information gain n S contains si tuples of class Ci for i = {1, …, m} n information measures info required to classify any arbitrary tuple m si si I( s1,s2,...,s m ) = − ∑ log 2 n i =1 s s ….information is encoded in bits. n entropy of attribute A with values {a1,a2,…,av} v s1 j + ...+ smj E(A)= ∑ I ( s1 j ,...,smj ) j =1 s n information gained by branching on attribute A Gain(A) = I(s 1, s 2 ,..., s m ) − E(A) July 7, 2009 Data Mining: R. Akerkar 31
  • 32. Entropy n Entropy measures the homogeneity (purity) of a set of examples. n It gives the information content of the set in terms of the class labels of the examples. n Consider that you have a set of examples, S with two classes, P and N. Let the set have p instances for the class P and n instances for the class N. n So the total number of instances we have is t = p + n. The view [p, n] can be seen as a class distribution of S. The entropy for S is defined as n Entropy(S) = - (p/t).log2(p/t) - (n/t).log2(n/t) n Example: Let a set of examples consists of 9 instances for class positive, and 5 instances for class negative. n Answer: p = 9 and n = 5. n So Entropy(S) = - (9/14).log2(9/14) - (5/14).log2(5/14) n = -(0.64286)(-0.6375) - (0.35714)(-1.48557) n = (0.40982) + (0.53056) n = 0.940 July 7, 2009 Data Mining: R. Akerkar 32
  • 33. Entropy The entropy for a completely pure set is 0 and is 1 for a set with equal occurrences for both the classes. i.e. Entropy[14,0] = - (14/14).log2(14/14) - (0/14).log2(0/14) = -1.log2(1) - 0.log2(0) = -1.0 - 0 =0 i.e. Entropy[7,7] = - (7/14).log2(7/14) - (7/14).log2(7/14) = - (0.5).log2(0.5) - (0.5).log2(0.5) = - (0.5).(-1) - (0.5).(-1) = 0.5 + 0.5 =1 July 7, 2009 Data Mining: R. Akerkar 33
  • 34. Attribute Selection by Information Gain Computation 5 4 g Class P: buys_computer = “yes” E ( age ) = I ( 2,3) + I ( 4, 0 ) 14 14 g Class N: buys_computer = “no” 5 g I(p, n) = I(9, 5) =0.940 + I (3, 2 ) = 0 .694 g Compute the entropy for age: 14 age pi ni I(pi, ni) 5 <=30 2 3 0.971 I ( 2 ,3 ) means “age <=30” has 14 30…40 4 0 0 5 out of 14 samples, with 2 >40 3 2 0.971 yes's and 3 no’s. Hence age income student credit_rating buys_computer <=30 <=30 high high no no fair excellent no no Gain (age ) = I ( p, n) − E (age ) = 0.246 31…40 high no fair yes >40 medium no fair yes Gain (income) = 0.029 >40 low yes fair yes >40 low yes excellent no Similarly, Gain ( student ) = 0.151 31…40 low yes excellent yes <=30 medium no fair no Gain (credit _ rating ) = 0.048 <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes Since, age has the highest information gain among 31…40 medium no excellent yes the attributes, it is selected as the test attribute. 31…40 high yes fair yes >40 July medium 7, 2009 no excellent Data Mining: R. Akerkar no 34
  • 35. Exercise 1 n The following table consists of training data from an employee database. n Let status be the class attribute. Use the ID3 algorithm to construct a decision tree from the given data. July 7, 2009 Data Mining: R. Akerkar 35
  • 36. Solution 1 July 7, 2009 Data Mining: R. Akerkar 36
  • 37. Other Attribute Selection Measures n Gini index (CART, IBM IntelligentMiner) ¨ All attributes are assumed continuous-valued ¨ Assume there exist several possible split values for each attribute ¨ May need other tools, such as clustering, to get the possible split values ¨ Can be modified for categorical attributes July 7, 2009 Data Mining: R. Akerkar 37
  • 38. Gini Index (IBM IntelligentMiner) n If a data set T contains examples from n classes, gini index, n gini(T) is defined as gini ( T ) = 1 − ∑ p 2 j j =1 where pj is the relative frequency of class j in T. n If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data contains examples from n classes, the gini index gini(T) is defined as gini split (T ) = N 1 gini (T 1) + N 2 gini (T 2 ) N N n The attribute provides the smallest ginisplit(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute). July 7, 2009 Data Mining: R. Akerkar 38
  • 39. Exercise 2 July 7, 2009 Data Mining: R. Akerkar 39
  • 40. Solution 2 n SPLIT: Age <= 50 n ---------------------- n | High | Low | Total n -------------------- n S1 (left) | 8 | 11 | 19 n S2 (right) | 11 | 10 | 21 n -------------------- n For S1: P(high) = 8/19 = 0.42 and P(low) = 11/19 = 0.58 n For S2: P(high) = 11/21 = 0.52 and P(low) = 10/21 = 0.48 n Gini(S1) = 1-[0.42x0.42 + 0.58x0.58] = 1-[0.18+0.34] = 1-0.52 = 0.48 n Gini(S2) = 1-[0.52x0.52 + 0.48x0.48] = 1-[0.27+0.23] = 1-0.5 = 0.5 n Gini-Split(Age<=50) = 19/40 x 0.48 + 21/40 x 0.5 = 0.23 + 0.26 = 0.49 n SPLIT: Salary <= 65K n ---------------------- n | High | Low | Total n -------------------- n S1 (top) | 18 | 5 | 23 n S2 (bottom) | 1 | 16 | 17 n -------------------- n For S1: P(high) = 18/23 = 0.78 and P(low) = 5/23 = 0.22 n For S2: P(high) = 1/17 = 0.06 and P(low) = 16/17 = 0.94 n Gini(S1) = 1-[0.78x0.78 + 0.22x0.22] = 1-[0.61+0.05] = 1-0.66 = 0.34 n Gini(S2) = 1-[0.06x0.06 + 0.94x0.94] = 1-[0.004+0.884] = 1-0.89 = 0.11 n Gini-Split(Age<=50) = 23/40 x 0.34 + 17/40 x 0.11 = 0.20 + 0.05 = 0.25 July 7, 2009 Data Mining: R. Akerkar 40
  • 41. Exercise 3 n In previous exercise, which is a better split of the data among the two split points? Why? July 7, 2009 Data Mining: R. Akerkar 41
  • 42. Solution 3 n Intuitively Salary <= 65K is a better split point since it produces relatively ``pure'' partitions as opposed to Age <= 50, which results in more mixed partitions (i.e., just look at the distribution of Highs and Lows in S1 and S2). n More formally, let us consider the properties of the Gini index. If a partition is totally pure, i.e., has all elements from the same class, then gini(S) = 1-[1x1+0x0] = 1-1 = 0 (for two classes). On the other hand if the classes are totally mixed, i.e., both classes have equal probability then gini(S) = 1 - [0.5x0.5 + 0.5x0.5] = 1-[0.25+0.25] = 0.5. In other words the closer the gini value is to 0, the better the partition is. Since Salary has lower gini it is a better split. July 7, 2009 Data Mining: R. Akerkar 42
  • 44. Clustering: Definition n Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that ¨ Data points in one cluster are more similar to one another. ¨ Data points in separate clusters are less similar to one another. n Similarity Measures: ¨ Euclidean Distance if attributes are continuous. ¨ Other Problem-specific Measures. July 7, 2009 Data Mining: R. Akerkar 44
  • 45. Clustering: Illustration x Euclidean Distance Based Clustering in 3-D space. Intracluster distances Intercluster distances are minimized are maximized July 7, 2009 Data Mining: R. Akerkar 45
  • 46. Clustering: Application 1 n Market Segmentation: ¨ Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. ¨ Approach: n Collect different attributes of customers based on their geographical and lifestyle related information. n Find clusters of similar customers. n Measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters. July 7, 2009 Data Mining: R. Akerkar 46
  • 47. Clustering: Application 2 n Document Clustering: ¨ Goal: To find groups of documents that are similar to each other based on the important terms appearing in them. ¨ Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. ¨ Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents. July 7, 2009 Data Mining: R. Akerkar 47
  • 49. Clustering n Clustering is the process of grouping data into clusters so that objects within a cluster have similarity in comparison to one another, but are very dissimilar to objects in other clusters. n The similarities are assessed based on the attributes values describing these objects. July 7, 2009 Data Mining: R. Akerkar 49
  • 50. The K-Means Clustering Method n Given k, the k-means algorithm is implemented in four steps: ¨ Partition objects into k nonempty subsets ¨ Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster) ¨ Assign each object to the cluster with the nearest seed point ¨ Go back to Step 2, stop when no more new assignment July 7, 2009 Data Mining: R. Akerkar 50
• 51. The K-Means Clustering Method n Example (figure): with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; then reassign objects and update the means again until the assignments no longer change. July 7, 2009 Data Mining: R. Akerkar 51
  • 52. K-Means Clustering n K-means is a partition based clustering algorithm. n K-means’ goal: Partition database D into K parts, where there is little similarity across groups, but great similarity within a group. More specifically, K-means aims to minimize the mean square error of each point in a cluster, with respect to its cluster centroid. July 7, 2009 Data Mining: R. Akerkar 52
• 53. K-Means Example n Consider the following one-dimensional database with attribute A1: {2, 4, 10, 12, 3, 20, 30, 11, 25}. n Let us use the k-means algorithm to partition this database into k = 2 clusters. We begin by choosing two random starting points, which will serve as the centroids of the two clusters: µC1 = 2 and µC2 = 4. July 7, 2009 Data Mining: R. Akerkar 53
• 54. Cluster Assignment n To form clusters, we assign each point in the database to the nearest centroid. n For instance, 10 is closer to c2 than to c1. n If a point is the same distance from two centroids, such as point 3 in our example, we make an arbitrary assignment. n Resulting assignment: C1 = {2, 3}; C2 = {4, 10, 12, 20, 30, 11, 25}. July 7, 2009 Data Mining: R. Akerkar 54
• 55. n Once all points have been assigned, we recompute the means of the clusters: µC1 = (2 + 3) / 2 = 2.5 and µC2 = (4 + 10 + 12 + 20 + 30 + 11 + 25) / 7 = 112 / 7 = 16. July 7, 2009 Data Mining: R. Akerkar 55
• 56. n We then reassign each point to the two clusters based on the new means. n Remark: point 4 now belongs to cluster C1. n The steps are repeated until the means converge to their optimal values. In each iteration, the means are re-computed and all points are reassigned. n New assignment: C1 = {2, 4, 3}; C2 = {10, 12, 20, 30, 11, 25}. July 7, 2009 Data Mining: R. Akerkar 56
• 57. n In this example, only one more iteration is needed before the means converge. We compute the new means: µC1 = (2 + 3 + 4) / 3 = 3 and µC2 = (10 + 12 + 20 + 30 + 11 + 25) / 6 = 108 / 6 = 18. Now if we reassign the points there is no change in the clusters. Hence the means have converged to their optimal values and the algorithm terminates. July 7, 2009 Data Mining: R. Akerkar 57
  • 58. Visualization of k-means algorithm July 7, 2009 Data Mining: R. Akerkar 58
  • 59. Exercise n Apply the K-means algorithm for the following 1-dimensional points (for k=2): 1; 2; 3; 4; 6; 7; 8; 9. n Use 1 and 2 as the starting centroids. July 7, 2009 Data Mining: R. Akerkar 59
• 60. Solution
Iteration #1: centroids 1 and 2 -> cluster 1: {1}, mean = 1; cluster 2: {2,3,4,6,7,8,9}, mean = 5.57
Iteration #2: centroids 1 and 5.57 -> cluster 1: {1,2,3}, mean = 2; cluster 2: {4,6,7,8,9}, mean = 6.8
Iteration #3: centroids 2 and 6.8 -> cluster 1: {1,2,3,4}, mean = 2.5; cluster 2: {6,7,8,9}, mean = 7.5
Iteration #4: centroids 2.5 and 7.5 -> cluster 1: {1,2,3,4}, mean = 2.5; cluster 2: {6,7,8,9}, mean = 7.5
The means haven't changed, so stop iterating. The final clusters are {1,2,3,4} and {6,7,8,9}.
July 7, 2009 Data Mining: R. Akerkar 60
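Assuming the same conventions as in the solution (points go to the nearest centroid, ties broken toward the first cluster), a short standalone trace of this exercise could look like the following sketch; it prints the same four iterations:

```python
points = [1, 2, 3, 4, 6, 7, 8, 9]
centroids = [1.0, 2.0]                       # the given starting centroids

while True:
    clusters = [[], []]
    for p in points:
        # nearest centroid; the first cluster wins in case of a tie
        i = 0 if abs(p - centroids[0]) <= abs(p - centroids[1]) else 1
        clusters[i].append(p)
    new_centroids = [sum(c) / len(c) for c in clusters]
    print(clusters, [round(m, 2) for m in new_centroids])
    if new_centroids == centroids:           # means unchanged -> stop iterating
        break
    centroids = new_centroids
```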
• 61. K-Means for a 2-dimensional database n Let us consider {x1, x2, x3, x4, x5} with the following coordinates as a two-dimensional sample for clustering: n x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2) n Suppose that the required number of clusters is 2. n Initially, the clusters are formed from a random distribution of the samples: n C1 = {x1, x2, x4} and C2 = {x3, x5}. July 7, 2009 Data Mining: R. Akerkar 61
• 62. Centroid Calculation n Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, …, CK}. n Each Ck has nk samples and each sample is in exactly one cluster; therefore Σk nk = N, where k = 1, …, K. n The mean vector Mk of cluster Ck is defined as the centroid of the cluster: Mk = (1/nk) Σ_{i=1..nk} xik, where xik is the ith sample belonging to cluster Ck. n In our example, the centroids of the two clusters are M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66} and M2 = {(1.5 + 5)/2, (0 + 2)/2} = {3.25, 1.00}. July 7, 2009 Data Mining: R. Akerkar 62
• 63. The Square-error of the cluster n The square-error for cluster Ck is the sum of squared Euclidean distances between each sample in Ck and its centroid: ek² = Σ_{i=1..nk} (xik – Mk)². This error is called the within-cluster variation. n The within-cluster variations, after the initial random distribution of samples, are n e1² = [(0 – 1.66)² + (2 – 0.66)²] + [(0 – 1.66)² + (0 – 0.66)²] + [(5 – 1.66)² + (0 – 0.66)²] = 19.36 n e2² = [(1.5 – 3.25)² + (0 – 1)²] + [(5 – 3.25)² + (2 – 1)²] = 8.12 July 7, 2009 Data Mining: R. Akerkar 63
• 64. Total Square-error n The square error for the entire clustering space containing K clusters is the sum of the within-cluster variations: E² = Σ_{k=1..K} ek². n The total square error is E² = e1² + e2² = 19.36 + 8.12 = 27.48. July 7, 2009 Data Mining: R. Akerkar 64
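A quick numerical check of the centroids and square errors for this example (my own sketch; the small differences from the slide's 19.36 and 27.48 come from the slide rounding the centroid to (1.66, 0.66)):

```python
def centroid(cluster):
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def within_cluster_error(cluster, m):
    # sum of squared Euclidean distances from each sample to the centroid m
    return sum(sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for x in cluster)

C1 = [(0, 2), (0, 0), (5, 0)]       # initial random partition from the slides
C2 = [(1.5, 0), (5, 2)]

M1, M2 = centroid(C1), centroid(C2)
e1, e2 = within_cluster_error(C1, M1), within_cluster_error(C2, M2)
print(M1, M2)            # approximately (1.67, 0.67) and (3.25, 1.0)
print(e1, e2, e1 + e2)   # approximately 19.33, 8.13 and 27.46
```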
• 65. n When we reassign all samples, depending on the minimum distance from centroids M1 and M2, the new redistribution of samples inside the clusters will be: n d(M1, x1) = (1.66² + 1.34²)^1/2 = 2.14 and d(M2, x1) = 3.40 ⇒ x1 ∈ C1 n d(M1, x2) = 1.79 and d(M2, x2) = 3.40 ⇒ x2 ∈ C1 n d(M1, x3) = 0.68 and d(M2, x3) = 2.01 ⇒ x3 ∈ C1 n d(M1, x4) = 3.41 and d(M2, x4) = 2.01 ⇒ x4 ∈ C2 n d(M1, x5) = 3.60 and d(M2, x5) = 2.01 ⇒ x5 ∈ C2 n The above calculation is based on the Euclidean distance formula, d(xi, xj) = [Σ_{k=1..m} (xik – xjk)²]^1/2. July 7, 2009 Data Mining: R. Akerkar 65
  • 66. n New Clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new centroids n M1 = {0.5, 0.67} n M2 = {5.0, 1.0} n The corresponding within-cluster variations and the total square error are, n e12 = 4.17 n e22 = 2.00 n E2 = 6.17 July 7, 2009 Data Mining: R. Akerkar 66
• 67. The cluster membership stabilizes… n After the first iteration, the total square error is significantly reduced (from 27.48 to 6.17). n In this example, if we analyze the distances between the new centroids and the samples, we see that in the second iteration the samples are assigned to the same clusters. n Thus there is no further reassignment and the algorithm halts. July 7, 2009 Data Mining: R. Akerkar 67
  • 68. Variations of the K-Means Method n A few variants of the k-means which differ in ¨ Selection of the initial k means ¨ Strategies to calculate cluster means n Handling categorical data: k-modes (Huang’98) ¨ Replacing means of clusters with modes ¨ Using new dissimilarity measures to deal with categorical objects ¨ Using a frequency-based method to update modes of clusters ¨ A mixture of categorical and numerical data: k-prototype method July 7, 2009 Data Mining: R. Akerkar 68
• 69. What is the problem of the k-Means Method? n The k-means algorithm is sensitive to outliers! ¨ Since an object with an extremely large value may substantially distort the distribution of the data. n K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster. (The accompanying figure contrasts a mean-based center with a medoid-based one.) July 7, 2009 Data Mining: R. Akerkar 69
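As a rough illustration of the medoid idea only (this is not the k-medoids/PAM algorithm itself), the medoid of a cluster can be taken as the member with the smallest total distance to all other members, so a single outlier cannot drag the representative point away:

```python
import math

def medoid(cluster):
    """Cluster member with the smallest total Euclidean distance to the others."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    return min(cluster, key=lambda c: sum(dist(c, x) for x in cluster))

print(medoid([(1, 1), (2, 1), (2, 2), (100, 100)]))  # (2, 2): stays among the dense points,
                                                     # whereas the mean would be pulled to ~(26, 26)
```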
  • 70. Exercise 2 n Let the set X consist of the following sample points in 2 dimensional space: n X = {(1, 2), (1.5, 2.2), (3, 2.3), (2.5,-1), (0, 1.6), (-1,1.5)} n Let c1 = (1.5, 2.5) and c2 = (3, 1) be initial estimates of centroids for X. n What are the revised values of c1 and c2 after 1 iteration of k-means clustering (k = 2)? July 7, 2009 Data Mining: R. Akerkar 70
• 71. Solution 2 n For each data point, calculate the distance to each centroid:
      x      y      d(xi, c1)   d(xi, c2)
x1    1      2      0.707107    2.236068
x2    1.5    2.2    0.3         1.920937
x3    3      2.3    1.513275    1.3
x4    2.5   -1      3.640055    2.061553
x5    0      1.6    1.749286    3.059412
x6   -1      1.5    2.692582    4.031129
July 7, 2009 Data Mining: R. Akerkar 71
  • 72. n It follows that x1, x2, x5 and x6 are closer to c1 and the other points are closer to c2. Hence replace c1 with the average of x1, x2, x5 and x6 and replace c2 with the average of x3 and x4. This gives: n c1’ = (0.375, 1.825) n c2’ = (2.75, 0.65) July 7, 2009 Data Mining: R. Akerkar 72
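A short sketch (mine, not from the slides) that reproduces both the distance table and the revised centroids:

```python
import math

X = [(1, 2), (1.5, 2.2), (3, 2.3), (2.5, -1), (0, 1.6), (-1, 1.5)]
c1, c2 = (1.5, 2.5), (3, 1)

dist = math.dist                      # Euclidean distance (Python 3.8+)
near_c1 = [x for x in X if dist(x, c1) <= dist(x, c2)]
near_c2 = [x for x in X if dist(x, c1) >  dist(x, c2)]

mean = lambda pts: tuple(sum(coord) / len(pts) for coord in zip(*pts))
print(mean(near_c1))   # (0.375, 1.825)  -> revised c1
print(mean(near_c2))   # (2.75, 0.65)    -> revised c2
```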
  • 73. Association Rule Discovery July 7, 2009 Data Mining: R. Akerkar 73
  • 74. Market-basket problem. n We are given a set of items and a large collection of transactions, which are subsets (baskets) of these items. n Task: To find relationships between the presences of various items within these baskets. n Example: To analyze customers' buying habits by finding associations between the different items that customers place in their shopping baskets. July 7, 2009 Data Mining: R. Akerkar 74
• 75. Associations discovery n Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories ¨ Associations discovery uncovers affinities amongst collections of items ¨ Affinities are represented by association rules ¨ Associations discovery is an unsupervised approach to data mining. July 7, 2009 Data Mining: R. Akerkar 75
  • 76. Association Rule : Application 2 n Supermarket shelf management. ¨ Goal: To identify items that are bought together by sufficiently many customers. ¨ Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items. ¨A classic rule -- n If a customer buys diaper and milk, then he is very likely to buy beer. n So, don’t be surprised if you find six-packs stacked next to diapers! July 7, 2009 Data Mining: R. Akerkar 76
• 77. Association Rule : Application 3 n Inventory Management: ¨ Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts to reduce the number of visits to consumer households. ¨ Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns. July 7, 2009 Data Mining: R. Akerkar 77
  • 78. What is a rule? n The rule in a rule induction system comes in the form “If this and this and this then this” n For a rule to be useful two pieces of information are needed 1. Accuracy (The lower the accuracy the closer the rule comes to random guessing) 2. Coverage (how often you can use a useful rule) n A Rule consists of two parts 1. The antecedent or the LHS 2. The consequent or the RHS. July 7, 2009 Data Mining: R. Akerkar 78
• 79. An example
Rule                                                                    | Accuracy | Coverage
If breakfast cereal purchased, then milk will be purchased              | 85%      | 20%
If bread purchased, then Swiss cheese will be purchased                 | 15%      | 6%
If 42 years old and purchased dry roasted peanuts, then beer purchased  | 95%      | 0.01%
July 7, 2009 Data Mining: R. Akerkar 79
  • 80. What is association rule mining? July 7, 2009 Data Mining: R. Akerkar 80
  • 81. Frequent Itemset July 7, 2009 Data Mining: R. Akerkar 81
  • 82. Support and Confidence July 7, 2009 Data Mining: R. Akerkar 82
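The formal definitions on this slide appear as a figure in the original. As a reminder, the support of a rule X ⇒ Y is the fraction of transactions containing X ∪ Y, and its confidence is support(X ∪ Y) / support(X). A minimal sketch with made-up baskets (the item names below are illustrative only):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs => rhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

T = [{"bread", "milk"}, {"bread", "diaper", "beer"}, {"milk", "diaper", "beer"},
     {"bread", "milk", "diaper", "beer"}, {"bread", "milk", "coke"}]

print(support({"diaper", "beer"}, T))        # 0.6
print(confidence({"diaper"}, {"beer"}, T))   # 1.0
```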
• 83. What to do with a rule? n Target the antecedent n Target the consequent n Target based on accuracy n Target based on coverage n Target based on “interestingness” n Antecedent can be one or more conditions, all of which must be true in order for the consequent to be true at the given accuracy. n Generally the consequent is just a simple condition (e.g., purchasing one grocery store item) rather than multiple items. July 7, 2009 Data Mining: R. Akerkar 83
• 84. All rules that have a certain value for the antecedent are gathered and presented to the user. • For example, the grocery store may request all rules that have nails or bolts or screws in the antecedent, and try to determine whether discontinuing sales of these lower-priced items will have any effect on the higher-margin items like hammers. • All rules that have a certain value for the consequent are gathered. This can be used to understand what affects the consequent. • For instance, it might be useful to know what rules have “coffee” in their RHS. A store owner might want to put coffee close to other items in order to increase sales of both items, or a manufacturer may determine in which magazine to place its next coupons. July 7, 2009 Data Mining: R. Akerkar 84
• 85. Sometimes accuracy is most important. Highly accurate rules (80 or 90% accuracy) imply strong relationships even if the coverage is very low. • For example, let's say a rule can only be applied one time out of 1000, but if the rule is very profitable that one time, then it can be worthwhile. This is how most successful data mining applications work in the financial markets: looking for that limited amount of time in which a very confident prediction can be made. • Sometimes users want to know the rules that are most widely applicable. By looking at rules ranked by coverage they can get a high-level view of what is happening in the database most of the time. • Rules are interesting when they have high coverage and high accuracy but deviate from the norm. Eventually a tradeoff between coverage and accuracy can be made using a measure of interestingness. July 7, 2009 Data Mining: R. Akerkar 85
  • 86. Evaluating and using rules n Look at simple statistics. n Using conjunctions and disjunctions n Defining “interestingness” n Other Heuristics July 7, 2009 Data Mining: R. Akerkar 86
  • 87. Using conjunctions and disjunctions n This dramatically increases or decreases the coverage. For example ¨ If diet soda or regular soda or beer then potato chips, covers a lot more shopping baskets than just one of the constraints by themselves. July 7, 2009 Data Mining: R. Akerkar 87
• 88. Defining “interestingness” n An interestingness measure must have 4 basic behaviors: 1. Interestingness = 0 if the rule's accuracy is equal to the background rate (the a priori probability of the RHS); then discard the rule. 2. Interestingness increases as accuracy increases if coverage is fixed. 3. Interestingness increases or decreases with coverage if accuracy stays fixed. 4. Interestingness decreases with coverage for a fixed number of correct responses. July 7, 2009 Data Mining: R. Akerkar 88
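One common way to make behavior 1 concrete (an assumption on my part; the slides do not commit to a particular formula) is lift: the rule's accuracy divided by the background probability of the consequent. Lift = 1 corresponds to interestingness 0, i.e. the rule does no better than chance.

```python
def lift(lhs, rhs, transactions):
    """Accuracy of lhs => rhs divided by the a priori probability of rhs."""
    n = len(transactions)
    sup = lambda s: sum(set(s) <= set(t) for t in transactions) / n
    accuracy = sup(set(lhs) | set(rhs)) / sup(lhs)
    return accuracy / sup(rhs)

# illustrative baskets only
T = [{"bread", "milk"}, {"bread", "beer"}, {"milk", "beer"},
     {"bread", "milk", "beer"}, {"milk"}]
print(lift({"bread"}, {"beer"}, T))   # about 1.11: slightly better than chance
```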
  • 89. Other Heuristics n Look at the actual number of records covered and not as a probability or a percentage. n Compare a given pattern to random chance. This will be an “out of the ordinary measure”. n Keep it simple July 7, 2009 Data Mining: R. Akerkar 89
  • 90. Example Here t supports items C, DM, and CO. The item DM is supported by 4 out of 6 transactions in T. Thus, the support of DM is 66.6%. July 7, 2009 Data Mining: R. Akerkar 90
  • 91. Definition July 7, 2009 Data Mining: R. Akerkar 91
• 92. Association Rules n Algorithms that obtain association rules from data usually divide the task into two parts: ¨ find the frequent itemsets and ¨ form the rules from them. July 7, 2009 Data Mining: R. Akerkar 92
  • 93. Association Rules n The problem of mining association rules can be divided into two subproblems: July 7, 2009 Data Mining: R. Akerkar 93
  • 94. Definitions July 7, 2009 Data Mining: R. Akerkar 94
  • 95. a priori algorithm n Agrawal and Srikant in 1994. n It is also called the level-wise algorithm. ¨ It is the most accepted algorithm for finding all the frequent sets. ¨ It makes use of the downward closure property. ¨ The algorithm is a bottom-up search, progressing upward level-wise in the lattice. n The interesting fact – ¨ before reading the database at every level, it prunes many of the sets, which are unlikely to be frequent sets. July 7, 2009 Data Mining: R. Akerkar 95
  • 96. a priori algorithm July 7, 2009 Data Mining: R. Akerkar 96
  • 97. a priori candidate-generation method July 7, 2009 Data Mining: R. Akerkar 97
  • 98. Pruning algorithm July 7, 2009 Data Mining: R. Akerkar 98
  • 99. a priori Algorithm July 7, 2009 Data Mining: R. Akerkar 99
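The pseudocode on the preceding slides appears as figures in the original. The sketch below is a simplified level-wise version of the same idea (my own code; the candidate generation here uses a plain union-based join rather than the prefix join described in the next exercise), returning every frequent itemset with its support count:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search for frequent itemsets (min_support given as a fraction)."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([i]) for t in transactions for i in t}   # candidate 1-itemsets
    frequent, k = {}, 1
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        survivors = {c: n for c, n in counts.items()
                     if n / len(transactions) >= min_support}   # Lk
        frequent.update(survivors)
        # generate (k+1)-candidates and prune by the downward closure property
        level = {a | b for a in survivors for b in survivors
                 if len(a | b) == k + 1
                 and all(frozenset(s) in survivors for s in combinations(a | b, k))}
        k += 1
    return frequent

# e.g. apriori([{"a","b","c"}, {"a","b"}, {"a","c"}, {"b","c"}], 0.5)
# returns all frequent 1- and 2-itemsets of this toy database
```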
• 100. Exercise 3 Suppose that L3 is the list {{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w}, {b,c,x}, {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s}, {q,r,s}} Which itemsets are placed in C4 by the join step of the Apriori algorithm? Which are then removed by the prune step? July 7, 2009 Data Mining: R. Akerkar 100
• 101. Solution 3 n At the join step of the Apriori algorithm, each member (set) is compared with every other member. n If all the elements of the two members are identical except the rightmost ones, the union of the two sets is placed into C4. n For the members of L3 given, the following sets of four elements are placed into C4: {a,b,c,d}, {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,s}, {p,q,r,t} and {p,q,s,t}. July 7, 2009 Data Mining: R. Akerkar 101
  • 102. Solution3 (continued) n At the prune step of the algorithm, each member of C4 is checked to see whether all its subsets of 3 elements are members of L3. n The result in this case is as follows: July 7, 2009 Data Mining: R. Akerkar 102
• 103. Solution 3 (continued) n Therefore, {b,c,d,w}, {b,c,d,x}, {b,c,w,x}, {p,q,r,t} and {p,q,s,t} are removed by the prune step, n leaving C4 as {{a,b,c,d}, {p,q,r,s}}. July 7, 2009 Data Mining: R. Akerkar 103
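The join and prune steps of this exercise can be checked mechanically. A sketch assuming each itemset is kept in sorted order, as the prefix join requires:

```python
from itertools import combinations

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("b","c","d"), ("b","c","w"),
      ("b","c","x"), ("p","q","r"), ("p","q","s"), ("p","q","t"), ("p","r","s"),
      ("q","r","s")]

# Join step: merge itemsets that agree on everything but their last element
joined = [a + (b[-1],) for a in L3 for b in L3
          if a[:-1] == b[:-1] and a[-1] < b[-1]]
print(joined)    # the 7 candidate 4-itemsets listed in the solution above

# Prune step: keep a candidate only if every 3-item subset is in L3
L3_set = {frozenset(s) for s in L3}
C4 = [c for c in joined if all(frozenset(s) in L3_set for s in combinations(c, 3))]
print(C4)        # [('a', 'b', 'c', 'd'), ('p', 'q', 'r', 's')]
```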
  • 104. Exercise 4 n Given a dataset with four attributes w, x, y and z, each with three values, how many rules can be generated with one term on the right-hand side? July 7, 2009 Data Mining: R. Akerkar 104
  • 105. Solution 4 n Let us assume that the attribute w has 3 values w1, w2, and w3, and similarly for x, y, and z. n If we select arbitrarily attribute w to be on the right-hand side of each rule, there are 3 possible types of rule: ¨ IF…THEN w=w1 ¨ IF…THEN w=w2 ¨ IF…THEN w=w3 n Now choose one of these rules, say the first, and calculate how many possible left hand sides there are for such rules. July 7, 2009 Data Mining: R. Akerkar 105
• 106. Solution 4 (continued) n The number of “attribute=value” terms on the LHS can be 1, 2, or 3. n Case I: One term on LHS ¨ There are 3 possible attributes: x, y, and z. Each has 3 possible values, so there are 3x3=9 possible LHS, e.g. IF x=x1. July 7, 2009 Data Mining: R. Akerkar 106
• 107. Solution 4 (continued) n Case II: 2 terms on LHS ¨ There are 3 ways in which a combination of 2 attributes may appear on the LHS: x and y, y and z, and x and z. ¨ Each attribute has 3 values, so for each pair there are 3x3=9 possible LHS, e.g. IF x=x1 AND y=y1. ¨ There are 3 possible pairs of attributes, so the total number of possible LHS is 3x9=27. July 7, 2009 Data Mining: R. Akerkar 107
  • 108. Solution 4 (continued) n Case III: 3 terms on LHS ¨ All 3 attributes x, y and z must be on LHS. ¨ Each has 3 values, so 3x3x3=27 possible LHS, e.g. IF x=x1 AND y=y1 AND z=z1. ¨ Thus for each of the 3 possible “w=value” terms on the RHS, the total number of LHS with 1,2 or 3 terms is 9+27+27=63. ¨ So there are 3x63 = 189 possible rules with attribute w on the RHS. ¨ The attribute on the RHS could be any of four possibilities (not just w). Therefore total possible number of rules is 4x189=756. July 7, 2009 Data Mining: R. Akerkar 108
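The count of 756 can be confirmed by brute-force enumeration. A sketch assuming each of the four attributes takes values 1, 2 and 3:

```python
from itertools import combinations, product

attributes = {"w": [1, 2, 3], "x": [1, 2, 3], "y": [1, 2, 3], "z": [1, 2, 3]}

count = 0
for rhs_attr, rhs_values in attributes.items():          # attribute on the RHS
    lhs_attrs = [a for a in attributes if a != rhs_attr]
    for _rhs_val in rhs_values:                           # one of its 3 values
        for r in (1, 2, 3):                               # 1, 2 or 3 terms on the LHS
            for combo in combinations(lhs_attrs, r):
                for _vals in product(*(attributes[a] for a in combo)):
                    count += 1
print(count)   # 756
```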
  • 109. References n R. Akerkar and P. Lingras. Building an Intelligent Web: Theory & Practice, Jones & Bartlett, 2008 (In India: Narosa Publishing House, 2009) n U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996 n U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001 n J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001 n D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001 July 7, 2009 Data Mining: R. Akerkar 109