DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Semester 2/2011

Lecture 5: Association Rule Mining

by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Topics
- Association rule mining
- Mining single-dimensional association rules
- Mining multilevel association rules
- Other measurements: interest and conviction
- From association rule mining to correlation analysis
What is Association Mining?
- Association rule mining:
  - Finding frequent patterns, associations, correlations, or causal
    structures among sets of items or objects in transaction databases,
    relational databases, and other information repositories.
- Applications:
  - Basket data analysis, cross-marketing, catalog design, clustering,
    classification, etc.
- Example rule form: "Antecedent → Consequent [support, confidence]"
  - buys(x, "diapers") → buys(x, "beers")             [0.5%, 60%]
  - major(x, "CS") ^ takes(x, "DB") → grade(x, "A")   [1%, 75%]
A typical example of association rule mining is market basket analysis.
Rule Measures: Support/Confidence
- Find all the rules "Antecedent(s) → Consequent(s)" with minimum support
  and confidence
  - support, s: probability that a transaction contains {A ∪ C}
  - confidence, c: conditional probability that a transaction having A
    also contains C
- Let min. sup. = 50% and min. conf. = 50%:
  - A → C (s=50%, c=66.7%)
  - C → A (s=50%, c=100%)
  - Support = 50% means that 50% of all transactions under analysis show
    that A and C are purchased together
  - Confidence = 66.7% means that 66.7% of the customers who purchased A
    also bought C
- Typically, association rules are considered interesting if they satisfy
  both a minimum support threshold and a minimum confidence threshold
- Such thresholds can be set by users or domain experts

  Transactional database:
  Transaction ID | Items Bought
  ---------------+-------------
  2000           | A, B, C
  1000           | A, C
  4000           | A, D
  5000           | B, E, F
Rule Measures: Support/Confidence (cont.)

  TransID | Items Bought        Rule: A → C
  --------+-------------
  T001    | A, B, C             support(A → C)    = P({A ∪ C})
  T002    | A, C                confidence(A → C) = P(C|A)
  T003    | A, D                                  = P({A ∪ C}) / P({A})
  T004    | B, E, F

  Frequencies: A = 3, B = 2, C = 2, AB = 1, AC = 2, BC = 1, ABC = 1

- A → B    (s = 1/4 = 25%, c = 1/3 = 33.3%)
- B → A    (s = 1/4 = 25%, c = 1/2 = 50%)
- A → C    (s = 2/4 = 50%, c = 2/3 = 66.7%)
- C → A    (s = 2/4 = 50%, c = 2/2 = 100%)
- A, B → C (s = 1/4 = 25%, c = 1/1 = 100%)
- A, C → B (s = 1/4 = 25%, c = 1/2 = 50%)
- B, C → A (s = 1/4 = 25%, c = 1/1 = 100%)

(The original slide also shows a Venn diagram: customers buying beer (A),
customers buying diapers (C), and their overlap A & C.)
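As a minimal sketch (not part of the original slides; the variable and
function names are my own), the support and confidence values above can be
reproduced in a few lines of Python:

```python
# The four transactions from the slide.
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """P(consequent | antecedent) = sup(A ∪ C) / sup(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

print(support({"A", "C"}, transactions))        # 0.5
print(confidence({"A"}, {"C"}, transactions))   # 0.666...
print(confidence({"C"}, {"A"}, transactions))   # 1.0
```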
Association Rule: Support/Confidence for Relational Tables
- In case each transaction is a row in a relational table
- Find: all rules that correlate the presence of one set of attributes
  with that of another set of attributes

  outlook  | temp. | humidity | windy | sponsor | play-time | play
  ---------+-------+----------+-------+---------+-----------+-----
  sunny    | hot   | high     | True  | Sony    | 85        | Y
  sunny    | hot   | high     | False | HP      | 90        | Y
  overcast | hot   | normal   | True  | Ford    | 63        | Y
  rainy    | mild  | high     | True  | Ford    | 5         | N
  rainy    | cool  | low      | False | HP      | 56        | Y
  sunny    | hot   | low      | True  | Sony    | 25        | N
  rainy    | cool  | normal   | True  | Nokia   | 5         | N
  overcast | mild  | high     | True  | Honda   | 86        | Y
  rainy    | mild  | low      | False | Ford    | 78        | Y
  overcast | hot   | high     | True  | Sony    | 74        | Y

Example rules:
- If temperature = hot, then humidity = high (s = 3/10, c = 3/5)
- If windy = true and play = Y, then humidity = high and
  outlook = overcast (s = 2/10, c = 2/4)
- If windy = true and play = Y and humidity = high, then
  outlook = overcast (s = 2/10, c = 2/3)
Association Rule Mining: Types
- Boolean vs. quantitative associations (based on the types of values
  handled), and single- vs. multi-dimensional rules:
  - SQLServer ^ DMBook → DBMiner [0.2%, 60%]
  - buys(x, "SQLServer") ^ buys(x, "DMBook") → buys(x, "DBMiner")
    [0.2%, 60%]
  - age(x, "30..39") ^ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
- Single-level vs. multilevel analysis
  - What brands of beers are associated with what brands of diapers?
- Various extensions
  - Max-patterns and closed itemsets
An Example (Single-dimensional Boolean Association Rule Mining)
- For rule A → C (min. support 50%, min. confidence 50%):
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.7%
- The Apriori principle: any subset of a frequent itemset must be frequent

  Transaction ID | Items Bought      Frequent Itemset | Support
  ---------------+-------------      -----------------+--------
  2000           | A, B, C           {A}              | 75%
  1000           | A, C              {B}              | 50%
  4000           | A, D              {C}              | 50%
  5000           | B, E, F           {A, C}           | 50%
Two Steps in Mining Association Rules
- A subset of a frequent itemset must also be a frequent itemset
  - i.e., if {A, B} is a frequent itemset, both {A} and {B} must be
    frequent itemsets
- Iteratively find frequent itemsets with cardinality from 1 to k
  (k-itemsets)

Step 1: Find the frequent itemsets: the sets of items that have minimum
        support
Step 2: Use the frequent itemsets to generate association rules
Find the Frequent Itemsets: The Apriori Algorithm
- Join step: Ck is generated by joining Lk-1 with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset
  of a frequent k-itemset
- Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;        // join + prune
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
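A compact Python rendering of this loop follows. It is an illustrative
sketch rather than the lecture's reference code; the function and variable
names are my own:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    items = {item for t in transactions for item in t}
    min_count = min_support * len(transactions)
    # L1: frequent 1-itemsets.
    counts = {frozenset([i]): sum(1 for t in transactions if i in t)
              for i in items}
    frequent = {c: s for c, s in counts.items() if s >= min_count}
    result = dict(frequent)
    k = 1
    while frequent:
        # Join step: unions of frequent k-itemsets that give (k+1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        frequent = {c: s for c, s in counts.items() if s >= min_count}
        result.update(frequent)
        k += 1
    return result

# The four-transaction example from the earlier slide:
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(apriori(db, 0.5))   # {A}:3, {B}:2, {C}:2, {A,C}:2
```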
The Apriori Algorithm — Example  (min. support count = 2)

  Database D
  TID | Items
  ----+----------
  100 | 1 3 4
  200 | 2 3 5
  300 | 1 2 3 5
  400 | 2 5

  C1 (after scanning D)      L1
  itemset | sup.             itemset | sup.
  --------+-----             --------+-----
  {1}     | 2                {1}     | 2
  {2}     | 3                {2}     | 3
  {3}     | 3                {3}     | 3
  {4}     | 1                {5}     | 3
  {5}     | 3

  C2 (joined from L1)        C2 (after scanning D)      L2
  itemset                    itemset | sup.             itemset | sup.
  -------                    --------+-----             --------+-----
  {1 2}                      {1 2}   | 1                {1 3}   | 2
  {1 3}                      {1 3}   | 2                {2 3}   | 2
  {1 5}                      {1 5}   | 1                {2 5}   | 3
  {2 3}                      {2 3}   | 2                {3 5}   | 2
  {2 5}                      {2 5}   | 3
  {3 5}                      {3 5}   | 2

  C3 (joined from L2)        L3 (after scanning D)
  itemset                    itemset | sup.
  -------                    --------+-----
  {2 3 5}                    {2 3 5} | 2
How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order

Step 1: self-joining Lk-1

    INSERT INTO Ck
    SELECT p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    FROM   Lk-1 p, Lk-1 q
    WHERE  p.item1 = q.item1, ...,
           p.itemk-2 = q.itemk-2,
           p.itemk-1 < q.itemk-1

Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
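The same join and prune steps in Python, a sketch under the assumption
that each itemset is kept as a sorted tuple (mirroring the "listed in an
order" precondition above; names are my own):

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """Self-join L(k-1) with itself, then prune by the Apriori property.

    L_prev: collection of frequent (k-1)-itemsets.
    Returns the candidate k-itemsets Ck.
    """
    # Sorted tuples give the join condition a fixed item order.
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    candidates = set()
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join: first k-2 items equal, last item of p < last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune: drop any candidate with an infrequent (k-1)-subset.
    frequent_prev = set(prev)
    return {c for c in candidates
            if all(s in frequent_prev for s in combinations(c, k - 1))}

L3 = [{"a","b","c"}, {"a","b","d"}, {"a","c","d"}, {"a","c","e"},
      {"b","c","d"}]
print(generate_candidates(L3, 4))
# {('a','b','c','d')}: abcd survives; acde is pruned because ade is absent
```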
Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 × L3
  - abc + abd → abcd
  - acd + ace → acde
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}
How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be very huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - Leaf nodes of the hash-tree contain lists of itemsets and counts
  - Interior nodes contain hash tables
  - Subset function: finds all the candidates contained in a transaction
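The hash-tree is the classic structure for this step. As a much simpler
(and slower) stand-in that shows what the subset function must compute,
one can enumerate each transaction's k-subsets directly; this is a sketch
for intuition, not the hash-tree method itself:

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Count, for each candidate k-itemset, how many transactions
    contain it. Brute-force stand-in for the hash-tree subset function:
    enumerate each transaction's k-subsets and look them up."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            s = frozenset(subset)
            if s in counts:
                counts[s] += 1
    return counts

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [frozenset(p) for p in [(1,2), (1,3), (1,5), (2,3), (2,5), (3,5)]]
print(count_supports(db, C2, 2))   # {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2, ...
```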
Subset Function
- Subset function: finds all the candidates contained in a transaction
  1. Generate a hash tree over the candidate set
  2. Hash each item of a transaction down the tree; candidate counts are
     accumulated at the leaves

  C2                    Database
  itemset               TID | Items
  -------               ----+----------
  {1 2}                 100 | 1 3 4
  {1 3}                 200 | 2 3 5
  {1 5}                 300 | 1 2 3 5
  {2 3}                 400 | 2 5
  {2 5}
  {3 5}

(The original slide shows the hash tree built over C2, with the per-leaf
counts accumulated after hashing the four transactions.)
Is Apriori Fast Enough? — Performance Bottlenecks
- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the
    candidate itemsets
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets:
    - 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
    - To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100},
      one needs to generate 2^100 ≈ 10^30 candidates
  - Multiple scans of the database:
    - Needs (n + 1) scans, where n is the length of the longest pattern
Mining Frequent Patterns Without Candidate Generation
- Compress a large database into a compact Frequent-Pattern tree (FP-tree)
  structure
  - highly condensed, but complete for frequent pattern mining
  - avoids costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method
  - A divide-and-conquer methodology: decompose mining tasks into
    smaller ones
  - Avoid candidate generation: sub-database tests only!
Construct FP-tree from a Transaction DB  (min_support = 0.5)

  TID | Items bought               | (ordered) frequent items
  ----+----------------------------+-------------------------
  100 | {f, a, c, d, g, i, m, p}   | {f, c, a, m, p}
  200 | {a, b, c, f, l, m, o}      | {f, c, a, b, m}
  300 | {b, f, h, j, o}            | {f, b}
  400 | {b, c, k, s, p}            | {c, b, p}
  500 | {a, f, c, e, l, p, m, n}   | {f, c, a, m, p}

Steps:
1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Order frequent items in descending frequency order
3. Scan DB again, construct the FP-tree

  Header Table            FP-tree
  Item | frequency        {}
  -----+----------        ├─ f:4
  f    | 4                │  ├─ c:3
  c    | 4                │  │  └─ a:3
  a    | 3                │  │     ├─ m:2
  b    | 3                │  │     │  └─ p:2
  m    | 3                │  │     └─ b:1
  p    | 3                │  │        └─ m:1
                          │  └─ b:1
                          └─ c:1
                             └─ b:1
                                └─ p:1
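A self-contained sketch of the two-scan construction (my own minimal
Node/header representation, not code from the lecture; ties in frequency
are broken alphabetically, so the tree shape may differ slightly from the
slide's, which is equally valid):

```python
from collections import Counter

class Node:
    """FP-tree node: item label, count, parent link, children by item."""
    def __init__(self, item, parent):
        self.item, self.count = item, 0
        self.parent, self.children = parent, {}

def build_fptree(transactions, min_count):
    """Two database scans: count items, then insert ordered frequent items."""
    # Scan 1: frequent single items, in descending frequency order.
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    order = sorted(freq, key=lambda i: (-freq[i], i))  # alphabetical ties
    rank = {item: r for r, item in enumerate(order)}

    root = Node(None, None)
    header = {item: [] for item in order}   # node-links per item
    # Scan 2: insert each transaction's frequent items in canonical order.
    for t in transactions:
        path = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"),
      set("afcelpmn")]
root, header = build_fptree(db, min_count=3)  # min_support 0.5 of 5 rows
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# f:4, c:4, a:3, b:3, m:3, p:3
```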
Mining Frequent Patterns Using the FP-tree
- General idea (divide-and-conquer)
  - Recursively grow frequent patterns using the FP-tree
- Method
  - For each item, construct its conditional pattern base, and then its
    conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path
    (a single path generates all the combinations of its sub-paths, each
    of which is a frequent pattern)
- Benefits: completeness and compactness
  - Completeness: never breaks a long pattern of any transaction and
    preserves complete information for frequent pattern mining
  - Compactness: reduces irrelevant information (infrequent items are
    gone), orders items in descending frequency (more frequent items are
    more likely to be shared), and is smaller than the original database
Step 1: From FP-tree to Conditional Pattern Base
- Start at the frequent-item header table in the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item
- Accumulate all transformed prefix paths of that item to form its
  conditional pattern base

  Conditional pattern bases
  item | cond. pattern base
  -----+-------------------
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1

(The prefix paths are read off the FP-tree shown on the previous slide.)
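A sketch of reading off conditional pattern bases, using a hand-encoded
parent-pointer form of the example FP-tree (node ids are arbitrary; this
illustrates the traversal, it is not the lecture's code):

```python
# Each node: (item, count, parent_id); id 0 is the root.
nodes = {
    0: (None, 0, None),
    1: ("f", 4, 0), 2: ("c", 3, 1), 3: ("a", 3, 2),
    4: ("m", 2, 3), 5: ("p", 2, 4), 6: ("b", 1, 3), 7: ("m", 1, 6),
    8: ("b", 1, 1),
    9: ("c", 1, 0), 10: ("b", 1, 9), 11: ("p", 1, 10),
}
# Header node-links: every node carrying a given item.
links = {"c": [2, 9], "a": [3], "b": [6, 8, 10], "m": [4, 7], "p": [5, 11]}

def conditional_pattern_base(item):
    """For each occurrence of `item`, collect its prefix path together
    with the occurrence's count (the 'transformed prefix path')."""
    base = []
    for nid in links[item]:
        count = nid and nodes[nid][1]
        path, parent = [], nodes[nid][2]
        while parent:                     # stop at the root (id 0)
            path.append(nodes[parent][0])
            parent = nodes[parent][2]
        if path:
            base.append((list(reversed(path)), count))
    return base

print(conditional_pattern_base("m"))  # [(['f','c','a'],2), (['f','c','a','b'],1)]
print(conditional_pattern_base("p"))  # [(['f','c','a','m'],2), (['c','b'],1)]
```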
Step 2: Construct Conditional FP-trees
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree over the frequent items of the pattern base

Example: m-conditional pattern base: fca:2, fcab:1

  m-conditional FP-tree:
  {}
  └─ f:3
     └─ c:3
        └─ a:3
  (b is dropped because its accumulated count in the base, 1, is below
  the minimum count of 3)

  All frequent patterns concerning m:
  m, fm, cm, am, fcm, fam, cam, fcam
Mining Frequent Patterns by Creating Conditional Pattern Bases

  Item | Conditional pattern base   | Conditional FP-tree
  -----+----------------------------+---------------------
  p    | {(fcam:2), (cb:1)}         | {(c:3)}|p
  m    | {(fca:2), (fcab:1)}        | {(f:3, c:3, a:3)}|m
  b    | {(fca:1), (f:1), (c:1)}    | Empty
  a    | {(fc:3)}                   | {(f:3, c:3)}|a
  c    | {(f:3)}                    | {(f:3)}|c
  f    | Empty                      | Empty
Step 3: Recursively Mine the Conditional FP-trees

  m-conditional FP-tree:    {} → f:3 → c:3 → a:3
  am-conditional FP-tree:   {} → f:3 → c:3
  cm-conditional FP-tree:   {} → f:3
  cam-conditional FP-tree:  {} → f:3
Single FP-tree Path Generation
- Suppose an FP-tree T has a single path P
- The complete set of frequent patterns of T can be generated by
  enumerating all the combinations of the sub-paths of P

Example: the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3
(built from the m-conditional pattern base fca:2, fcab:1), so all frequent
patterns concerning m are the combinations of its sub-paths joined with m:
m, fm, cm, am, fcm, fam, cam, fcam.
FP-growth vs. Apriori: Scalability with the Support Threshold

(Figure: run time in seconds versus support threshold in %, on data set
T25I20D10K, comparing D1 FP-growth runtime and D1 Apriori runtime.)
CHARM: Mining Closed Association Rules
- Instead of the horizontal DB format, a vertical format is used
- Instead of traditional frequent itemsets, closed frequent itemsets
  are mined

  Horizontal DB            Vertical DB
  Transaction | Items      Item | Transactions (tidset)
  ------------+------      -----+----------------------
  1           | ABDE       A    | 1345
  2           | BCE        B    | 123456
  3           | ABDE       C    | 2456
  4           | ABCE       D    | 1356
  5           | ABCDE      E    | 12345
  6           | BCD        (e.g., "1345" denotes the tidset {1,3,4,5})
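A sketch of the horizontal-to-vertical conversion, where the support of
any itemset is the size of an intersection of tidsets (illustrative code;
the names are my own):

```python
horizontal = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE",
              6: "BCD"}

def to_vertical(horizontal):
    """Map each item to its tidset: the transactions containing it."""
    vertical = {}
    for tid, items in horizontal.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

vertical = to_vertical(horizontal)
print(sorted(vertical["A"]))                 # [1, 3, 4, 5]

def tidset(itemset):
    """Tidset of an itemset = intersection of its items' tidsets."""
    return set.intersection(*(vertical[i] for i in itemset))

print(len(tidset("ABE")) / len(horizontal))  # support(ABE) = 4/6 ≈ 0.67
```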
CHARM: Frequent Itemsets and Their Supports
- The example database (vertical format) and its frequent itemsets at
  min. support = 0.5:

  Item | Trans.      Support | Itemsets
  -----+--------     --------+--------------------------------------
  A    | 1345        1.00    | B
  B    | 123456      0.83    | BE, E
  C    | 2456        0.67    | A, C, D, AB, AE, BC, BD, ABE
  D    | 1356        0.50    | AD, CE, DE, ABD, ADE, BCE, BDE, ABDE
  E    | 12345
CHARM: Closed Itemsets
- Closed frequent itemsets and their corresponding frequent itemsets:

  Closed Itemset | Tidset | Sup. | Freq. Itemsets
  ---------------+--------+------+--------------------------------
  B              | 123456 | 1.00 | B
  BE             | 12345  | 0.83 | BE, E
  ABE            | 1345   | 0.67 | ABE, AB, AE, A
  BD             | 1356   | 0.67 | BD, D
  BC             | 2456   | 0.67 | BC, C
  ABDE           | 135    | 0.50 | ABDE, ABD, ADE, BDE, AD, DE
  BCE            | 245    | 0.50 | CE, BCE
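An itemset is closed when no proper superset has the same tidset. A
brute-force check that reproduces the table above (a sketch for intuition
only; it is far less efficient than CHARM itself):

```python
from itertools import combinations

vertical = {"A": {1,3,4,5}, "B": {1,2,3,4,5,6}, "C": {2,4,5,6},
            "D": {1,3,5,6}, "E": {1,2,3,4,5}}

def tidset(itemset):
    return set.intersection(*(vertical[i] for i in itemset))

# All frequent itemsets at min. support 0.5 (count >= 3 of 6 transactions).
items = sorted(vertical)
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if len(tidset(c)) >= 3]

# Closed = no frequent proper superset shares the same tidset.
closed = [s for s in frequent
          if not any(s < t and tidset(s) == tidset(t) for t in frequent)]
print([("".join(sorted(s)), len(tidset(s))) for s in closed])
# [('B',6), ('BC',4), ('BD',4), ('BE',5), ('ABE',4), ('BCE',3), ('ABDE',3)]
```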
The CHARM Algorithm

  CHARM (I × T, minsup):
  1. Nodes = { Ij × t(Ij) : Ij ∈ I and |t(Ij)| ≥ minsup }
  2. CHARM-EXTEND (Nodes, C)

  CHARM-EXTEND (Nodes, C):
  3. for each Xi × t(Xi) in Nodes
  4.     NewN = ∅ and X = Xi
  5.     for each Xj × t(Xj) in Nodes, with f(j) > f(i)
  6.         X' = X ∪ Xj and Y = t(Xi) ∩ t(Xj)
  7.         CHARM-PROPERTY (Nodes, NewN)
  8.     if NewN ≠ ∅ then CHARM-EXTEND (NewN, C)
  9.     C = C ∪ {X}   // if X is not subsumed

  CHARM-PROPERTY (Nodes, NewN):
  1. if (|Y| ≥ minsup) then
  2.     if t(Xi) = t(Xj) then           // Property 1
  3.         Remove Xj from Nodes
  4.         Replace all Xi with X'
  5.     else if t(Xi) ⊂ t(Xj) then      // Property 2
  6.         Replace all Xi with X'
  7.     else if t(Xi) ⊃ t(Xj) then      // Property 3
  8.         Remove Xj from Nodes
  9.         Add X' × Y to NewN
  10.    else                            // Property 4: t(Xi) ≠ t(Xj)
  11.        Add X' × Y to NewN

(The original slide also shows the search tree over the example database,
with nodes such as A×1345, B×123456, AB×1345, ABE×1345, ABD×135,
ABDE×135, BC×2456, BD×1356, BE×12345, BCE×245, and BDE×135.)
Presentation of Association Rules (Table Form)

(Figure: rules listed in tabular form.)

Visualization of Association Rules Using a Plane Graph

(Figure: rules plotted on a plane graph.)

Visualization of Association Rules Using a Rule Graph

(Figure: rules drawn as a rule graph.)
Mining Multilevel Association Rules from Transactional Databases

Multiple-Level Association Rules
- Items often form a hierarchy
- Items at the lower levels are expected to have lower support
- Rules regarding itemsets at the appropriate levels could be quite useful
- A transaction database can be encoded based on dimensions and levels
- We can explore shared multi-level mining

  TID | ITEMS
  ----+------------------------------
  T1  | {1121, 1122, 1212}
  T2  | {1222, 1121, 1122, 1213}
  T3  | {1124, 1213}
  T4  | {1111, 1211, 1232, 1221, 1223}

  Item hierarchy (with encoded levels):
  Food (1)
  ├─ Milk (11)
  │  ├─ Skim (111)
  │  └─ 2% (112)
  │     ├─ Fraser (1121)
  │     └─ Sunset (1124)
  └─ Bread (12)
     ├─ Wheat (121)
     │  └─ Wonder (1213)
     └─ White (122)
        └─ Wonder (1222)
Mining Multi-Level Associations
- A top-down, progressive deepening approach:
  - First find high-level strong rules:
      milk → bread [20%, 60%]
  - Then find their lower-level "weaker" rules:
      2% milk → wheat bread [6%, 50%]
- Variations in mining multiple-level association rules:
  - Level-crossed association rules:
      2% milk → Wonder wheat bread [3%, 60%]
  - Association rules with multiple, alternative hierarchies:
      2% milk → Wonder bread [8%, 72%]
Multi-level Association: Redundancy Filtering
- Some rules may be redundant due to "ancestor" relationships
  between items
- Example:
  - milk → wheat bread [s=8%, c=70%]
  - 2% milk → wheat bread [s=2%, c=72%]
- We say the first rule is an ancestor of the second rule
- A rule is redundant if its support is close to the "expected" value,
  based on the rule's ancestor (a worked check follows below)
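As a worked check, under the assumed share that 2% milk accounts for
about one quarter of all milk sold: the expected support of
"2% milk → wheat bread" is then sup(milk → wheat bread) × 1/4
= 8% × 1/4 = 2%. The observed support is exactly 2% and the confidence
(72%) is close to the ancestor's 70%, so the second rule adds nothing
beyond its ancestor and can be filtered out as redundant.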
Multi-Level Mining: Progressive Deepening
- A top-down, progressive deepening approach:
  - First mine high-level frequent items:
      milk (15%), bread (10%)
  - Then mine their lower-level "weaker" frequent itemsets:
      2% milk (5%), wheat bread (4%)
- Different min_support thresholds across levels lead to different
  algorithms:
  - If adopting the same min_support across all levels, then toss item t
    if any of t's ancestors is infrequent
  - If adopting a reduced min_support at lower levels, then examine only
    those descendants whose ancestor's support is frequent/non-negligible
Problem of Confidence
- Example (Aggarwal & Yu, PODS98):
  - Among 5000 students:
    - 3000 play basketball
    - 3750 eat cereal
    - 2000 both play basketball and eat cereal
  - play basketball → eat cereal [40%, 66.7%] is misleading because the
    overall percentage of students eating cereal is 75%, which is higher
    than 66.7%
  - play basketball → not eat cereal [20%, 33.3%] is far more accurate,
    although it has lower support and confidence

               | basketball | not basketball | sum (row)
    cereal     |       2000 |           1750 |      3750
    not cereal |       1000 |            250 |      1250
    sum (col.) |       3000 |           2000 |      5000
Interest/Lift/Correlation
- Interest (or lift, correlation):

    lift(A, B) = P(A ∪ B) / (P(A) P(B))

  - takes both P(A) and P(B) into consideration
  - P(A ∪ B) = P(A) P(B) if A and B are independent events
  - A and B are negatively correlated if the value is less than 1;
    otherwise A and B are positively correlated

               | basketball | not basketball | sum (row)
    cereal     |       2000 |           1750 |      3750
    not cereal |       1000 |            250 |      1250
    sum (col.) |       3000 |           2000 |      5000

    lift(play basketball → eat cereal)
        = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
    lift(play basketball → not eat cereal)
        = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
Conviction
- Conviction (Brin, 1997):

    conviction(A → B) = (1 − support(B)) / (1 − confidence(A → B))

  - 0 ≤ conv(A → B) ≤ ∞
  - A and B are statistically independent if and only if conv(A → B) = 1
  - 0 < conv(A → B) < 1 if and only if P(B|A) < P(B):
    B is negatively correlated with A
  - 1 < conv(A → B) < ∞ if and only if P(B|A) > P(B):
    B is positively correlated with A

               | basketball | not basketball | sum (row)
    cereal     |       2000 |           1750 |      3750
    not cereal |       1000 |            250 |      1250
    sum (col.) |       3000 |           2000 |      5000

    conviction(play basketball → eat cereal)
        = (1 − 3750/5000) / (1 − 0.667) = 0.25/0.333 ≈ 0.75
    conviction(play basketball → not eat cereal)
        = (1 − 1250/5000) / (1 − 0.333) = 0.75/0.667 ≈ 1.125
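The lift and conviction numbers above can be checked in a few lines of
Python (a quick verification sketch; the variable names are mine):

```python
# Contingency counts for the 5000-student example.
n = 5000
basketball, cereal, both = 3000, 3750, 2000

p_a = basketball / n            # P(basketball) = 0.6
p_b = cereal / n                # P(cereal) = 0.75
p_ab = both / n                 # P(basketball and cereal) = 0.4
conf = p_ab / p_a               # confidence(basketball -> cereal) = 0.667

lift = p_ab / (p_a * p_b)               # 0.889: negatively correlated
conviction = (1 - p_b) / (1 - conf)     # 0.75: below 1, also negative

# The complementary rule: basketball -> not cereal.
p_not_b = 1 - p_b                             # 0.25
conf_not = (basketball - both) / basketball   # 1000/3000 = 0.333
lift_not = ((basketball - both) / n) / (p_a * p_not_b)  # 1.33: positive
conviction_not = (1 - p_not_b) / (1 - conf_not)         # 1.125: above 1

print(lift, conviction, lift_not, conviction_not)
```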
From Association Mining to Correlation Analysis
- Ex.: strong rules are not necessarily interesting
  - Of 10,000 transactions:
    - 6,000 customer transactions include computer games
    - 7,500 customer transactions include videos
    - 4,000 customer transactions include both computer games and videos
  - Suppose a data mining program for discovering association rules is
    run on the data, using a min_sup of 30% and a min_conf of 60%
  - The following association rule is discovered:

      buys(X, "computer games") → buys(X, "videos") [s=40%, c=66%]

      where s = 4000/10000 and c = 4000/6000

(The original slide shows a Venn diagram of the videos and games
transactions with an overlap of 4,000.)
A Misleading "Strong" Association Rule

      buys(X, "computer games") → buys(X, "videos")
              [support=40%, confidence=66%]

- This rule is misleading because the probability of purchasing videos
  is 75% (> 66%)
- In fact, computer games and videos are negatively associated because
  the purchase of one of these items actually decreases the likelihood
  of purchasing the other. Therefore, we could easily make unwise
  business decisions based on this rule
From Association Analysis to Correlation Analysis
- Goal: help filter out misleading "strong" associations
- Correlation rules:
    A → B [support, confidence, correlation]
- Lift is a simple correlation measure, given as follows:
  - The occurrence of itemset A is independent of the occurrence of
    itemset B if P(A ∪ B) = P(A)P(B); otherwise, itemsets A and B are
    dependent and correlated
  - lift(A, B) = P(A ∪ B) / (P(A)P(B)) = P(B|A) / P(B)
    = conf(A → B) / sup(B)
  - If lift(A, B) < 1, then the occurrence of A is negatively correlated
    with the occurrence of B
  - If lift(A, B) > 1, then A and B are positively correlated, meaning
    that the occurrence of one implies the occurrence of the other
From Association Analysis to Correlation Analysis (Cont.)
- Ex.: correlation analysis using lift:

      buys(X, "computer games") → buys(X, "videos")
              [support=40%, confidence=66%]

  - The lift of this rule is
    P{game, video} / (P{game} × P{video}) = 0.40 / (0.6 × 0.75) = 0.89
  - There is a negative correlation between the occurrence of {game}
    and {video}

- Ex.: is the following rule misleading?

      buy walnuts → buy milk [1%, 80%]

  - if 85% of customers buy milk: the confidence (80%) is below
    P(milk) = 85%, so lift = 0.80/0.85 ≈ 0.94 < 1; walnuts and milk are
    negatively correlated, and the rule is misleading despite its high
    confidence
Homework
- Given a transactional database, a LOG file recording each user's visits
  to web pages over a period of time, find reliable association rules.
  Assume that, as the data analyst, you may set the minimum support and
  minimum confidence yourself; explain the reasoning behind the values
  you choose, and also check whether the resulting rules are misleading,
  and if so, how to fix them.

  TID  | List of items
  -----+-------------------
  T001 | P1, P2, P3, P4
  T002 | P3, P6
  T003 | P2, P5, P1
  T004 | P5, P4, P3, P6
  T005 | P1, P3, P4, P2

(The original slide also sketches a page hierarchy over P1..P6.)
Quiz I (Feb 26, 2011, 14:00)
- Star-net Query (Multidimensional Table)
- Data Cube Computation (Memory Calculation)
- Data Preprocessing (Normalization, Smoothing by Binning)
- Association Rule Mining

Más contenido relacionado

Destacado

Focused Clustering and Outlier Detection in Large Attributed Graphs
Focused Clustering and Outlier Detection in Large Attributed GraphsFocused Clustering and Outlier Detection in Large Attributed Graphs
Focused Clustering and Outlier Detection in Large Attributed GraphsBryan Perozzi
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data WarehousingAswathy S Nair
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Salah Amean
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseYang Li
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache KylinYang Li
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Salah Amean
 
1.11.association mining 3
1.11.association mining 31.11.association mining 3
1.11.association mining 3Krish_ver2
 
Analysis of Feature Selection Algorithms (Branch & Bound and Beam search)
Analysis of Feature Selection Algorithms (Branch & Bound and Beam search)Analysis of Feature Selection Algorithms (Branch & Bound and Beam search)
Analysis of Feature Selection Algorithms (Branch & Bound and Beam search)Parinda Rajapaksha
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseHBaseCon
 

Destacado (20)

Dbm630 lecture10
Dbm630 lecture10Dbm630 lecture10
Dbm630 lecture10
 
Dbm630_lecture01
Dbm630_lecture01Dbm630_lecture01
Dbm630_lecture01
 
Dbm630 lecture07
Dbm630 lecture07Dbm630 lecture07
Dbm630 lecture07
 
Dbm630 lecture04
Dbm630 lecture04Dbm630 lecture04
Dbm630 lecture04
 
Focused Clustering and Outlier Detection in Large Attributed Graphs
Focused Clustering and Outlier Detection in Large Attributed GraphsFocused Clustering and Outlier Detection in Large Attributed Graphs
Focused Clustering and Outlier Detection in Large Attributed Graphs
 
Introduction to Data Warehousing
Introduction to Data WarehousingIntroduction to Data Warehousing
Introduction to Data Warehousing
 
Dbm630 lecture08
Dbm630 lecture08Dbm630 lecture08
Dbm630 lecture08
 
Datawarehouse and OLAP
Datawarehouse and OLAPDatawarehouse and OLAP
Datawarehouse and OLAP
 
Dbm630_lecture02-03
Dbm630_lecture02-03Dbm630_lecture02-03
Dbm630_lecture02-03
 
Dbm630 lecture09
Dbm630 lecture09Dbm630 lecture09
Dbm630 lecture09
 
Data Mining and Data Warehousing
Data Mining and Data WarehousingData Mining and Data Warehousing
Data Mining and Data Warehousing
 
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
Data Mining: Concepts and Techniques chapter 07 : Advanced Frequent Pattern M...
 
Apache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouseApache kylin 2.0: from classic olap to real-time data warehouse
Apache kylin 2.0: from classic olap to real-time data warehouse
 
Design cube in Apache Kylin
Design cube in Apache KylinDesign cube in Apache Kylin
Design cube in Apache Kylin
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Datacube
DatacubeDatacube
Datacube
 
1.11.association mining 3
1.11.association mining 31.11.association mining 3
1.11.association mining 3
 
Analysis of Feature Selection Algorithms (Branch & Bound and Beam search)
Analysis of Feature Selection Algorithms (Branch & Bound and Beam search)Analysis of Feature Selection Algorithms (Branch & Bound and Beam search)
Analysis of Feature Selection Algorithms (Branch & Bound and Beam search)
 
Apache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBaseApache Kylin’s Performance Boost from Apache HBase
Apache Kylin’s Performance Boost from Apache HBase
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 

Similar a Dbm630 lecture05

Dwh lecture slides-week15
Dwh lecture slides-week15Dwh lecture slides-week15
Dwh lecture slides-week15Shani729
 
Association Analysis in Data Mining
Association Analysis in Data MiningAssociation Analysis in Data Mining
Association Analysis in Data MiningKamal Acharya
 
Association rule mining
Association rule miningAssociation rule mining
Association rule miningAcad
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsJustin Cletus
 
Frequent Pattern Analysis, Apriori and FP Growth Algorithm
Frequent Pattern Analysis, Apriori and FP Growth AlgorithmFrequent Pattern Analysis, Apriori and FP Growth Algorithm
Frequent Pattern Analysis, Apriori and FP Growth AlgorithmShivarkarSandip
 
Lect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysisLect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysishktripathy
 
Data Mining Concepts 15061
Data Mining Concepts 15061Data Mining Concepts 15061
Data Mining Concepts 15061badirh
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Conceptsdataminers.ir
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining ConceptsDung Nguyen
 
Pert 06 association rules
Pert 06 association rulesPert 06 association rules
Pert 06 association rulesaiiniR
 
Discovering Target-Branched Declare Constraints
Discovering Target-Branched Declare ConstraintsDiscovering Target-Branched Declare Constraints
Discovering Target-Branched Declare ConstraintsClaudio Di Ciccio
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.pptneelamoberoi1030
 
The comparative study of apriori and FP-growth algorithm
The comparative study of apriori and FP-growth algorithmThe comparative study of apriori and FP-growth algorithm
The comparative study of apriori and FP-growth algorithmdeepti92pawar
 

Similar a Dbm630 lecture05 (20)

Dwh lecture slides-week15
Dwh lecture slides-week15Dwh lecture slides-week15
Dwh lecture slides-week15
 
My6asso
My6assoMy6asso
My6asso
 
Association Analysis in Data Mining
Association Analysis in Data MiningAssociation Analysis in Data Mining
Association Analysis in Data Mining
 
6asso
6asso6asso
6asso
 
Association-Analysis.pdf
Association-Analysis.pdfAssociation-Analysis.pdf
Association-Analysis.pdf
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and CorrelationsMining Frequent Patterns, Association and Correlations
Mining Frequent Patterns, Association and Correlations
 
Frequent Pattern Analysis, Apriori and FP Growth Algorithm
Frequent Pattern Analysis, Apriori and FP Growth AlgorithmFrequent Pattern Analysis, Apriori and FP Growth Algorithm
Frequent Pattern Analysis, Apriori and FP Growth Algorithm
 
Data Mining Lecture_4.pptx
Data Mining Lecture_4.pptxData Mining Lecture_4.pptx
Data Mining Lecture_4.pptx
 
06FPBasic02.pdf
06FPBasic02.pdf06FPBasic02.pdf
06FPBasic02.pdf
 
pattern mninng.ppt
pattern mninng.pptpattern mninng.ppt
pattern mninng.ppt
 
Lect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysisLect7 Association analysis to correlation analysis
Lect7 Association analysis to correlation analysis
 
Data Mining Concepts 15061
Data Mining Concepts 15061Data Mining Concepts 15061
Data Mining Concepts 15061
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
Pert 06 association rules
Pert 06 association rulesPert 06 association rules
Pert 06 association rules
 
Discovering Target-Branched Declare Constraints
Discovering Target-Branched Declare ConstraintsDiscovering Target-Branched Declare Constraints
Discovering Target-Branched Declare Constraints
 
Hiding slides
Hiding slidesHiding slides
Hiding slides
 
Data mining presentation.ppt
Data mining presentation.pptData mining presentation.ppt
Data mining presentation.ppt
 
The comparative study of apriori and FP-growth algorithm
The comparative study of apriori and FP-growth algorithmThe comparative study of apriori and FP-growth algorithm
The comparative study of apriori and FP-growth algorithm
 

Más de Tokyo Institute of Technology (11)

Lecture 4 online and offline business model generation
Lecture 4 online and offline business model generationLecture 4 online and offline business model generation
Lecture 4 online and offline business model generation
 
Lecture 4: Brand Creation
Lecture 4: Brand CreationLecture 4: Brand Creation
Lecture 4: Brand Creation
 
Lecture3 ExperientialMarketing
Lecture3 ExperientialMarketingLecture3 ExperientialMarketing
Lecture3 ExperientialMarketing
 
Lecture3 Tools and Content Creation
Lecture3 Tools and Content CreationLecture3 Tools and Content Creation
Lecture3 Tools and Content Creation
 
Lecture2: Innovation Workshop
Lecture2: Innovation WorkshopLecture2: Innovation Workshop
Lecture2: Innovation Workshop
 
Lecture0: introduction Online Marketing
Lecture0: introduction Online MarketingLecture0: introduction Online Marketing
Lecture0: introduction Online Marketing
 
Lecture2: Marketing and Social Media
Lecture2: Marketing and Social MediaLecture2: Marketing and Social Media
Lecture2: Marketing and Social Media
 
Lecture1: E-Commerce Business Model
Lecture1: E-Commerce Business ModelLecture1: E-Commerce Business Model
Lecture1: E-Commerce Business Model
 
Lecture0: Introduction Social Commerce
Lecture0: Introduction Social CommerceLecture0: Introduction Social Commerce
Lecture0: Introduction Social Commerce
 
Dbm630 lecture06
Dbm630 lecture06Dbm630 lecture06
Dbm630 lecture06
 
Coursesyllabus_dbm630
Coursesyllabus_dbm630Coursesyllabus_dbm630
Coursesyllabus_dbm630
 

Último

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Último (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher

DBM630 Lecture 05: Association Rule Mining

  • 1. DBM630: Data Mining and Data Warehousing, MS.IT. Rangsit University, Semester 2/2011. Lecture 5: Association Rule Mining, by Kritsada Sriphaew (sriphaew.k AT gmail.com)
  • 2. Topics: association rule mining; mining single-dimensional association rules; mining multilevel association rules; other measurements: interest and conviction; from association rule mining to correlation analysis.
  • 3. What is Association Mining? Association rule mining finds frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. Applications: basket data analysis, cross-marketing, catalog design, clustering, classification, etc. Rule form: "Antecedent → Consequent [support, confidence]". Examples: buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]; major(x, "CS") ^ takes(x, "DB") → grade(x, "A") [1%, 75%].
  • 4. A typical example of association rule mining is market basket analysis.
  • 5. Rule Measures: Support/Confidence. Find all rules "Antecedent(s) → Consequent(s)" with minimum support and confidence:
     support, s: the probability that a transaction contains {A ∪ C}
     confidence, c: the conditional probability that a transaction having A also contains C
    With min. sup. = 50% and min. conf. = 50%: A → C (s=50%, c=66.7%) and C → A (s=50%, c=100%). Support = 50% means that 50% of all transactions under analysis show that A and C are purchased together; confidence = 66.7% means that 66.7% of the customers who purchased A also bought C. Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold; such thresholds can be set by users or domain experts. Transactional database: TID 2000: {A,B,C}; TID 1000: {A,C}; TID 4000: {A,D}; TID 5000: {B,E,F}.
  • 6. Rule Measures: Support/Confidence. For a rule A → C:
     support(A → C) = P(A ∪ C)
     confidence(A → C) = P(C|A) = P(A ∪ C) / P(A)
    Transactions: T001: {A,B,C}; T002: {A,C}; T003: {A,D}; T004: {B,E,F}. Itemset frequencies: A=3, B=2, C=2, AB=1, AC=2, BC=1, ABC=1. (In the market-basket picture: A = customer buys beer, C = customer buys diapers.) Resulting rules:
     A → B (s = 1/4 = 25%, c = 1/3 = 33.3%)
     B → A (s = 1/4 = 25%, c = 1/2 = 50%)
     A → C (s = 2/4 = 50%, c = 2/3 = 66.7%)
     C → A (s = 2/4 = 50%, c = 2/2 = 100%)
     A, B → C (s = 1/4 = 25%, c = 1/1 = 100%)
     A, C → B (s = 1/4 = 25%, c = 1/2 = 50%)
     B, C → A (s = 1/4 = 25%, c = 1/1 = 100%)
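To make the arithmetic on slide 6 concrete, here is a minimal Python sketch (not part of the original lecture; the function names are illustrative) that computes support and confidence over the four-transaction example:

```python
# Support and confidence of a rule A -> C over a list of transactions.
transactions = [
    {"A", "B", "C"},  # T001
    {"A", "C"},       # T002
    {"A", "D"},       # T003
    {"B", "E", "F"},  # T004
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """P(consequent | antecedent) = sup(A u C) / sup(A)."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"A", "C"}, transactions))        # 0.5
print(confidence({"A"}, {"C"}, transactions))   # 0.666... (rule A -> C)
print(confidence({"C"}, {"A"}, transactions))   # 1.0      (rule C -> A)
```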
  • 7. Association Rule: Support/Confidence for Relational Tables. When each transaction is a row in a relational table, find all rules that correlate the presence of one set of attribute values with that of another set. Table (outlook, temp., humidity, windy, sponsor, play-time, play): sunny/hot/high/true/Sony/85/Y; sunny/hot/high/false/HP/90/Y; overcast/hot/normal/true/Ford/63/Y; rainy/mild/high/true/Ford/5/N; rainy/cool/low/false/HP/56/Y; sunny/hot/low/true/Sony/25/N; rainy/cool/normal/true/Nokia/5/N; overcast/mild/high/true/Honda/86/Y; rainy/mild/low/false/Ford/78/Y; overcast/hot/high/true/Sony/74/Y. Example rules:
     If temperature = hot then humidity = high (s = 3/10, c = 3/5)
     If windy = true and play = Y then humidity = high and outlook = overcast (s = 2/10, c = 2/4)
     If windy = true and play = Y and humidity = high then outlook = overcast (s = 2/10, c = 2/3)
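One common way to reuse the transactional machinery on a relational table is to turn each row into a set of attribute=value items. This is an illustrative sketch of that idea (the column subset and naming scheme are my own simplification, not the lecture's):

```python
# Each relational row becomes a transaction of "attribute=value" items,
# so single-dimensional rule mining applies unchanged.
columns = ["outlook", "temp", "humidity", "windy", "play"]
rows = [
    ("sunny", "hot", "high", True, "Y"),
    ("sunny", "hot", "high", False, "Y"),
    ("overcast", "hot", "normal", True, "Y"),
]
transactions = [{f"{c}={v}" for c, v in zip(columns, row)} for row in rows]
print(sorted(transactions[0]))
# ['humidity=high', 'outlook=sunny', 'play=Y', 'temp=hot', 'windy=True']
```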
  • 8. Association Rule Mining: Types. Boolean vs. quantitative associations (based on the types of values handled), and single-dimensional vs. multidimensional rules:
     Boolean, single-dimensional: SQLServer ^ DMBook → DBMiner [0.2%, 60%], i.e., buys(x, "SQLServer") ^ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]
     Quantitative, multidimensional: age(x, "30..39") ^ income(x, "42..48K") → buys(x, "PC") [1%, 75%]
    Single-level vs. multilevel analysis: e.g., what brands of beers are associated with what brands of diapers? Various extensions: maxpatterns and closed itemsets.
  • 9. An Example (single-dimensional Boolean association rule mining). With min. support 50% and min. confidence 50%, for rule A → C: support = support({A, C}) = 50%; confidence = support({A, C}) / support({A}) = 66.7%. The Apriori principle: any subset of a frequent itemset must be frequent. Transactions: 2000: {A,B,C}; 1000: {A,C}; 4000: {A,D}; 5000: {B,E,F}. Frequent itemsets: {A} 75%, {B} 50%, {C} 50%, {A,C} 50%.
  • 10. Two Steps in Mining Association Rules. Step 1: find the frequent itemsets, i.e., the sets of items that have minimum support; since a subset of a frequent itemset must also be a frequent itemset (if {A, B} is frequent, both {A} and {B} must be frequent), frequent itemsets can be found iteratively by cardinality from 1 to k (k-itemsets). Step 2: use the frequent itemsets to generate association rules, as sketched below.
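A minimal sketch of Step 2 (my illustration; it assumes Step 1 already produced `sup`, a map from frozenset to support): split each frequent itemset into antecedent and consequent, and keep rules meeting the minimum confidence.

```python
from itertools import combinations

def gen_rules(frequent_itemsets, sup, min_conf):
    """Emit (antecedent, consequent, support, confidence) for each
    way of splitting a frequent itemset, filtered by min_conf."""
    rules = []
    for itemset in frequent_itemsets:
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                a = frozenset(ante)
                conf = sup[itemset] / sup[a]
                if conf >= min_conf:
                    rules.append((set(a), set(itemset - a), sup[itemset], conf))
    return rules

sup = {frozenset("A"): 0.75, frozenset("C"): 0.50, frozenset("AC"): 0.50}
for a, c, s, conf in gen_rules([frozenset("AC")], sup, 0.5):
    print(a, "->", c, f"s={s:.0%}, c={conf:.1%}")
# {'A'} -> {'C'} s=50%, c=66.7%
# {'C'} -> {'A'} s=50%, c=100.0%
```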
  • 11. Find the Frequent Itemsets: the Apriori Algorithm. Join step: Ck is generated by joining Lk-1 with itself. Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
  • 12. The Apriori Algorithm: Example. Database D: TID 100: {1,3,4}; 200: {2,3,5}; 300: {1,2,3,5}; 400: {2,5}. With a minimum support count of 2:
     Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3 → L1: {1}:2, {2}:3, {3}:3, {5}:3
     Join/prune → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}; scan D → counts 1, 2, 1, 2, 3, 2 → L2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
     Join/prune → C3: {2,3,5}; scan D → L3: {2,3,5}:2
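The trace above can be reproduced with a compact Python rendition of the level-wise loop. This is a simplified sketch (candidates are generated by unioning pairs of frequent itemsets and counted with a linear scan, rather than via the hash tree discussed later):

```python
from itertools import combinations

def apriori(db, min_count):
    """Level-wise frequent itemset mining (simplified sketch)."""
    db = [frozenset(t) for t in db]
    counts = {}
    for t in db:                                   # first scan: 1-itemsets
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_count}
    all_frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:
        # join step: unions of pairs that yield (k+1)-itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune step: every k-subset of a candidate must be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        counts = {c: sum(c <= t for t in db) for c in Ck}   # scan D
        Lk = {c for c, n in counts.items() if n >= min_count}
        all_frequent.update((c, counts[c]) for c in Lk)
        k += 1
    return all_frequent

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
for s, n in sorted(apriori(db, 2).items(),
                   key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(s), n)
# Matches the slide: {1}:2 {2}:3 {3}:3 {5}:3 {1,3}:2 {2,3}:2 {2,5}:3 {3,5}:2 {2,3,5}:2
```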
  • 13. How to Generate Candidates? Suppose the items in Lk-1 are listed in some order. Step 1, self-joining Lk-1:
    INSERT INTO Ck
    SELECT p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    FROM Lk-1 p, Lk-1 q
    WHERE p.item1 = q.item1 AND ... AND p.itemk-2 = q.itemk-2 AND p.itemk-1 < q.itemk-1
    Step 2, pruning:
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
  • 14. Example of Generating Candidates. L3 = {abc, abd, acd, ace, bcd}. Self-joining L3 × L3: abc + abd → abcd; acd + ace → acde. Pruning: acde is removed because ade is not in L3. Result: C4 = {abcd}.
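A sketch of the join-and-prune procedure from slides 13 and 14, with itemsets kept as sorted tuples so the prefix join mirrors the SQL predicate (my illustration, not lecture code):

```python
from itertools import combinations

def apriori_gen(Lk_1, k):
    """Join two (k-1)-itemsets sharing their first k-2 items,
    then prune candidates with an infrequent (k-1)-subset."""
    prev = {tuple(sorted(s)) for s in Lk_1}
    Ck = set()
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:          # join step
                cand = p + (q[-1],)
                # prune step: every (k-1)-subset must be in L(k-1)
                if all(sub in prev for sub in combinations(cand, k - 1)):
                    Ck.add(cand)
    return Ck

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(apriori_gen(L3, 4))
# {('a', 'b', 'c', 'd')} -- acde is pruned because ade is not in L3
```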
  • 15. How to Count Supports of Candidates? Why is counting candidate supports a problem? The total number of candidates can be very large, and one transaction may contain many candidates. Method: candidate itemsets are stored in a hash tree; a leaf node of the hash tree contains a list of itemsets and counts, while an interior node contains a hash table. A subset function finds all the candidates contained in a transaction.
  • 16. Subset Function: finds all the candidates contained in a transaction. [Figure: (1) a hash tree is generated from C2 = {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}; (2) each item of each transaction in the database (100: 1 3 4; 200: 2 3 5; 300: 1 2 3 5; 400: 2 5) is hashed down the tree to reach the candidates that transaction contains.]
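The hash tree itself is more involved; as a simplified stand-in (an assumption of mine, not the lecture's structure), the same subset function can be sketched with a dictionary of candidates and enumeration of each transaction's k-subsets:

```python
from itertools import combinations

def count_candidates(db, Ck, k):
    """For each transaction, enumerate its k-subsets and increment
    the count of every candidate it contains."""
    counts = {c: 0 for c in Ck}
    for t in db:
        for sub in combinations(sorted(t), k):
            key = frozenset(sub)
            if key in counts:
                counts[key] += 1
    return counts

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [frozenset(p) for p in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]]
print(count_candidates(db, C2, 2))
# {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2 -- as on slide 12
```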
  • 17. Is Apriori Fast Enough? Performance Bottlenecks. The core of the Apriori algorithm uses frequent (k-1)-itemsets to generate candidate frequent k-itemsets, and uses database scans and pattern matching to collect counts for the candidates. The bottleneck is candidate generation: huge candidate sets (10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates) and multiple scans of the database (n+1 scans are needed, where n is the length of the longest pattern).
  • 18. Mining Frequent Patterns Without Candidate Generation. Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, yet complete for frequent pattern mining, and avoiding costly database scans. Then develop an efficient FP-tree-based frequent pattern mining method: a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation entirely (sub-database tests only).
  • 19. Construct the FP-tree from a Transaction DB (min_support = 0.5). Transactions and their ordered frequent items: 100: {f,a,c,d,g,i,m,p} → {f,c,a,m,p}; 200: {a,b,c,f,l,m,o} → {f,c,a,b,m}; 300: {b,f,h,j,o} → {f,b}; 400: {b,c,k,s,p} → {c,b,p}; 500: {a,f,c,e,l,p,m,n} → {f,c,a,m,p}. Steps: (1) scan the DB once and find the frequent 1-itemsets (single-item patterns); (2) order the frequent items in each transaction in descending frequency order; (3) scan the DB again and construct the FP-tree. Header table frequencies: f:4, c:4, a:3, b:3, m:3, p:3. [Figure: the resulting tree has the path {}→f:4→c:3→a:3→m:2→p:2, with branches a:3→b:1→m:1 and f:4→b:1, plus the path {}→c:1→b:1→p:1.]
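A minimal sketch of the two-scan construction (my illustration; class and variable names are assumptions, and header links are kept as plain lists rather than node-links). Note that ties in frequency may be broken arbitrarily; the slide puts f before c, while this code breaks the tie alphabetically, which yields an equally valid tree with the same item counts:

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(db, min_count):
    """Scan 1: count items, keep frequent ones.
    Scan 2: insert each transaction's frequent items in
    descending-frequency order, sharing prefixes."""
    freq = Counter(i for t in db for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    order = lambda t: sorted((i for i in t if i in freq),
                             key=lambda i: (-freq[i], i))
    root, header = Node(None, None), defaultdict(list)
    for t in db:
        node = root
        for item in order(t):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header = build_fp_tree(db, 3)   # min_support 0.5 of 5 transactions
print({i: sum(n.count for n in header[i]) for i in header})
# f:4, c:4, a:3, b:3, m:3, p:3 -- matching the header table on the slide
```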
  • 20. Mining Frequent Patterns Using the FP-tree. General idea (divide and conquer): recursively grow frequent pattern paths using the FP-tree. Method: for each item, construct its conditional pattern base and then its conditional FP-tree; repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern). Benefits: completeness (never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining) and compactness (reduces irrelevant information since infrequent items are gone, orders items in frequency-descending order so that more frequent items are more likely to be shared, and is smaller than the original database).
  • 21. Step 1: From FP-tree to Conditional Pattern Bases. Starting at the frequent-item header table of the FP-tree, traverse the tree by following the link of each frequent item and accumulate all transformed prefix paths of that item to form its conditional pattern base. Conditional pattern bases: c: f:3; a: fc:3; b: fca:1, f:1, c:1; m: fca:2, fcab:1; p: fcam:2, cb:1.
  • 22. Step 2: Construct the Conditional FP-tree. For each pattern base, accumulate the count for each item in the base, then construct the FP-tree over the frequent items of the pattern base. Example: the m-conditional pattern base is fca:2, fcab:1, giving the m-conditional FP-tree {}→f:3→c:3→a:3; all frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam.
  • 23. Mining Frequent Patterns by Creating Conditional Pattern Bases:
    Item | Conditional pattern base | Conditional FP-tree
    p | {(fcam:2), (cb:1)} | {(c:3)}|p
    m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}|m
    b | {(fca:1), (f:1), (c:1)} | empty
    a | {(fc:3)} | {(f:3, c:3)}|a
    c | {(f:3)} | {(f:3)}|c
    f | empty | empty
  • 24. Step 3: Recursively Mine the Conditional FP-trees. [Figure: the m-conditional FP-tree {}→f:3→c:3→a:3 yields the am-conditional FP-tree {}→f:3→c:3, the cm-conditional FP-tree {}→f:3, and the cam-conditional FP-tree {}→f:3.]
  • 25. Single FP-tree Path Generation. Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all combinations of the sub-paths of P. Example: from the m-conditional FP-tree {}→f:3→c:3→a:3 (built from the m-conditional pattern base fca:2, fcab:1), all frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam.
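The single-path case reduces to enumerating item combinations; here is a small sketch of that enumeration (my illustration, with assumed function and parameter names), where the count of each pattern is the minimum count along the chosen nodes:

```python
from itertools import combinations

def patterns_from_single_path(path, suffix, suffix_count):
    """path: list of (item, count) from root to leaf of a single-path
    conditional FP-tree; every combination of path nodes, suffixed
    with the conditioning item, is a frequent pattern."""
    out = {frozenset([suffix]): suffix_count}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = frozenset(i for i, _ in combo) | {suffix}
            out[items] = min(c for _, c in combo)
    return out

# m-conditional FP-tree from slide 22: {} -> f:3 -> c:3 -> a:3
for p, c in patterns_from_single_path([("f", 3), ("c", 3), ("a", 3)],
                                      "m", 3).items():
    print(sorted(p), c)
# m, fm, cm, am, fcm, fam, cam, fcam -- all with count 3
```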
  • 26. FP-growth vs. Apriori: Scalability with the Support Threshold. [Figure: run time (sec.) versus support threshold (%) on data set T25I20D10K, comparing D1 FP-growth runtime against D1 Apriori runtime; Apriori's runtime grows much faster as the support threshold decreases.]
  • 27. CHARM: Mining Closed Association Rules. Instead of the horizontal DB format, a vertical format is used, and instead of traditional frequent itemsets, closed frequent itemsets are mined. Horizontal DB: 1: ABDE; 2: BCE; 3: ABDE; 4: ABCE; 5: ABCDE; 6: BCD. Vertical DB: A: 1345; B: 123456; C: 2456; D: 1356; E: 12345.
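In the vertical format, the support of any itemset is simply the size of the intersection of its items' tidsets. A minimal sketch over the database above (my illustration):

```python
# Vertical representation: item -> set of transaction ids (tidset).
vertical = {
    "A": {1, 3, 4, 5},
    "B": {1, 2, 3, 4, 5, 6},
    "C": {2, 4, 5, 6},
    "D": {1, 3, 5, 6},
    "E": {1, 2, 3, 4, 5},
}
n = 6  # number of transactions

def tidset(itemset):
    """Tidset of an itemset = intersection of its items' tidsets."""
    return set.intersection(*(vertical[i] for i in itemset))

for itemset in ["B", "BE", "ABE", "ABDE"]:
    t = tidset(itemset)
    print(itemset, sorted(t), round(len(t) / n, 2))
# B  [1..6] 1.0;  BE [1,2,3,4,5] 0.83;  ABE [1,3,4,5] 0.67;  ABDE [1,3,5] 0.5
```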
  • 28. CHARM: Frequent Itemsets and Their Supports (min. support = 0.5, vertical DB as above):
    Support 1.00: B
    Support 0.83: BE, E
    Support 0.67: A, C, D, AB, AE, BC, BD, ABE
    Support 0.50: AD, CE, DE, ABD, ADE, BCE, BDE, ABDE
  • 29. CHARM: Closed Itemsets. Closed frequent itemsets, their tidsets, supports, and the frequent itemsets they summarize:
    B (tidset 123456, sup. 1.00): B
    BE (12345, 0.83): BE, E
    ABE (1345, 0.67): ABE, AB, AE, A
    BD (1356, 0.67): BD, D
    BC (2456, 0.67): BC, C
    ABDE (135, 0.50): ABDE, ABD, ADE, BDE, AD, DE
    BCE (245, 0.50): CE, BCE
  • 30. The CHARM Algorithm.
    CHARM(I × T, minsup):
        Nodes = { Ij × t(Ij) : Ij ∈ I, |t(Ij)| ≥ minsup }
        CHARM-EXTEND(Nodes, C)
    CHARM-EXTEND(Nodes, C):
        for each Xi × t(Xi) in Nodes:
            NewN = ∅; X = Xi
            for each Xj × t(Xj) in Nodes, with f(j) > f(i):
                X' = X ∪ Xj; Y = t(Xi) ∩ t(Xj)
                CHARM-PROPERTY(Nodes, NewN)
            if NewN ≠ ∅ then CHARM-EXTEND(NewN, C)
            C = C ∪ {X}   // if X is not subsumed
    CHARM-PROPERTY(Nodes, NewN):
        if |Y| ≥ minsup then
            if t(Xi) = t(Xj) then              // Property 1
                remove Xj from Nodes; replace all Xi with X'
            else if t(Xi) ⊂ t(Xj) then         // Property 2
                replace all Xi with X'
            else if t(Xi) ⊃ t(Xj) then         // Property 3
                remove Xj from Nodes; add X' × Y to NewN
            else                               // Property 4: tidsets incomparable
                add X' × Y to NewN
    [Figure: the IT-tree search from ∅ over A×1345, B×123456, C×2456, D×1356, E×12345, extending to AB×1345, ABE×1345, ABC×45, ABD×135, BC×2456, BD×1356, BE×12345, ABDE×135, BCD×56, BCE×245, BDE×135.]
  • 31. Presentation of Association Rules (Table Form). [Figure: mined rules listed in a table.]
  • 32. Visualization of Association Rules Using a Plane Graph. [Figure.]
  • 33. Visualization of Association Rules Using a Rule Graph. [Figure.]
  • 34. Mining Multilevel Association Rules from Transactional Databases. Items often form a hierarchy, and items at lower levels are expected to have lower support; rules regarding itemsets at the appropriate levels can be quite useful. The transaction database can be encoded based on dimensions and levels, and we can explore shared multi-level mining. Concept hierarchy: Food (1) → Milk (11), Bread (12); Milk → Skim (111), 2% (112); Bread → Wheat (121), White (122); brands include Fraser (1121), Sunset (1124), Wonder wheat (1213), and Wonder white (1222). Encoded transactions: T1: {1121, 1122, 1212}; T2: {1222, 1121, 1122, 1213}; T3: {1124, 1213}; T4: {1111, 1211, 1232, 1221, 1223}.
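With this digit encoding, every prefix of an item code is one of its ancestors, so a transaction can be expanded to all hierarchy levels before mining. A small illustrative sketch of that expansion (my own, not the lecture's algorithm):

```python
# '1' = food, '11' = milk, '112' = 2% milk, '1121' = a brand:
# each prefix of an encoded item is an ancestor in the hierarchy.
def ancestors(code):
    return [code[:k] for k in range(1, len(code) + 1)]

def expand(transaction):
    """Add all ancestor items, enabling shared multi-level mining."""
    return sorted({a for item in transaction for a in ancestors(item)})

print(expand(["1121", "1122", "1212"]))   # transaction T1
# ['1', '11', '112', '1121', '1122', '12', '121', '1212']
```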
  • 35. Mining Multi-Level Associations. A top-down, progressive deepening approach: first find high-level strong rules, e.g., milk → bread [20%, 60%]; then find their lower-level, "weaker" rules, e.g., 2% milk → wheat bread [6%, 50%]. Variations on mining multiple-level association rules: level-crossed association rules, e.g., 2% milk → Wonder wheat bread [3%, 60%], and association rules with multiple, alternative hierarchies, e.g., 2% milk → Wonder bread [8%, 72%].
  • 36. Multi-Level Association: Redundancy Filtering. Some rules may be redundant due to "ancestor" relationships between items. Example: milk → wheat bread [s=8%, c=70%] versus 2% milk → wheat bread [s=2%, c=72%]. We say the first rule is an ancestor of the second. A rule is redundant if its support is close to the "expected" value derived from the rule's ancestor.
  • 37. Multi-Level Mining: Progressive Deepening. A top-down, progressive deepening approach: first mine high-level frequent items, e.g., milk (15%), bread (10%); then mine their lower-level, "weaker" frequent itemsets, e.g., 2% milk (5%), wheat bread (4%). Different min_support thresholds across levels lead to different algorithms: if the same min_support is adopted across all levels, then toss item t if any of t's ancestors is infrequent; if a reduced min_support is adopted at lower levels, then examine only those descendants whose ancestor's support is frequent or non-negligible.
  • 38. The Problem of Confidence. Example (Aggarwal & Yu, PODS98): among 5000 students, 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal. The rule play basketball → eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%. The rule play basketball → not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence. Contingency table:
    | basketball | not basketball | sum(row)
    cereal | 2000 | 1750 | 3750
    not cereal | 1000 | 250 | 1250
    sum(col.) | 3000 | 2000 | 5000
  • 39. Interest/Lift/Correlation. Interest (or lift, correlation) takes both P(A) and P(B) into consideration: lift = P(A ∪ B) / (P(A) P(B)). P(A ∪ B) = P(A) P(B) if A and B are independent events; A and B are negatively correlated if the value is less than 1, and positively correlated otherwise. Using the table above:
     lift(play basketball → eat cereal) = (2000/5000) / ((3000/5000)(3750/5000)) = 0.89
     lift(play basketball → not eat cereal) = (1000/5000) / ((3000/5000)(1250/5000)) = 1.33
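The two lift values on this slide can be checked directly (a minimal sketch using the counts from the contingency table on slide 38):

```python
# Lift for the basketball/cereal example.
n = 5000
p_basket = 3000 / n
p_cereal = 3750 / n
p_both   = 2000 / n

lift_cereal = p_both / (p_basket * p_cereal)              # < 1: negative
lift_not_cereal = (1000 / n) / (p_basket * (1250 / n))    # > 1: positive
print(round(lift_cereal, 3), round(lift_not_cereal, 3))   # 0.889 1.333
```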
  • 40. Conviction (Brin, 1997). conviction(A → B) = (1 − support(B)) / (1 − confidence(A → B)), with 0 ≤ conv(A → B) ≤ ∞. A and B are statistically independent if and only if conv(A → B) = 1; 0 < conv(A → B) < 1 if and only if P(B|A) < P(B), i.e., B is negatively correlated with A; 1 < conv(A → B) < ∞ if and only if P(B|A) > P(B), i.e., B is positively correlated with A. Using the same table:
     conviction(play basketball → eat cereal) = (1 − 3750/5000) / (1 − 0.667) = 0.25 / 0.333 ≈ 0.75
     conviction(play basketball → not eat cereal) = (1 − 1250/5000) / (1 − 0.333) = 0.75 / 0.667 ≈ 1.125
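A small sketch of the conviction formula applied to the same example (my illustration; the function name is assumed):

```python
# conv(A -> B) = (1 - sup(B)) / (1 - conf(A -> B)), infinite when conf = 1.
def conviction(sup_b, conf_ab):
    return float("inf") if conf_ab == 1 else (1 - sup_b) / (1 - conf_ab)

# basketball -> cereal: sup(cereal) = 0.75, conf = 2000/3000
print(round(conviction(0.75, 2000 / 3000), 3))   # 0.75  (< 1: negative)
# basketball -> not cereal: sup(not cereal) = 0.25, conf = 1000/3000
print(round(conviction(0.25, 1000 / 3000), 3))   # 1.125 (> 1: positive)
```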
  • 41. From Association Mining to Correlation Analysis. Strong rules are not necessarily interesting. Of 10,000 transactions: 6,000 customer transactions include computer games, 7,500 include videos, and 4,000 include both computer games and videos. Suppose an association rule mining program is run on this data with min_sup of 30% and min_conf of 60%. The following association rule is discovered: buys(X, "computer games") → buys(X, "videos") [s = 4000/10000 = 40%, c = 4000/6000 = 66%].
  • 42. A Misleading "Strong" Association Rule. buys(X, "computer games") → buys(X, "videos") [support=40%, confidence=66%]. This rule is misleading because the probability of purchasing videos is 75% (> 66%). In fact, computer games and videos are negatively associated: the purchase of one of these items actually decreases the likelihood of purchasing the other. We could therefore easily make unwise business decisions based on this rule.
  • 43. From Association Analysis to Correlation Analysis. To help filter out misleading "strong" associations, use correlation rules: A → B [support, confidence, correlation]. Lift is a simple correlation measure: the occurrence of itemset A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise itemsets A and B are dependent and correlated. lift(A,B) = P(A ∪ B) / (P(A)P(B)) = P(B|A) / P(B) = conf(A → B) / sup(B). If lift(A,B) < 1, the occurrence of A is negatively correlated with the occurrence of B; if lift(A,B) > 1, A and B are positively correlated, meaning that the occurrence of one implies the occurrence of the other.
  • 44. From Association Analysis to Correlation Analysis (cont.). Correlation analysis using lift: for buys(X, "computer games") → buys(X, "videos") [support=40%, confidence=66%], the lift is P{game, video} / (P{game} × P{video}) = 0.40 / (0.6 × 0.75) = 0.89, so there is a negative correlation between the occurrence of {game} and {video}. Exercise: is the following rule misleading? Buy walnuts → Buy milk [1%, 80%], given that 85% of customers buy milk.
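The walnut exercise can be checked with the lift identity lift = conf(A → B) / sup(B) from slide 43; a one-line sketch (my computation, not the slide's answer):

```python
# Buy walnuts -> Buy milk: conf = 80%, sup(milk) = 85%.
lift = 0.80 / 0.85
print(round(lift, 3))
# 0.941 < 1: despite the 80% confidence, buying walnuts is slightly
# negatively correlated with buying milk, so the rule is misleading.
```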
  • 45. Homework. Given a transactional database, a log file recording each user's web page visits over a period of time, find trustworthy association rules. Acting as the data analyst, you may set the minimum support and minimum confidence yourself; explain the reasoning behind the values you chose, and check whether the resulting rules are misleading; if any are, explain how you would fix them. Transactions: T001: {P1, P2, P3, P4}; T002: {P3, P6}; T003: {P2, P5, P1}; T004: {P5, P4, P3, P6}; T005: {P1, P3, P4, P2}. [Figure: a concept hierarchy over the pages P1, P2, P4, P6, P3, P5.]
  • 46. Feb 26, 2011 (14:00): Quiz I, covering star-net queries (multidimensional tables), data cube computation (memory calculation), data preprocessing (normalization, smoothing by binning), and association rule mining.