2. Data Mining:
Data mining is the efficient discovery of valuable, non-obvious information from a large collection of data.
3. Most important concepts in Data Mining
Itemset & frequent itemset
Market-basket model: each basket (transaction) is the set of items bought together.
Frequent itemset: an itemset whose support count is at least a minimum support threshold (min_sup).
5. Association Rule:
Example domain: a medical diagnosis dataset relating symptoms to illnesses.
A rule is defined as an implication of the form X ⇒ Y, where X, Y ⊆ I (the set of items). In other words, {i1, i2, …, ik} ⇒ j means: if a basket contains all of i1, …, ik, then it is likely to also contain j.
The probability of finding Y, required for us to accept this rule, is called the confidence of the rule:
conf(X ⇒ Y) = supp(X ∪ Y) / supp(X)
Example: {m, b} ⇒ c, confidence = 2/4 = 50%.
Thus association mining is a two-step process:
1) Find all frequent itemsets.
2) Generate strong association rules from the frequent itemsets.
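As a minimal sketch, the support and confidence calculations can be written in a few lines of Python. The toy baskets below are hypothetical, chosen only so that the {m, b} ⇒ c example works out to 2/4 = 50%; the helper names supp and conf are this sketch's own.

```python
# Support and confidence over a toy set of market baskets.
# The baskets are hypothetical, chosen so that supp({m, b}) = 4
# and supp({m, b, c}) = 2, giving conf({m, b} => c) = 2/4 = 50%.
baskets = [
    {"m", "b", "c"},
    {"m", "b"},
    {"m", "b", "c", "j"},
    {"m", "b", "j"},
    {"c", "j"},
]

def supp(itemset, baskets):
    """Support count: number of baskets containing every item in itemset."""
    return sum(1 for b in baskets if itemset <= b)

def conf(x, y, baskets):
    """Confidence of the rule X => Y: supp(X u Y) / supp(X)."""
    return supp(x | y, baskets) / supp(x, baskets)

print(conf({"m", "b"}, {"c"}, baskets))  # 0.5, i.e. 50%
```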
6. The Apriori algorithm
Mines frequent itemsets for Boolean association rules.
Uses prior knowledge of frequent itemset properties (hence the name).
Iterative approach known as level-wise search: frequent k-itemsets are used to explore (k+1)-itemsets.
One full scan of the database is required to find each Lk: L1 contains the items meeting minimum support, L1 is used to generate the 2-itemset candidates for L2, and so on.
7. Two steps:
Join
To find Lk, a set Ck of candidate k-itemsets is generated by joining Lk-1 with itself.
Prune
To reduce the size of Ck, the Apriori property is used: if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either, so it can be removed from Ck (subset testing).
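A minimal Python sketch of the join and prune steps, assuming itemsets are represented as frozensets; the function name apriori_gen follows common usage, but the details are this sketch's own:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    Join: two (k-1)-itemsets are joined if their first k-2 items
    (in sorted order) agree.  Prune: a candidate is dropped if any
    of its (k-1)-subsets is not in L(k-1) -- the Apriori property.
    """
    L_sorted = [sorted(s) for s in L_prev]
    candidates = set()
    for a in L_sorted:
        for b in L_sorted:
            # Join step: same (k-2)-prefix, distinct last items.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.add(frozenset(a) | frozenset(b))
    # Prune step: subset testing against L(k-1).
    return {
        c for c in candidates
        if all(frozenset(s) in L_prev for s in combinations(c, k - 1))
    }

# Example with the L2 from the worked trace on a later slide:
L2 = {frozenset(s) for s in [{"I1","I2"}, {"I1","I3"}, {"I1","I5"},
                             {"I2","I3"}, {"I2","I4"}, {"I2","I5"}]}
print(apriori_gen(L2, 3))  # only {I1,I2,I3} and {I1,I2,I5} survive the prune
```

The join matches pairs sharing their first k-2 items, which generates exactly the six-element C3 shown in the worked trace; the prune then removes every candidate with an infrequent 2-subset.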
Prepared By- Mr.Nilesh Magar
11. Scan D for the count of each candidate.
C1: I1 – 6, I2 – 7, I3 – 6, I4 – 2, I5 – 2
Compare each candidate's support count with the minimum support count (min_sup = 2).
L1: I1 – 6, I2 – 7, I3 – 6, I4 – 2, I5 – 2
Generate C2 candidates from L1 and scan D for the count of each candidate.
C2: {I1, I2} – 4, {I1, I3} – 4, {I1, I4} – 1, …
Compare each candidate's support count with the minimum support count.
L2: {I1, I2} – 4, {I1, I3} – 4, {I1, I5} – 2, {I2, I3} – 4, {I2, I4} – 2, {I2, I5} – 2
Generate C3 candidates from L2 using the join and prune steps:
Join: C3 = L2 × L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}
Prune: C3 = {{I1, I2, I3}, {I1, I2, I5}}
Scan D for the count of each candidate.
C3: {I1, I2, I3} – 2, {I1, I2, I5} – 2
Compare each candidate's support count with the minimum support count.
L3: {I1, I2, I3} – 2, {I1, I2, I5} – 2
Generate C4 candidates from L3.
C4 = L3 × L3 = {{I1, I2, I3, I5}}
This itemset is pruned because its subset {I2, I3, I5} is not frequent, so C4 = ∅.
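The whole trace can be reproduced end to end. Below is a sketch assuming the nine-transaction database implied by the vertical-format table on a later slide (T100–T900); it uses a simpler join (any union of two frequent (k-1)-itemsets that has size k), relying on the prune step and the support scan to discard the extra candidates.

```python
from itertools import combinations

# Transaction database D, reconstructed from the vertical-format slide
# (itemset -> TID_set) later in the deck.
D = {
    "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"},
    "T300": {"I2", "I3"},       "T400": {"I1", "I2", "I4"},
    "T500": {"I1", "I3"},       "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},       "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}
MIN_SUP = 2

def support(itemset):
    """Number of transactions in D containing every item in itemset."""
    return sum(1 for t in D.values() if itemset <= t)

def apriori():
    # L1: frequent 1-itemsets.
    items = {i for t in D.values() for i in t}
    L = {frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUP}
    k = 2
    while L:
        print(f"L{k-1}:", {tuple(sorted(s)): support(s) for s in L})
        # Join: unions of two frequent (k-1)-itemsets that form a k-itemset.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Scan D and keep candidates meeting minimum support.
        L = {c for c in C if support(c) >= MIN_SUP}
        k += 1

apriori()  # prints L1, L2, L3 matching the counts in the trace above
```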
12. Generating association rules from frequent itemsets (from slide 5):
Find the frequent itemsets from the transactions in a database D.
Generate strong association rules:
confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
support_count(A ∪ B) – the number of transactions containing the itemset A ∪ B
support_count(A) – the number of transactions containing the itemset A
13. Example:
Let l = {I1, I2, I5}.
Its nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
Generating association rules:
{I1, I2} ⇒ I5, conf = 2/4 = 50%
{I1, I5} ⇒ I2, conf = 2/2 = 100%
{I2, I5} ⇒ I1, conf = 2/2 = 100%
I1 ⇒ {I2, I5}, conf = 2/6 = 33%
I2 ⇒ {I1, I5}, conf = 2/7 = 29%
I5 ⇒ {I1, I2}, conf = 2/2 = 100%
If min_conf is 70%, only the second, third, and last rules above are output.
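A sketch that reproduces the six confidences above, again assuming the nine-transaction database reconstructed from the vertical-format slide; rules_from and support_count are names of this sketch.

```python
from itertools import combinations

# The same nine-transaction database as above.
D = [
    {"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
    {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
    {"I1","I2","I3"},
]
MIN_CONF = 0.70

def support_count(itemset):
    return sum(1 for t in D if itemset <= t)

def rules_from(l):
    """Enumerate rules A => (l - A) for every nonempty proper subset A of l."""
    l = frozenset(l)
    for r in range(1, len(l)):
        for a in combinations(sorted(l), r):
            a = frozenset(a)
            c = support_count(l) / support_count(a)
            flag = "STRONG" if c >= MIN_CONF else ""
            print(f"{sorted(a)} => {sorted(l - a)}  conf = {c:.0%} {flag}")

rules_from({"I1", "I2", "I5"})  # three of the six rules meet min_conf = 70%
```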
14. Advantages & Disadvantages:
Advantages:
1) Uses the large-itemset property
2) Easily parallelized
3) Easy to implement
Disadvantages:
1) Assumes the transaction database is memory-resident
2) Requires up to 'm' database scans, where m is the size of the longest frequent itemset
15. Mining frequent itemsets without candidate generation
The candidate generate-and-test method:
It may need to generate a huge number of candidate sets.
It may need to repeatedly scan the database and check a large set of candidates by pattern matching.
The frequent-pattern growth (FP-growth) method avoids both costs by compressing the database into a frequent-pattern tree (FP-tree).
20. Mining frequent itemsets using the vertical data format
Transform the horizontal data format of the transaction database D into a vertical data format, mapping each itemset to the set of TIDs of the transactions that contain it:
Itemset TID_set
I1 {T100, T400, T500, T700, T800, T900}
I2 {T100, T200, T300, T400, T600, T800, T900}
I3 {T300, T500, T600, T700, T800, T900}
I4 {T200, T400}
I5 {T100, T800}
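A sketch of this transformation in Python, along with the operation that makes the vertical format useful: the support of any k-itemset is the size of the intersection of its items' TID sets, so no further database scans are needed once the format is built. Variable names are this sketch's own.

```python
# Horizontal format: TID -> items (the database from the earlier slides).
horizontal = {
    "T100": {"I1","I2","I5"}, "T200": {"I2","I4"}, "T300": {"I2","I3"},
    "T400": {"I1","I2","I4"}, "T500": {"I1","I3"}, "T600": {"I2","I3"},
    "T700": {"I1","I3"}, "T800": {"I1","I2","I3","I5"}, "T900": {"I1","I2","I3"},
}

# Vertical format: item -> TID_set.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

print(sorted(vertical["I4"]))  # ['T200', 'T400'], as in the table above

# Support of a k-itemset = size of the intersection of its TID sets.
tids_I1_I2 = vertical["I1"] & vertical["I2"]
print(len(tids_I1_I2))  # 4 = support count of {I1, I2}, matching L2
```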
27. FP-Tree Algorithm:
Input: DB, min_support
Output: FP-tree
1. Scan DB and count all frequent items.
2. Create the null root and set it as the current node.
3. For each transaction T:
Sort T's frequent items in descending frequency order.
For each sorted item I:
Insert I into the tree as a child of the current node, reusing and incrementing an existing child for I if one exists, and make that node current.
Connect each new tree node to the header list for its item.
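A compact sketch of the construction under the algorithm above; the class name FPNode and the per-item node-link lists in the header table are choices of this sketch, not fixed by the slides.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}   # item -> FPNode

def build_fp_tree(transactions, min_support):
    # Pass 1: count items and keep only the frequent ones.
    counts = Counter(i for t in transactions for i in t)
    frequent = {i for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)           # null root
    header = {i: [] for i in frequent}  # item -> node-link list

    # Pass 2: insert each transaction along a shared-prefix path.
    for t in transactions:
        # Keep frequent items, sorted by descending support (ties by name).
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            if item in node.children:   # shared prefix: bump the count
                node = node.children[item]
                node.count += 1
            else:                       # new branch
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)  # link into the header table
                node = child
    return root, header

transactions = [
    {"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
    {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
    {"I1","I2","I3"},
]
root, header = build_fp_tree(transactions, min_support=2)
print(len(header["I2"]))  # 1: every path containing I2 starts at one shared node
```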
29. Adv. & disAdv. Of FP- Growth:
Adv:
1) Only 2 Passes Over Data-set
2) No Candidate Generation
3) Much Faster Than Apriori
DisAdv:
• FP-Tree may not fit in memory.
• FP-Tree is expensive to build
30. Subjects
1) U.M.L.
2) P.P.L.
3) D.M.D.W.
4) O.S.
5) Programming Languages
6) RDBMS
Mr. Nilesh Magar
Lecturer at MIT, Kothrud, Pune.
9975155310.