1. Seminar web data extraction 1 / 26
Seminar web data extraction: Mining uncertain data
Sebastiaan van Schaik
Sebastiaan.van.Schaik@comlab.ox.ac.uk
20 January 2011
Sebastiaan van Schaik
2. Seminar web data extraction > Frequent patterns & association rules 2 / 26
Introduction
Focus of this presentation: mining of frequent patterns and
association rules from (uncertain) data.
Example applications:
discover regularities in customer transactions;
analysing log files: determine how visitors use a website;
Based on:
Mining Uncertain Data with Probabilistic Guarantees[9] (KDD 2010);
Frequent Pattern Mining with Uncertain Data[1] (KDD 2009);
A Tree-Based Approach for Frequent Pattern Mining from Uncertain
Data[6] (PAKDD 2008).
3. Seminar web data extraction > Frequent patterns & association rules 3 / 26
Introduction & running example
Frequent pattern (itemset): a set of items that occurs sufficiently often.
Example: {fever, headache}
Association rule: a set of items implying another set of items.
Example: {fever, headache} ⇒ {nausea}
Patient Diagnosis
t1 Cheng {severe cold}
t2 Andrey {yellow fever, haemochromatosis}
t3 Omer {schistosomiasis, syringomyelia}
t4 Tim {Wilson’s disease}
t5 Dan {Hughes-Stovin syndrome}
t6 Bas {Henoch-Schönlein purpura}
(slide callout: “Yellow fever?”)
Running example: patient diagnosis database
4. Seminar web data extraction > Frequent patterns & association rules 4 / 26
Measuring ‘interestingness’: support & confidence
Support of an itemset X :
sup(X ): number of entries (rows, transactions) that contain X
Confidence of an association rule X ⇒ Y :
conf(X ⇒ Y ) = sup(X ∪ Y ) / sup(X )
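Both measures are a few lines of code. A minimal sketch (the toy symptom database and function names are mine, not from the slides):

```python
# Toy transaction database: each row is a set of observed symptoms.
db = [
    {"fever", "headache", "nausea"},
    {"fever", "headache"},
    {"fever", "nausea"},
    {"headache"},
]

def sup(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

def conf(lhs, rhs, db):
    """Confidence of the rule lhs => rhs: sup(lhs ∪ rhs) / sup(lhs)."""
    return sup(lhs | rhs, db) / sup(lhs, db)

print(sup({"fever", "headache"}, db))               # 2
print(conf({"fever", "headache"}, {"nausea"}, db))  # 0.5
```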
6. Seminar web data extraction > Frequent patterns & association rules 5 / 26
Finding association rules: Apriori (1)
Agrawal et al. introduced Apriori in 1994[2] to mine association rules:
1 Find all frequent itemsets Xi in database D (Xi is frequent iff sup(Xi ) > minsup):
1 Candidate generation: generate all possible itemsets of length k
(starting k = 1) based on frequent itemsets of length k − 1;
2 Test candidates, discard infrequent itemsets;
3 Repeat with k = k + 1.
Important observation: all subsets X′ of a frequent itemset X are
frequent (Apriori property). Used to prune candidates before step (2).
Example: if X′ = {fever} is not frequent in database D, then
X = {fever, headache} cannot be frequent.
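The generate-and-test loop, including the Apriori-property pruning, can be sketched as follows (function name, toy data, and threshold are illustrative):

```python
from itertools import combinations

def apriori(db, minsup):
    """Return all frequent itemsets (as frozensets) with sup > minsup."""
    def sup(itemset):
        return sum(1 for t in db if itemset <= t)

    items = {i for t in db for i in t}
    frequent = {frozenset([i]) for i in items if sup(frozenset([i])) > minsup}
    result = set(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Apriori property: purge candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Test candidates, discard the infrequent ones.
        frequent = {c for c in candidates if sup(c) > minsup}
        result |= frequent
        k += 1
    return result

db = [{"fever", "headache", "nausea"},
      {"fever", "headache"},
      {"fever", "nausea"},
      {"fever", "headache", "nausea"}]
print(sorted(sorted(x) for x in apriori(db, 2)))
```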
8. Seminar web data extraction > Frequent patterns & association rules 6 / 26
Finding association rules: Apriori (2)
Apriori continued:
2 Extract association rules from frequent itemsets X . For each
Xi ∈ X :
1 Generate all non-empty subsets S of Xi . For each S:
2 Test confidence of rule S ⇒ (Xi − S)
Example: itemset X = {fever, headache, nausea} is frequent, test:
{fever, headache} ⇒ {nausea}
{fever, nausea} ⇒ {headache}
{nausea, headache} ⇒ {fever}
{fever} ⇒ {headache, nausea}
(. . . )
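The rule-extraction step can be sketched as follows; note that conf(S ⇒ X − S) = sup(X)/sup(S), so only the subset supports need recomputing (names and toy data are mine):

```python
from itertools import combinations

def rules_from_itemset(itemset, db, minconf):
    """Enumerate rules S => (X - S) whose confidence reaches minconf."""
    def sup(s):
        return sum(1 for t in db if s <= t)

    x = frozenset(itemset)
    out = []
    for r in range(1, len(x)):          # all non-empty proper subsets S of X
        for s in combinations(x, r):
            s = frozenset(s)
            c = sup(x) / sup(s)         # conf(S => X - S) = sup(X) / sup(S)
            if c >= minconf:
                out.append((set(s), set(x - s), c))
    return out

db = [{"fever", "headache", "nausea"},
      {"fever", "headache"},
      {"fever", "headache", "nausea"}]
for lhs, rhs, c in rules_from_itemset({"fever", "headache", "nausea"}, db, 0.6):
    print(lhs, "=>", rhs, round(c, 2))
```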
9. Seminar web data extraction > Introduction to uncertain data 7 / 26
Introduction to uncertain data
Data might be uncertain, for example:
Location detection using multiple RFID sensors (triangulation);
Sensor readings (temperature, humidity) are noisy;
Face recognition;
Patient diagnosis.
Challenge: how do we model uncertainty and
take it into account when mining frequent
itemsets and association rules?
11. Seminar web data extraction > Introduction to uncertain data 8 / 26
Existential probabilities
Existential probability: each item in a tuple carries a probability
expressing how likely the item is to actually occur in that tuple.
Important assumption: tuple and item independence!
Patient Diagnosis (including existential probabilities)
t1 Cheng { 0.9 : a 0.72 : d 0.718 : e 0.8 : f }
t2 Andrey { 0.9 : a 0.81 : c 0.718 : d 0.72 : e }
t3 Omer { 0.875 : b 0.857 : c }
t4 Tim { 0.9 : a 0.72 : d 0.718 : e }
t5 Dan { 0.875 : b 0.857 : c 0.05 : d }
t6 Bas { 0.875 : b 0.1 : f }
Simplified probabilistic diagnosis database (adapted from [6])
13. Seminar web data extraction > Introduction to uncertain data 9 / 26
Possible worlds
D = {t1 , t2 , . . . , tn } (n transactions)
tj = {(p(j,1) , i1 ), . . . , (p(j,m) , im )} (m items in each transaction)
D can be expanded to possible worlds: W = {W1 , . . . , W2^nm }.
Patient Diagnosis (including prob.)
t1 Cheng { 0.9 : a 0.72 : d 0.718 : e 0.8 : f }
t2 Andrey { 0.9 : a 0.81 : c 0.718 : d 0.72 : e }
t3 Omer { 0.875 : b 0.857 : c }
t4 Tim { 0.9 : a 0.72 : d 0.718 : e }
t5 Dan { 0.875 : b 0.857 : c 0.05 : d }
t6 Bas { 0.875 : b 0.1 : f }
Pr[Wx ] = (1 − p(1,a) ) · p(1,d) · (1 − p(1,e) ) · (1 − p(1,f ) ) · p(2,a) · . . . · p(6,f )
= 0.1 · 0.72 · 0.29 · 0.2 · 0.9 · . . . · 0.1
≈ 0.00000021 (one of the 2^18 possible worlds)
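Under the independence assumptions, a small database can be expanded explicitly. This sketch (illustrative names, a two-tuple excerpt of the table) enumerates every possible world with its probability and checks that the probabilities sum to one:

```python
from itertools import product

# Tiny uncertain database: transaction -> {item: existential probability}.
udb = {
    "t1": {"a": 0.9, "d": 0.72},
    "t2": {"a": 0.9, "c": 0.81},
}

pairs = [(t, i, p) for t, items in udb.items() for i, p in items.items()]

def worlds(pairs):
    """Yield (world, probability) for each of the 2^(number of item slots) worlds."""
    for choices in product([True, False], repeat=len(pairs)):
        prob = 1.0
        world = {t: set() for t in udb}
        for (t, i, p), present in zip(pairs, choices):
            prob *= p if present else (1 - p)  # item independence
            if present:
                world[t].add(i)
        yield world, prob

total = sum(p for _, p in worlds(pairs))
print(round(total, 10))  # the worlds form a probability distribution
```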
14. Seminar web data extraction > Mining uncertain data > Introduction 10 / 26
Introduction
Approaches to mining frequent itemsets from uncertain data:
U-Apriori[4] and p-Apriori[9]
UF-growth[6]
UFP-tree[1]
...
Further focus:
UF-growth: mining without candidate generation;
p-Apriori: pruning using Chernoff bounds
16. Seminar web data extraction > Mining uncertain data > Introduction 11 / 26
Expected support
Support of an itemset X turns into a random variable:
E[sup(X )] = ∑_{Wi ∈ W} Pr[Wi ] · sup_{Wi }(X )
Enumerating all possible worlds is infeasible; however, because of the
independence assumptions:
E[sup(X )] = ∑_{tj ∈ D} ∏_{x ∈ X } Pr[x, tj ]
(see [7, 6])
18. Seminar web data extraction > Mining uncertain data > Introduction 12 / 26
Expected support (2)
Patient Diagnosis (including prob.)
t1 Cheng { 0.9 : a 0.72 : d 0.718 : e 0.8 : f }
t2 Andrey { 0.9 : a 0.81 : c 0.718 : d 0.72 : e }
t3 Omer { 0.875 : b 0.857 : c }
t4 Tim { 0.9 : a 0.72 : d 0.718 : e }
t5 Dan { 0.875 : b 0.857 : c 0.05 : d }
t6 Bas { 0.875 : b 0.1 : f }
Expected support of itemset X = {a, d} in patient diagnosis database:
sup_{Wx }(X ) = 2 (in the example world Wx )
E[sup(X )] = ∑_{Wi ∈ W} Pr[Wi ] · sup_{Wi }(X )
= ∑_{tj ∈ D} ∏_{x ∈ X } Pr[x, tj ]
= 0.9 · 0.72 + 0.9 · 0.71 + 0 · 0 + 0.9 · 0.72 + 0 · 0.05 + 0 · 0
= 1.935
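The per-transaction product formula is direct to compute. A sketch over the table above (the slide rounds one factor to 0.71 and reports 1.935; with the table's 0.718 the sum is ≈ 1.9422):

```python
# Uncertain database from the slides: item -> existential probability per patient.
udb = [
    {"a": 0.9,  "d": 0.72,  "e": 0.718, "f": 0.8},   # t1 Cheng
    {"a": 0.9,  "c": 0.81,  "d": 0.718, "e": 0.72},  # t2 Andrey
    {"b": 0.875, "c": 0.857},                        # t3 Omer
    {"a": 0.9,  "d": 0.72,  "e": 0.718},             # t4 Tim
    {"b": 0.875, "c": 0.857, "d": 0.05},             # t5 Dan
    {"b": 0.875, "f": 0.1},                          # t6 Bas
]

def expected_support(itemset, udb):
    """E[sup(X)]: sum over transactions of the product of item probabilities."""
    total = 0.0
    for t in udb:
        p = 1.0
        for x in itemset:
            p *= t.get(x, 0.0)   # an absent item contributes probability 0
        total += p
    return total

print(round(expected_support({"a", "d"}, udb), 4))  # ≈ 1.9422
```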
19. Seminar web data extraction > Mining uncertain data > Introduction 13 / 26
Frequent itemsets in probabilistic databases
An itemset X is frequent iff:
UF-growth: E[sup(X )] > minsup (also used in [4, 1] and many others)
p-Apriori: Pr[sup(X ) > minsup] ≥ minprob
22. Seminar web data extraction > Mining uncertain data > UF-growth 14 / 26
Introduction to UF-growth
Apriori versus UF-growth:
Apriori-like algorithms generate and test candidate itemsets;
UF-growth[6] (based on FP-growth[5]) grows a tree based on a
probabilistic database.
Outline of procedure (example follows):
1 First scan: determine expected support of all items;
2 Second scan: create branch for each transaction (merging
identical nodes when possible). Each node contains:
An item;
Its probability;
Its occurrence count.
Example: (a, 0.9, 2)
An itemset X is frequent iff: E[sup(X )] > minsup
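A minimal sketch of the second scan, assuming the simplest UF-tree variant in which a child node is reused only when item and probability both match (class and function names are mine, not from [6]):

```python
class UFNode:
    """UF-tree node: an item, its existential probability, an occurrence count."""
    def __init__(self, item=None, prob=None):
        self.item, self.prob = item, prob
        self.count = 0
        self.children = {}   # (item, prob) -> UFNode

def build_uf_tree(udb, order):
    """Insert each transaction as a branch, merging a node only when
    both the item AND its probability are identical."""
    root = UFNode()
    for t in udb:
        node = root
        for item in order:            # items in fixed (frequency) order
            if item not in t:
                continue
            key = (item, t[item])
            node = node.children.setdefault(key, UFNode(item, t[item]))
            node.count += 1           # e.g. (a, 0.9, 2) after two insertions
    return root

udb = [
    {"a": 0.9, "d": 0.72, "e": 0.718},
    {"a": 0.9, "d": 0.718, "e": 0.72},
    {"a": 0.9, "d": 0.72, "e": 0.718},
]
tree = build_uf_tree(udb, order=["a", "d", "e"])
a_node = tree.children[("a", 0.9)]
print(a_node.count)          # 3: identical (a, 0.9) occurrences merge
print(len(a_node.children))  # 2: d with 0.72 and d with 0.718 stay separate
```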
23. Seminar web data extraction > Mining uncertain data > UF-growth 15 / 26
UF-tree example (1)
Patient Diagnosis (including prob.)
t1 Cheng { 0.9 : a 0.72 : d 0.718 : e 0.8 : f }
t2 Andrey { 0.9 : a 0.81 : c 0.718 : d 0.72 : e }
t3 Omer { 0.875 : b 0.857 : c }
t4 Tim { 0.9 : a 0.72 : d 0.718 : e }
t5 Dan { 0.875 : b 0.857 : c 0.05 : d }
t6 Bas { 0.875 : b 0.1 : f }
1) determine exp. support 2) build tree
E [sup({a})] = 2.7
E [sup({b})] = 2.625
E [sup({c})] = 2.524
E [sup({d})] = 2.20875
E [sup({e})] = 2.1575
E [sup({f })] = 0.9 (from [6])
24. Seminar web data extraction > Mining uncertain data > UF-growth 16 / 26
UF-tree example (2)
Extract frequent patterns from the UF-tree:
E [sup({a, e})] = 1 · 0.72 · 0.9 + 2 · 0.71875 · 0.9 = 1.94175
E [sup({c, e})] = 1 · 0.72 · 0.81 = 0.5832
E [sup({d, e})] = 1 · 0.72 · 0.71875 + 2 · 0.71875 · 0.72 = 1.5525
E [sup({a, d, e})] = 1 · 0.9 · 0.71875 · 0.72 + 2 · 0.9 · 0.72 · 0.71875 = 1.39725
25. Seminar web data extraction > Mining uncertain data > UF-growth 17 / 26
UF-growth continued
Mining larger itemsets can be done more efficiently using tree
projections.
Remarks:
Nodes can only be merged when items have identical probabilities
(otherwise, all occurrence counts equal 1);
Suggested solution in [6]: rounding of probabilities;
Other solution (from [1]): store a carefully constructed summary
of probabilities in each node. Might yield overestimation of
expected support.
27. Seminar web data extraction > Mining uncertain data > p-Apriori 18 / 26
Introduction to p-Apriori
Apriori has been extended to support uncertainty;
New pruning techniques[9, 4, 3] improve efficiency;
Note: the apriori (“downwards closure”) property still holds in the
probabilistic case[1];
Goal: prune candidates, saving as much time as possible.
In p-Apriori, an itemset X is frequent iff:
Pr[sup(X ) > minsup] ≥ minprob
29. Seminar web data extraction > Mining uncertain data > p-Apriori 19 / 26
p-Apriori: advanced frequent itemset mining
Sun et al. [9] use a simplified approach to modelling uncertainty: each
tuple ti is associated with an existential probability pi .
In p-Apriori: itemset X is frequent if and only if:
Pr[sup(X ) > minsup] ≥ minprob
Let cnt(X ) denote the number of tuples containing X ; then:
cnt(X ) < minsup ⇒ X cannot be frequent
Chernoff bounds¹ provide a strict bound on the tail distributions of
sums of independent random variables.
¹ Interesting course: Probability & Computing by James Worrell
31. Seminar web data extraction > Mining uncertain data > p-Apriori 20 / 26
p-Apriori: pruning using Chernoff Bounds (1)
Each tuple ti is associated with an existential probability pi . Then:
Yi = 1 with probability pi , 0 with probability 1 − pi
Y = ∑i Yi = sup(X )
Furthermore:
µ = E[sup(X )]
δ = (minsup − µ − 1) / µ
Pr[sup(X ) ≥ minsup] = Pr[sup(X ) > minsup − 1]
= Pr[sup(X ) > (1 + δ)µ]
32. Seminar web data extraction > Mining uncertain data > p-Apriori 21 / 26
p-Apriori: pruning using Chernoff Bounds (2)
Using a Chernoff bound (see [8], theorem 4.3 and exercise 4.1):
Pr[sup(X ) ≥ minsup] < 2^(−δµ) if δ ≥ 2e − 1, e^(−δ²µ/4) otherwise
Therefore: an itemset X cannot be frequent if:
for δ ≥ 2e − 1 : 2^(−δµ) < minprob
for 0 < δ < 2e − 1 : e^(−δ²µ/4) < minprob
Example with minprob = 0.4, minsup = 9 and E[sup(X )] = 3:
δ = (9 − 3 − 1)/3 = 5/3, hence
e^(−δ²µ/4) = e^(−(5/3)²·3/4) = e^(−25/12) ≈ 0.125 < minprob
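The pruning test is a few lines; `chernoff_upper_bound` and `can_prune` are illustrative names, and a guard for δ ≤ 0 (where the bound says nothing) is added:

```python
import math

def chernoff_upper_bound(mu, minsup):
    """Upper bound on Pr[sup(X) >= minsup] given expected support mu (see [8])."""
    delta = (minsup - mu - 1) / mu
    if delta <= 0:
        return 1.0                       # bound is vacuous: cannot prune
    if delta >= 2 * math.e - 1:
        return 2.0 ** (-delta * mu)
    return math.exp(-delta * delta * mu / 4)

def can_prune(mu, minsup, minprob):
    """X cannot be frequent if the bound already falls below minprob."""
    return chernoff_upper_bound(mu, minsup) < minprob

# The slide's example: minprob = 0.4, minsup = 9, E[sup(X)] = 3.
bound = chernoff_upper_bound(3, 9)       # e^(-25/12) ≈ 0.125
print(round(bound, 3))
print(can_prune(3, 9, 0.4))              # True: X is pruned
```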
33. Seminar web data extraction > Mining uncertain data > p-Apriori 22 / 26
p-Apriori: finding frequent patterns (DP)
The p-Apriori algorithm for finding frequent patterns resembles Apriori:
1 Generate set of candidate k-itemsets Ck based on frequent
itemsets of length k − 1
2 For each itemset X ∈ Ck :
1 Try pruning by using apriori property
2 Compute cnt(X ), try pruning using Chernoff bound
3 For each remaining itemset X ∈ Ck : compute the pmf of sup(X ) in
O(n²) time, compare Pr[sup(X ) > minsup] against minprob
(association rules can be mined using the frequent patterns)
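Step 3's pmf can be computed exactly with the classic O(n²) dynamic programming over the Poisson binomial distribution. A sketch under the simplified model of [9], where tuple i contains X with probability probs[i] (function name is mine):

```python
def prob_frequent(probs, minsup):
    """Exact Pr[sup(X) >= minsup] when tuple i contains X independently
    with probability probs[i]. O(n^2) dynamic programming."""
    # pmf[k] = Pr[exactly k of the tuples processed so far contain X]
    pmf = [1.0]
    for p in probs:
        new = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            new[k] += q * (1 - p)       # this tuple does not contain X
            new[k + 1] += q * p         # this tuple contains X
        pmf = new
    return sum(pmf[minsup:])

# Three tuples each containing X with probability 0.5:
# Pr[sup(X) >= 2] = 3/8 + 1/8 = 0.5
print(prob_frequent([0.5, 0.5, 0.5], 2))
```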
34. Seminar web data extraction > Summary & conclusion 23 / 26
Summary & conclusion
Data mining of uncertain data is a new, fast-moving field;
Data uncertainty introduces a significant complexity layer;
Different algorithms use different definitions and models;
Algorithm performance greatly depends on data.
35. Seminar web data extraction > References 24 / 26
References I
[1] C. C. Aggarwal, Y. Li, and Jing Wang. Frequent pattern mining with
uncertain data. In Proceedings of the 15th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 29–37, 2009.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association
rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors,
Proceedings of the 20th International Conference on Very Large Data
Bases (VLDB), 1994.
[3] C. K. Chui and B. Kao. A decremental approach for mining frequent
itemsets from uncertain data. In Proceedings of the 12th Pacific-Asia
Conference on Advances in Knowledge Discovery and Data Mining,
pages 64–75, 2008.
36. Seminar web data extraction > References 25 / 26
References II
[4] C. K. Chui, Ben Kao, and Edward Hung. Mining frequent itemsets from
uncertain data. In Advances in Knowledge Discovery and Data Mining,
pages 47–58, 2007.
[5] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without
candidate generation: A frequent-pattern tree approach. Data Mining
and Knowledge Discovery, 8(1):53–87, 2004.
[6] C. Leung, M. Mateo, and D. Brajczuk. A tree-based approach for
frequent pattern mining from uncertain data. In Advances in Knowledge
Discovery and Data Mining, pages 653–661, 2008.
37. Seminar web data extraction > References 26 / 26
References III
[7] C. K. S. Leung, B. Hao, and F. Jiang. Constrained frequent itemset
mining from uncertain data streams. pages 120–127, 2010.
[8] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge
University Press, 1995.
[9] Liwen Sun, R. Cheng, and D. W. Cheung. Mining uncertain data with
probabilistic guarantees. In Proceedings of the 16th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining,
pages 273–282, 2010.
(Recommended by Dan Olteanu; to be read by Nov 12, 4pm.)