Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this talk we present a framework for studying graph pattern mining on time-varying streams and large datasets.
Mining Frequent Closed Graphs on Evolving Data Streams
1. Mining Frequent Closed Graphs on Evolving Data Streams
A. Bifet, G. Holmes, B. Pfahringer and R. Gavaldà
University of Waikato, Hamilton, New Zealand
Laboratory for Relational Algorithmics, Complexity and Learning (LARCA), UPC-Barcelona Tech, Catalonia
San Diego, 24 August 2011
17th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining 2011
2. Mining Evolving Graph Data Streams
Problem
Given a data stream D of graphs, find frequent closed graphs.
We provide three algorithms of increasing power: Incremental, Sliding Window, and Adaptive.
[Figure: a table of transaction ids 1–3, each with a small molecule graph over atoms C, S, N, O]
3. Non-incremental
Frequent Closed Graph Mining
CloseGraph: Xifeng Yan, Jiawei Han
right-most extension based on depth-first search
based on gSpan ICDM’02
MoSS: Christian Borgelt, Michael R. Berthold
breadth-first search
based on MoFa ICDM’02
4. Outline
1 Streaming Data
2 Frequent Pattern Mining
Mining Evolving Graph Streams
ADAGRAPHMINER
3 Experimental Evaluation
4 Summary and Future Work
7. Data Streams
Sequence is potentially infinite
High amount of data
High speed of arrival
Once an element from a data stream has been processed, it is discarded or archived
Tools:
approximation
randomization, sampling
sketching
8. Data Streams Approximation Algorithms
1011000111 1010101
Sliding Window
We can maintain simple statistics over sliding windows, using O((1/ε) log² N) space, where
N is the length of the sliding window
ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002
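The O((1/ε) log² N) bound comes from exponential histograms: bucket sizes grow as powers of two and at most k buckets of each size are kept, so the count of 1's in the window is known up to half the size of the oldest bucket. The sketch below illustrates that idea; the class name, merge rule, and parameter k are simplifications of the Datar et al. construction, not a faithful implementation.

```python
class ExpHistogram:
    """Approximate count of 1s in the last `window` stream items."""

    def __init__(self, k=2, window=100):
        self.k = k            # allow at most k buckets per capacity
        self.window = window  # sliding-window length N
        self.buckets = []     # list of (timestamp, capacity), newest first

    def add(self, t, bit):
        # expire buckets whose newest element fell out of the window
        self.buckets = [(ts, c) for ts, c in self.buckets
                        if ts > t - self.window]
        if bit == 1:
            self.buckets.insert(0, (t, 1))
            self._merge()

    def _merge(self):
        # if more than k buckets share a capacity, merge the two oldest
        # into one of double capacity; cascade upward as needed
        cap = 1
        while True:
            idxs = [i for i, (_, c) in enumerate(self.buckets) if c == cap]
            if len(idxs) <= self.k:
                break
            i, j = idxs[-1], idxs[-2]   # the two oldest of this capacity
            ts = self.buckets[j][0]     # keep the newer timestamp
            self.buckets[j] = (ts, 2 * cap)
            del self.buckets[i]
            cap *= 2

    def count(self):
        # exact except for the oldest bucket, which contributes half
        if not self.buckets:
            return 0
        total = sum(c for _, c in self.buckets)
        return total - self.buckets[-1][1] // 2
```

With k buckets per size the relative error is bounded, and memory is O((1/ε) log² N) words as in the result above.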
19. Itemset Mining
Transactions: d1 = abce, d2 = cde, d3 = abce, d4 = acde, d5 = abcde, d6 = bcd

Support  Frequent                      Gen     Closed  Max
6        c                             c       c
5        e, ce                         e       ce
4        a, ac, ae, ace                a       ace
4        b, bc                         b       bc
4        d, cd                         d       cd
3        ab, abc, abe, be, bce, abce   ab, be  abce    abce
3        de, cde                       de      cde     cde
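The table above can be reproduced by brute force. This snippet enumerates all itemsets over the six transactions with minimum support 3 and derives the frequent, closed, and maximal families; it is purely illustrative and exponential in the number of items, which is exactly what miners like CloseGraph are designed to avoid.

```python
from itertools import combinations

transactions = ["abce", "cde", "abce", "acde", "abcde", "bcd"]
items = sorted(set("".join(transactions)))
min_sup = 3

def support(itemset):
    # number of transactions containing every item of the itemset
    return sum(1 for t in transactions if set(itemset) <= set(t))

frequent = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = support(combo)
        if s >= min_sup:
            frequent[frozenset(combo)] = s

# closed: no proper frequent superset has the same support
closed = {x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)}
# maximal: no proper frequent superset at all
maximal = {x for x in frequent if not any(x < y for y in frequent)}
```

Running this recovers the Closed column (c, ce, ace, bc, cd, abce, cde) and the Max column (abce, cde) of the table.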
22. Graph Coresets
Coreset of a set P with respect to some problem
Small subset that approximates the original set P.
Solving the problem for the coreset provides an
approximate solution for the problem on P.
δ-tolerance Closed Graph
A graph g is δ-tolerance closed if none of its proper frequent supergraphs has a weighted support ≥ (1 − δ) · support(g).
Maximal graph: 1-tolerance closed graph
Closed graph: 0-tolerance closed graph
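The definition can be checked mechanically. Below, itemset supports stand in for graph supports; the function name and the small support table are hypothetical examples, with δ = 0 recovering closed patterns and δ = 1 recovering maximal ones.

```python
def delta_tolerance_closed(pattern, supports, delta):
    """True if no proper frequent superset of `pattern` reaches
    (1 - delta) times its support."""
    s = supports[pattern]
    supersets = [q for q in supports if set(pattern) < set(q)]
    return all(supports[q] < (1 - delta) * s for q in supersets)

# a hypothetical chain of frequent patterns with their supports
supports = {"c": 6, "ce": 5, "ace": 4, "abce": 3}
```

With δ = 0 every pattern in this chain is closed (all supports are distinct); with δ = 1 only abce, which has no frequent superset, survives, i.e. the maximal pattern.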
24. Graph Coresets
Relative support of a closed graph
Support of a graph minus the relative support of its closed
supergraphs.
The sum of the closed supergraphs’ relative supports of a
graph and its relative support is equal to its own support.
(s, δ)-coreset for the problem of computing closed graphs
Weighted multiset of frequent δ-tolerance closed graphs with minimum support s, using their relative support as a weight.
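On a chain of closed patterns, the definition gives a direct computation: process patterns from largest to smallest and subtract the relative supports already assigned to closed supersets. The supports below are a hypothetical chain (itemsets standing in for graphs), chosen so the sum-back property is easy to verify.

```python
# hypothetical closed patterns forming a chain c ⊂ ce ⊂ ace ⊂ abce
closed_support = {"c": 6, "ce": 5, "ace": 4, "abce": 3}

def relative_supports(sup):
    """rel(p) = sup(p) - sum of rel(q) over closed proper supersets q."""
    rel = {}
    for p in sorted(sup, key=len, reverse=True):  # largest patterns first
        supers = [q for q in sup if set(p) < set(q)]
        rel[p] = sup[p] - sum(rel[q] for q in supers)
    return rel
```

Here rel(abce) = 3, and each smaller pattern gets relative support 1, so the relative supports of c's closed supersets plus its own relative support sum back to support(c) = 6, as stated above.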
27. Graph Coresets
[Figure: three molecule graphs over atoms C, S, N, O, each with relative support 3 and support 3]
Table: Example of a coreset with minimum support 50% and δ = 1
30. INCGRAPHMINER
INCGRAPHMINER(D, min_sup)
Input: A graph dataset D, and min_sup.
Output: The frequent graph set G.
1 G ← ∅
2 for every batch b_t of graphs in D
3   do C ← CORESET(b_t, min_sup)
4      G ← CORESET(G ∪ C, min_sup)
5 return G
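The loop above can be sketched in runnable form. Here itemsets stand in for graphs, and a plain weighted-support summary stands in for CORESET; the per-batch pruning makes the result approximate, mirroring the coreset idea. All names are illustrative, not the authors' implementation.

```python
from itertools import combinations
from collections import Counter

def summarize(batch, min_sup):
    """Stand-in for CORESET: weighted supports of itemsets in a batch."""
    sup = Counter()
    for t in batch:
        for r in range(1, len(t) + 1):
            for combo in combinations(sorted(t), r):
                sup[combo] += 1
    return Counter({p: s for p, s in sup.items() if s >= min_sup})

def merge(g, c, min_sup):
    # Counter addition sums the weighted supports of both summaries
    total = g + c
    return Counter({p: s for p, s in total.items() if s >= min_sup})

def inc_miner(batches, min_sup):
    g = Counter()                      # G <- empty
    for batch in batches:              # for every batch b_t in D
        c = summarize(batch, min_sup)  # C <- CORESET(b_t, min_sup)
        g = merge(g, c, min_sup)       # G <- CORESET(G u C, min_sup)
    return g
```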
31. WINGRAPHMINER
WINGRAPHMINER(D, W, min_sup)
Input: A graph dataset D, a window size W, and min_sup.
Output: The frequent graph set G.
1 G ← ∅
2 for every batch b_t of graphs in D
3   do C ← CORESET(b_t, min_sup)
4      Store C in sliding window
5      if sliding window is full
6        then R ← oldest C stored in sliding window, negate all support values
7        else R ← ∅
8      G ← CORESET(G ∪ C ∪ R, min_sup)
9 return G
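The negation trick in line 6 can be sketched the same way: when the window is full, the oldest batch summary is merged back with negated supports, cancelling its earlier contribution. Itemset Counters stand in for graph coresets; the summarizer and all names are illustrative, and Python's Counter addition conveniently drops non-positive counts after the cancellation.

```python
from itertools import combinations
from collections import Counter, deque

def summarize(batch, min_sup):
    # stand-in for CORESET: weighted supports of itemsets in a batch
    sup = Counter()
    for t in batch:
        for r in range(1, len(t) + 1):
            for combo in combinations(sorted(t), r):
                sup[combo] += 1
    return Counter({p: s for p, s in sup.items() if s >= min_sup})

def win_miner(batches, w, min_sup):
    g = Counter()
    window = deque()
    for batch in batches:
        c = summarize(batch, min_sup)          # C <- CORESET(b_t)
        window.append(c)                       # store C in the window
        r = Counter()
        if len(window) > w:                    # sliding window is full
            oldest = window.popleft()          # R <- oldest C, negated
            r = Counter({p: -s for p, s in oldest.items()})
        merged = g + c + r                     # '+' drops counts <= 0
        g = Counter({p: s for p, s in merged.items() if s >= min_sup})
    return g
```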
32. Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111
W0 = 1
ADWIN: ADAPTIVE WINDOWING ALGORITHM
1 Initialize Window W
2 for each t > 0
3   do W ← W ∪ {x_t} (i.e., add x_t to the head of W)
4      repeat Drop elements from the tail of W
5      until |µ̂_W0 − µ̂_W1| < ε_c holds
6        for every split of W into W = W0 · W1
7      Output µ̂_W
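A naive quadratic version of the shrink step makes the pseudocode concrete: scan all splits W = W0 · W1 and drop from the tail while any split's means differ by at least a Hoeffding-style threshold. The eps_cut formula follows the shape of ADWIN's ε_cut (with δ the confidence parameter), but this sketch skips the exponential-histogram machinery that makes the real algorithm cost O(log W) per item.

```python
import math

def eps_cut(n0, n1, delta, n):
    # Hoeffding-style cut threshold; m is the harmonic mean of the
    # two sub-window sizes, n the current window length
    m = 1.0 / (1.0 / n0 + 1.0 / n1)
    return math.sqrt((1.0 / (2 * m)) * math.log(4.0 * n / delta))

def adwin_shrink(window, delta=0.01):
    """Repeatedly drop the oldest element while some split W = W0 . W1
    has |mean(W0) - mean(W1)| >= eps_cut. Window is oldest-first."""
    w = list(window)
    changed = True
    while changed and len(w) > 1:
        changed = False
        for cut in range(1, len(w)):
            w0, w1 = w[:cut], w[cut:]
            mu0 = sum(w0) / len(w0)
            mu1 = sum(w1) / len(w1)
            if abs(mu0 - mu1) >= eps_cut(len(w0), len(w1), delta, len(w)):
                w = w[1:]          # drop one element from the tail
                changed = True
                break
    return w
```

On a stream that jumps from 0s to 1s the window shrinks until mostly post-change data remains; on a stationary stream it is left untouched.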
41. Algorithm ADaptive Sliding WINdow
Example
W = 101010110111111    |µ̂_W0 − µ̂_W1| ≥ ε_c : CHANGE DETECTED!
W0 = 101010110    W1 = 111111
After change detection: Drop elements from the tail of W
44. Algorithm ADaptive Sliding WINdow
Theorem
At every time step we have:
1 (False positive rate bound). If µ_t remains constant within W, the probability that ADWIN shrinks the window at this step is at most δ.
2 (False negative rate bound). Suppose that for some partition of W into two parts W0 · W1 (where W1 contains the most recent items) we have |µ_W0 − µ_W1| > 2ε_c. Then with probability 1 − δ, ADWIN shrinks W to W1 or shorter.
ADWIN tunes itself to the data stream at hand, with no need for the user to hardwire or precompute parameters.
45. Algorithm ADaptive Sliding WINdow
ADWIN, using a Data Stream Sliding Window Model:
can provide the exact counts of 1's in O(1) time per point
tries O(log W) cutpoints
uses O((1/ε) log W) memory words
the processing time per example is O(log W) (amortized and worst-case)

Sliding Window Model
Buckets:  1010101  101  11  1  1
Content:     4      2    2  1  1
Capacity:    7      3    2  1  1
46. ADAGRAPHMINER
ADAGRAPHMINER(D, Mode, min_sup)
1 G ← ∅
2 Init ADWIN
3 for every batch b_t of graphs in D
4   do C ← CORESET(b_t, min_sup)
5      R ← ∅
6      if Mode is Sliding Window
7        then Store C in sliding window
8             if ADWIN detected change
9               then R ← batches to remove in sliding window, with negative support
10     G ← CORESET(G ∪ C ∪ R, min_sup)
11     if Mode is Sliding Window
12       then insert number of closed graphs into ADWIN
13       else for every g in G update g's ADWIN
14 return G
50. Experimental Evaluation
ChemDB dataset
Public dataset
4 million molecules
Institute for Genomics and Bioinformatics at the University
of California, Irvine
Open NCI Database
Public domain
250,000 structures
National Cancer Institute
51. What is MOA?
{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
It is closely related to WEKA
It includes a collection of offline and online algorithms as well as tools for evaluation:
classification
clustering
Easy to extend
Easy to design and run experiments
60. Clustering Experimental Setting

Internal measures               External measures
Gamma                           Rand statistic
C Index                         Jaccard coefficient
Point-Biserial                  Folkes and Mallow Index
Log Likelihood                  Hubert Γ statistics
Dunn's Index                    Minkowski score
Tau                             Purity
Tau A                           van Dongen criterion
Tau C                           V-measure
Somer's Gamma                   Completeness
Ratio of Repetition             Homogeneity
Modified Ratio of Repetition    Variation of information
Adjusted Ratio of Clustering    Mutual information
Fagan's Index                   Class-based entropy
Deviation Index                 Cluster-based entropy
Z-Score Index                   Precision
D Index                         Recall
Silhouette coefficient          F-measure

Table: Internal and external clustering evaluation measures.
61. Cluster Mapping Measure
Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes and Bernhard Pfahringer.
An Effective Evaluation Measure for Clustering on Evolving Data Streams
KDD'11
CMM: Cluster Mapping Measure
A novel evaluation measure for stream clustering on evolving streams

CMM(C, CL) = 1 − ( Σ_{o∈F} w(o) · pen(o, C) ) / ( Σ_{o∈F} w(o) · con(o, Cl(o)) )
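Plugging toy numbers into the formula shows how the ratio behaves. Here w(o), pen(o, C), and con(o, Cl(o)) are supplied as hypothetical data for a few faulty objects o ∈ F, whereas the paper derives them from the clustering C and the ground-truth class Cl(o).

```python
def cmm(objects):
    """CMM = 1 - sum(w*pen) / sum(w*con) over the fault set F."""
    num = sum(w * pen for w, pen, con in objects)
    den = sum(w * con for w, pen, con in objects)
    return 1.0 - num / den

# hypothetical (w(o), pen(o, C), con(o, Cl(o))) triples
faulty = [
    (1.0, 0.0, 1.0),
    (1.0, 0.2, 0.8),
    (0.5, 0.1, 0.9),
    (1.0, 0.0, 1.0),
    (0.8, 0.4, 0.7),
]
score = cmm(faulty)   # 1.0 means no penalty; lower means worse clustering
```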
62. Extensions of MOA
Multi-label Classification
Active Learning
Regression
Closed Frequent Graph Mining
Twitter Sentiment Analysis
Challenges for bigger data streams
Sampling and distributed systems (Map-Reduce, Hadoop, S4)
69. Summary
Mining Evolving Graph Data Streams
Given a data stream D of graphs, find frequent closed graphs.
We provide three algorithms of increasing power: Incremental, Sliding Window, and Adaptive.
[Figure: a table of transaction ids 1–3, each with a small molecule graph over atoms C, S, N, O]
70. Summary
{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.
http://moa.cs.waikato.ac.nz
It is closely related to WEKA
It includes a collection of offline and online algorithms as well as tools for evaluation:
classification
clustering
frequent pattern mining
MOA deals with evolving data streams
MOA is easy to use and extend