SlideShare una empresa de Scribd logo
1 de 30
Descargar para leer sin conexión
Mining Sequential
Patterns
1
Sequential Pattern Mining
 Sequence database – consists of sequences of ordered
elements or events (with or without time)
 Sequential Pattern Mining is the mining of frequently
occurring ordered events or subsequences as patterns
 Example:
 Customer shopping sequences:
 First buy computer, then CD-ROM, and then digital camera, within 3
months.
 Web access patterns, Weather prediction
 Usually categorical or symbolic data
 Numeric data analysis – Time Series Analysis
2
Sequential Pattern Mining
 I = {I1, I2, …Ip} – Set of items
 Sequence s = <e1 e2 e3… el>
 Ordered list of events
 Each event is an element of the sequence
 In Shopping databases, event – shopping trip in which customers bought items (x1 x2 …xq)
 All trips by customer form a sequence
 Item can occur at most once in an event, but several times in a sequence
 Sequence with length l : l-sequence
 A sequence α = <a1a2…an> is a subsequence of β = <b1b2..bm> denoted as α
⊆ β if there exists integers j1, j2, … jn between 1 and m such that a1 ⊆ bj1, a2 ⊆
bj2,… an ⊆ bjn
 α = <(ab),d> and β = <(abc),(de)> α is a sub-sequence of β
3
Sequential Pattern Mining
 A sequence database, S, is a set of tuples, <SID,s>, where SID is a
sequence ID and s is a sequence
 A tuple <SID, s> is said to contain a sequence a, if a is a
subsequence of s
 The support of a sequence α in a sequence database S is the
number of tuples in the database containing α
 Given the minimum support threshold, a sequence a is frequent in
sequence database S if support S(a) >= min sup
 A frequent sequence is called a sequential pattern
 A sequential pattern with length l is called an l-pattern
4
Sequential Patterns
 Length of Sequence 1 – 9
 It contains ‘a’ multiple times but sequence 1 will contribute only one
to the support of <a>
 Sequence <a(bc)df> - Subsequence of Sequence 1
 Support for <(ab)c> is 2 (Present in 1 and 3)
 Frequent as it satisfies minimum support of 2
5
SID Sequence
1 <a(abc)(ac)d(cf)>
2 <(ad)c(bc)(ae)>
3 <(ef)(ab)(df)cb>
4 <eg(af)cbc>
Scalable Methods for Mining
Sequential Patterns
 Full Set Vs Closed Set
 A sequential patterns s is closed if there exists no s’ where s’ is a proper
super-sequence of s and s’ has same support as s
 Subsequences of a frequent sequence are also ferquent
 Mining closed sequential patterns avoids generation of unnecessary sub-sequences
 GSP – Candidate generate and test approach on horizontal data
format
 SPADE – Candidate generate and test approach on vertical data
format
 PrefixScan – does not require candidate generation
 All approaches exploit Apriori property – every non empty
subsequence of a sequential pattern is a sequential pattern
6
GSP: Sequential Pattern Mining
 GSP (Generalized Sequential Patterns)
 Multi-pass, Candidate generate and test approach proposed by Agrawal
and Srikant
 Outline of the method
 Initially, every item in DB is a candidate of length-1
 for each level (i.e., sequences of length-k) do
 Scan database to collect support count for each candidate sequence
 Generate candidate length-(k+1) sequences from length-k frequent
sequences using Apriori
 A (k-1)-sequence w1 is merged with another (k-1)-sequence w2 to produce a
candidate k-sequence if the subsequence obtained by removing the first event in w1
is the same as the subsequence obtained by removing the last event in w2
 repeat until no frequent sequence or no candidate can be found
 Major strength: Candidate pruning by Apriori
 Weakness : Generates large number of candidates
7
Generalized Sequential Patterns
8
 Initial candidates: all singleton sequences
 <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
 Scan database once, count support for candidates
Seq. ID Sequence
1 <(cd)(abc)(abf)(acdf)>
2 <(abf)(e)>
3 <(abf)>
4 <(dgh)(bf)(agh)>
Cand Sup
<a> 4
<b> 4
<c> 1
<d> 2
<e> 1
<f> 4
<g> 1
<h> 1
Generalized Sequential Patterns
9
Cand Sup
<a> 4
<b> 4
<d> 2
<f> 4
Length 2 Candidates generated by joinLength 2 Candidates generated by join
<aa> <ab> <ad> <af> <ba> <bb> <bd> <bf>
<da> <db><dd> <df> <fa> <fb> <fd> <ff>
<(ab)> <(ad)> <(af)> <(bd)> <(bf)> <(df)>
Seq. ID Sequence
1 <(cd)(abc)(abf)(acdf)>
2 <(abf)(e)>
3 <(abf)>
4 <(dgh)(bf)(agh)>
Length 2 Frequent SequencesLength 2 Frequent Sequences
<ba> <da> <db> <df> <fa>
<(ab)> <(af)> <(bf)>
Generalized Sequential Patterns
10
Length 2 Frequent SequencesLength 2 Frequent Sequences
<ba> <da> <db> <df> <fa>
<(ab)> <(af)> <(bf)>
Length 3 Candidates generated by joinLength 3 Candidates generated by join
<ba> and <(ab)> - <b(ab)> {1}
<ba> and <(af)> - <b(af)> {1}
<da> and <(ab)> - <d(ab)> {1}
<da> and <(af)> - <d(af)> {1}
<db> and <(bf)> - <d(bf)> {1, 4}
<db> and <ba> - <dba> {1, 4}
<df> and <fa> - <dfa> {1, 4}
<fa> and <(ab)> - <f(ab)> -
<fa> and <(af)> - <f(af)> {1}
<(ab)> and <(bf)> - <(abf)> {1,2,3}
<(ab)> and <ba> - <(ab)a> {1}
<(af)> and <fa> - <(af)a) {1}
<(bf)> and <fa> - <(bf)a> {1, 4}
Seq. ID Sequence
1 <(cd)(abc)(abf)(acdf)>
2 <(abf)(e)>
3 <(abf)>
4 <(dgh)(bf)(agh)>
Length 3 Frequent SequencesLength 3 Frequent Sequences
<dba> <dfa> <(abf)> <(bf)a> <d(bf)>
Length 4 Candidates generated by joinLength 4 Candidates generated by join
<d(bf)> and <(bf)a> - <d(bf)a> {1, 4}
<(abf)> and <(bf)a> - <(abf)a> {1}
SPADE: Sequential Pattern Mining on
Vertical data format
 SPADE – Sequential PAttern Discovery using Equivalent classes
 Vertical data format <itemset: (sequence_ID, event_ID)>
 ID_list : Set of (sequence_ID, event_ID) pairs for a given itemset
 Mapping from horizontal to vertical format requires one scan
 Support of k-sequences can be determined by joining the ID lists of
(k-1) sequences
 To find candidate 2-sequences, all single items are joined if they are
frequent, if they share the same sequence identifier and if their event
identifier follows a sequential ordering
 Patterns are grown similarly
11
Vertical Data Format
12
SPADE - Example
13
Seq. ID Sequence
1 <(cd)(abc)(abf)(acdf)>
2 <(abf)(e)>
3 <(abf)>
4 <(dgh)(bf)(agh)>
SID EID Itemset
1 1 cd
1 2 abc
1 3 abf
1 4 acdf
2 1 abf
2 2 e
3 1 abf
4 1 dgh
4 2 bf
4 3 agh
Vertical Data Format
SPADE - Example
14
SID EID Itemset
1 1 cd
1 2 abc
1 3 abf
1 4 acdf
2 1 abf
2 2 e
3 1 abf
4 1 dgh
4 2 bf
4 3 agh
Vertical Data Format
SID EID
1 2
1 3
1 4
2 1
3 1
4 3
a
SID EID
1 2
1 3
2 1
3 1
4 2
b
SID EID
1 1
1 2
1 4
c
SID EID
1 1
1 4
4 1
d
SID EID
2 2
e
SID EID
1 3
1 4
2 1
3 1
4 2
f
SID EID
4 1
4 3
g
SID EID
4 1
4 3
h (ab) – 1,2; 1,3; 2,1;
3,1
(ad) – 1,4
(af) – 1,4; 2,1;3,1
(bd) – NIL
(bf) – 1,3; 2,1; 3,1;4,2
(df) – 1,4
SPADE - Example
15
SID EID
1 2
1 3
1 4
2 1
3 1
4 3
a
SID EID
1 2
1 3
2 1
3 1
4 2
b
SID EID
1 1
1 4
4 1
d
SID EID
1 3
1 4
2 1
3 1
4 2
f
SID EID(a) EID(a)
1 2 3
1 3 4
1 2 4
aa
SID EID(a) EID(b)
1 2 3
ab
SID EID(a) EID(d)
1 2 4
1 3 4
ad
SID EID(a) EID(b)
1 2 3
1 3 4
af
SID EID(b) EID(a)
1 2 3
1 3 4
4 2 3
ba
SID EID(b) EID(b)
1 2 3
bb
SID EID(b) EID(d)
1 2 4
1 3 4
bd
SID EID(b) EID(f)
1 2 3
1 3 4
bf
SPADE - Example
16
SID EID
1 2
1 3
1 4
2 1
3 1
4 3
a
SID EID
1 2
1 3
2 1
3 1
4 2
b
SID EID
1 1
1 4
4 1
d
SID EID
1 3
1 4
2 1
3 1
4 2
f
SID EID(d) EID(a)
1 1 2
1 1 3
1 1 4
4 1 3
da
SID EID(d) EID(b)
1 1 2
1 1 3
4 1 2
db
SID EID(d) EID(d)
1 1 4
dd
SID EID(d) EID(f)
1 1 3
1 1 4
4 1 2
df
SID EID(f) EID(a)
1 3 4
4 2 3
fa
SID EID(f) EID(b)
- - -
fb
SID EID(f) EID(d)
1 3 4
fd
SID EID(f) EID(f)
1 3 4
ff
SPADE
 Reduces scans of the sequence database
 As the length of the frequent sequence increases, the
size of ID_list decreases – results in fast joins
 But large set of candidates are still generated
17
PrefixSpan: Prefix-Projected Sequential
Pattern Growth
 Pattern Growth – does not require candidate
generation
 Finds frequent single items
 Constructs FP-tree
 Projected databases associated with each frequent
item are generated from FP-tree
 Builds prefix patterns which it concatenates with suffix
patterns to find frequent patterns
18
PrefixSpan
 Given a sequence α = <e1e2…en> (where each ei corresponds to a
frequent event), a sequence β = <e’1e’2…e’m> (m<=n) is called a
prefix of α iff
 e'i = ei for i<=m-1
 e'm ⊆ em
 All frequent items in (em– e’m) are alphabetically after those in e’m
 Sequence γ = <e”m em+1…en> is called the suffix of α wrt prefix β
denoted as γ = α/β where e”m = (em – e’m )
19
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>
Prefix Suffix (Prefix-Based Projection)
<a> <(abc)(ac)d(cf)>
<aa> <(_bc)(ac)d(cf)>
<a(ab)> <(_c)(ac)d(cf)>
PrefixSpan
 Mining Sequential patterns:
 {<x1>,<x2>,…<xn>} – complete set of length-1 sequential patterns.
 The complete set of Sequential patterns in S can be partitioned into n disjoint
subsets.
 ith
subset is the set of sequential patterns with prefix <xi>
 α : length-l sequential pattern and {β1, β2,… βm} – set of all length
(l+1) sequential patterns with prefix α.
 The complete set of sequential patterns with α as a prefix can be partitioned
into m disjoint subsets
 jth
subset is the set of patterns prefixed with βj
20
PrefixSpan Example
 Step 1: find length-1 sequential patterns
 <a>, <b>, <c>, <d>, <e>, <f>
 Step 2: divide search space. The complete set of seq.
pat. can be partitioned into 6 subsets:
 The ones having prefix <a>;
 The ones having prefix <b>;
 …
 The ones having prefix <f>
21
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
PrefixSpan : Example
 Only need to consider projections w.r.t. <a>
 <a>-projected database: (only first occurrence of a is considered) <(abc)
(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
 In <a> projected database – frequent items are a:2, b:4, _b:2, c:4, d:2
and f:2
 Find all the length-2 seq. pat. Having prefix <a>: <aa>, <ab>,
<(ab)>, <ac>, <ad>, <af>
 Further partition into 6 subsets
 Having prefix <aa> - {< (_bc)(ac)d(cf)>, <(_e)>} – No frequent subsequences
 Having prefix <(ab)> - <(_c)(ac)d(cf)> and <(df)cb>
 Frequent patterns: <c> <d> <f> <dc> : <(ab)c> <(ab)d> <(ab)f> <(ab)dc>
 Having prefix <ac> - <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
 Frequent patterns: <a> <b> <c> : <aca> <acb> <acc>
….
22
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
PrefixSpan
 No candidate sequence needs to be generated
 Projected databases keep shrinking
 Major cost of PrefixSpan: constructing projected databases
 Can be improved by pseudo-projections
 Pseudo-projection
 Registers the index of the corresponding sequence and the starting
position of the projected suffix in the sequence instead of physical
projection
 Avoids physically copying postfixes
 Efficient in running time and space when database can be held in main
memory
 For large data combination of physical and pseudo projection
23
Sequential Pattern Mining
 Performance rating : PrefixSpan, SPADE, GSP
 All three are slow when there is a large number of frequent
subsequences
 Closed Sequential Patterns
 Closed Subsequences – contain no super sequence with the same support
 Reduces number of sequences considered
 CloSpan
 Based on equivalence of projected databases
 Two projected sequence databases are equivalent iff the total number of items match
 CloSpan prunes non-closed sequences whenever two projected databases are exactly the same by
stopping the growth of one
 Requires Post-processing to eliminate any remaining non-closed sequential patterns
 BIDE (Bidirectional Search) algorithm avoids additional checking
24
Sequential Pattern Mining
 Multi-dimensional, multi-level Sequential patterns
 Additional information maybe associated with Sequence ID –
Customer age, address, group and profession
 Additional information associated with items – item category, brand,
model type, model number, place…
 Example: “Retired customers who purchase a digital camera are
likely to purchase a color printer within a month”
 Additional information can be attached with Sequence ID / Item ID
25
Constraint based Mining of
Sequential Patterns
 Un-focused mining reduces the efficiency and usability of
frequent-pattern mining
 Constraint based mining incorporates user-specified
constraints to reduce the search space
 Regular expressions can be used to specify pattern templates
 Helps to improve efficiency of mining and interestingness of
patterns
26
Constraint based Mining
 Constraints related to duration T
 Constraints related to maximal or minimal length – anti-
monotonic / monotonic constraints
 Anti-monotonic constraint: T<= 10
 Monotonic : T > 10
 Succinct : T = 2005
 Periodic patterns – related to sets of partitioned sequences such
as every two weeks before and after an earthquake
27
Constraint based Mining
 Event folding window w
 specifies the periodicity for treating events as occurring together
 w=0 – No event sequence folding
 w=T – time-insensitive frequent patterns
 Gap between events
 Gap=0 – Strictly consecutive sequential patterns
 min_gap and max_gap
 Exact gaps and approximate gaps
 Serial Episodes
 Set of events occurring in total order
 Parallel Episodes
 Occurrence order is trivial
 Examples
 (A|B)C*(D|E) – A and B first occur (relative order is unimportant) followed by
any number of C events followed by D and E (in any order)
 C = <a*{bb|(bc)d|dd}> a-projected databases followed by SuffixSpan
28
Periodicity Analysis for Time-Related
Sequence data
 Periodicity analysis – mining of periodic patterns – searching for
recurring patterns in time-related sequence data
 Seasons, tides, traffic patterns, power consumption
 Often performed over time-series data
 Full periodic pattern
 Every point in time contributes (precisely or approximately) to the periodicity
 Example: All days in a year – contribute to season cycle
 Partial periodic pattern
 Specifies periodic behavior of a time-related sequence at some but not all points
of time
 Example: ABC reads the paper between 7:00-7:30 am every week day
29
Periodicity Analysis
 Synchronous periodicity – event occurs at a relatively fixed offset in
each “stable” period
 3 pm every day
 Asynchronous periodicity – event fluctuates in loosely defined
period
 Precise or approximate – depending on data value or offset within a
period
 Mining partial periodicity – leads to the discovery of cyclic or
periodic association rules (rules that associate a set of events that
occur periodically)
 Example: If tea sells well between 3 – 5 pm dinner will also sell well
between 7 – 9 pm on weekends
30

Más contenido relacionado

La actualidad más candente

Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceMaryamRehman6
 
Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Aiswaryadevi Jaganmohan
 
Parallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval SystemParallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval Systemvimalsura
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methodsKrish_ver2
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and predictionDataminingTools Inc
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Salah Amean
 
4.3 multimedia datamining
4.3 multimedia datamining4.3 multimedia datamining
4.3 multimedia dataminingKrish_ver2
 
Density based methods
Density based methodsDensity based methods
Density based methodsSVijaylakshmi
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Salah Amean
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data miningZHAO Sam
 
Semantic nets in artificial intelligence
Semantic nets in artificial intelligenceSemantic nets in artificial intelligence
Semantic nets in artificial intelligenceharshita virwani
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsMd. Main Uddin Rony
 

La actualidad más candente (20)

Decision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data scienceDecision tree induction \ Decision Tree Algorithm with Example| Data science
Decision tree induction \ Decision Tree Algorithm with Example| Data science
 
Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641Data mining-primitives-languages-and-system-architectures2641
Data mining-primitives-languages-and-system-architectures2641
 
Sequential Pattern Mining and GSP
Sequential Pattern Mining and GSPSequential Pattern Mining and GSP
Sequential Pattern Mining and GSP
 
Parallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval SystemParallel and Distributed Information Retrieval System
Parallel and Distributed Information Retrieval System
 
3.3 hierarchical methods
3.3 hierarchical methods3.3 hierarchical methods
3.3 hierarchical methods
 
Dynamic Itemset Counting
Dynamic Itemset CountingDynamic Itemset Counting
Dynamic Itemset Counting
 
Data mining: Classification and prediction
Data mining: Classification and predictionData mining: Classification and prediction
Data mining: Classification and prediction
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
Clustering
ClusteringClustering
Clustering
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
 
4.3 multimedia datamining
4.3 multimedia datamining4.3 multimedia datamining
4.3 multimedia datamining
 
Lect12 graph mining
Lect12 graph miningLect12 graph mining
Lect12 graph mining
 
Density based methods
Density based methodsDensity based methods
Density based methods
 
Lecture13 - Association Rules
Lecture13 - Association RulesLecture13 - Association Rules
Lecture13 - Association Rules
 
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...Data Mining:  Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
Data Mining: Concepts and Techniques_ Chapter 6: Mining Frequent Patterns, ...
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data mining
 
Fuzzy Clustering(C-means, K-means)
Fuzzy Clustering(C-means, K-means)Fuzzy Clustering(C-means, K-means)
Fuzzy Clustering(C-means, K-means)
 
Apriori Algorithm
Apriori AlgorithmApriori Algorithm
Apriori Algorithm
 
Semantic nets in artificial intelligence
Semantic nets in artificial intelligenceSemantic nets in artificial intelligence
Semantic nets in artificial intelligence
 
Classification Based Machine Learning Algorithms
Classification Based Machine Learning AlgorithmsClassification Based Machine Learning Algorithms
Classification Based Machine Learning Algorithms
 

Similar a 5.3 mining sequential patterns

ARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptChellamuthuHaripriya
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodShani729
 
Cs501 mining frequentpatterns
Cs501 mining frequentpatternsCs501 mining frequentpatterns
Cs501 mining frequentpatternsKamal Singh Lodhi
 
Mining Approach for Updating Sequential Patterns
Mining Approach for Updating Sequential PatternsMining Approach for Updating Sequential Patterns
Mining Approach for Updating Sequential PatternsIOSR Journals
 
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kambererror007
 
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...Chuancong Gao
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Maté Ongenaert
 
Continuation Passing Style and Macros in Clojure - Jan 2012
Continuation Passing Style and Macros in Clojure - Jan 2012Continuation Passing Style and Macros in Clojure - Jan 2012
Continuation Passing Style and Macros in Clojure - Jan 2012Leonardo Borges
 
Declare Your Language: Virtual Machines & Code Generation
Declare Your Language: Virtual Machines & Code GenerationDeclare Your Language: Virtual Machines & Code Generation
Declare Your Language: Virtual Machines & Code GenerationEelco Visser
 
PrefixSpan With Spark at Akanoo
PrefixSpan With Spark at AkanooPrefixSpan With Spark at Akanoo
PrefixSpan With Spark at AkanooAkanoo
 
Workshop on command line tools - day 2
Workshop on command line tools - day 2Workshop on command line tools - day 2
Workshop on command line tools - day 2Leandro Lima
 
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...Databricks
 
Periodic pattern mining
Periodic pattern miningPeriodic pattern mining
Periodic pattern miningAshis Chanda
 

Similar a 5.3 mining sequential patterns (20)

My6asso
My6assoMy6asso
My6asso
 
ARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .pptARM_03_FPtreefrequency pattern data warehousing .ppt
ARM_03_FPtreefrequency pattern data warehousing .ppt
 
Graph mining ppt
Graph mining pptGraph mining ppt
Graph mining ppt
 
03
0303
03
 
Frequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth methodFrequent itemset mining using pattern growth method
Frequent itemset mining using pattern growth method
 
Cs501 mining frequentpatterns
Cs501 mining frequentpatternsCs501 mining frequentpatterns
Cs501 mining frequentpatterns
 
Mining Approach for Updating Sequential Patterns
Mining Approach for Updating Sequential PatternsMining Approach for Updating Sequential Patterns
Mining Approach for Updating Sequential Patterns
 
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & KamberChapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
Chapter - 8.3 Data Mining Concepts and Techniques 2nd Ed slides Han & Kamber
 
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
ICDM 2011 - Efficient Mining of Closed Sequential Patterns on Stream Sliding ...
 
SPADE -
SPADE - SPADE -
SPADE -
 
Workshop NGS data analysis - 2
Workshop NGS data analysis - 2Workshop NGS data analysis - 2
Workshop NGS data analysis - 2
 
Continuation Passing Style and Macros in Clojure - Jan 2012
Continuation Passing Style and Macros in Clojure - Jan 2012Continuation Passing Style and Macros in Clojure - Jan 2012
Continuation Passing Style and Macros in Clojure - Jan 2012
 
presentation
presentationpresentation
presentation
 
Declare Your Language: Virtual Machines & Code Generation
Declare Your Language: Virtual Machines & Code GenerationDeclare Your Language: Virtual Machines & Code Generation
Declare Your Language: Virtual Machines & Code Generation
 
PrefixSpan With Spark at Akanoo
PrefixSpan With Spark at AkanooPrefixSpan With Spark at Akanoo
PrefixSpan With Spark at Akanoo
 
Seminar PSU 10.10.2014 mme
Seminar PSU 10.10.2014 mmeSeminar PSU 10.10.2014 mme
Seminar PSU 10.10.2014 mme
 
Workshop on command line tools - day 2
Workshop on command line tools - day 2Workshop on command line tools - day 2
Workshop on command line tools - day 2
 
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
 
Periodic pattern mining
Periodic pattern miningPeriodic pattern mining
Periodic pattern mining
 
Periodic pattern mining
Periodic pattern miningPeriodic pattern mining
Periodic pattern mining
 

Más de Krish_ver2

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back trackingKrish_ver2
 
5.5 back track
5.5 back track5.5 back track
5.5 back trackKrish_ver2
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02Krish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructuresKrish_ver2
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructuresKrish_ver2
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithmKrish_ver2
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03Krish_ver2
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programmingKrish_ver2
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-iKrish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03Krish_ver2
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquerKrish_ver2
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03Krish_ver2
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02Krish_ver2
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing extKrish_ver2
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashingKrish_ver2
 

Más de Krish_ver2 (20)

5.5 back tracking
5.5 back tracking5.5 back tracking
5.5 back tracking
 
5.5 back track
5.5 back track5.5 back track
5.5 back track
 
5.5 back tracking 02
5.5 back tracking 025.5 back tracking 02
5.5 back tracking 02
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.4 randomized datastructures
5.4 randomized datastructures5.4 randomized datastructures
5.4 randomized datastructures
 
5.4 randamized algorithm
5.4 randamized algorithm5.4 randamized algorithm
5.4 randamized algorithm
 
5.3 dynamic programming 03
5.3 dynamic programming 035.3 dynamic programming 03
5.3 dynamic programming 03
 
5.3 dynamic programming
5.3 dynamic programming5.3 dynamic programming
5.3 dynamic programming
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.2 divide and conquer
5.2 divide and conquer5.2 divide and conquer
5.2 divide and conquer
 
5.2 divede and conquer 03
5.2 divede and conquer 035.2 divede and conquer 03
5.2 divede and conquer 03
 
5.1 greedyyy 02
5.1 greedyyy 025.1 greedyyy 02
5.1 greedyyy 02
 
5.1 greedy
5.1 greedy5.1 greedy
5.1 greedy
 
5.1 greedy 03
5.1 greedy 035.1 greedy 03
5.1 greedy 03
 
4.4 hashing02
4.4 hashing024.4 hashing02
4.4 hashing02
 
4.4 hashing
4.4 hashing4.4 hashing
4.4 hashing
 
4.4 hashing ext
4.4 hashing  ext4.4 hashing  ext
4.4 hashing ext
 
4.4 external hashing
4.4 external hashing4.4 external hashing
4.4 external hashing
 
4.2 bst
4.2 bst4.2 bst
4.2 bst
 

Último

Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxraviapr7
 
How to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesHow to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesCeline George
 
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptxClinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptxraviapr7
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...CaraSkikne1
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfMohonDas
 
Unveiling the Intricacies of Leishmania donovani: Structure, Life Cycle, Path...
Unveiling the Intricacies of Leishmania donovani: Structure, Life Cycle, Path...Unveiling the Intricacies of Leishmania donovani: Structure, Life Cycle, Path...
Unveiling the Intricacies of Leishmania donovani: Structure, Life Cycle, Path...Dr. Asif Anas
 
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfP4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfYu Kanazawa / Osaka University
 
Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.raviapr7
 
Department of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdfDepartment of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdfMohonDas
 
The Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsThe Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsEugene Lysak
 
10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdfJayanti Pande
 
Protein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptxProtein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptxvidhisharma994099
 
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINT
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINTARTICULAR DISC OF TEMPOROMANDIBULAR JOINT
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINTDR. SNEHA NAIR
 
Vani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
Vani Magazine - Quarterly Magazine of Seshadripuram Educational TrustVani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
Vani Magazine - Quarterly Magazine of Seshadripuram Educational TrustSavipriya Raghavendra
 
Ultra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxUltra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxDr. Asif Anas
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxraviapr7
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxAditiChauhan701637
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17Celine George
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRATanmoy Mishra
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptxmary850239
 

Último (20)

Education and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptxEducation and training program in the hospital APR.pptx
Education and training program in the hospital APR.pptx
 
How to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 SalesHow to Manage Cross-Selling in Odoo 17 Sales
How to Manage Cross-Selling in Odoo 17 Sales
 
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptxClinical Pharmacy  Introduction to Clinical Pharmacy, Concept of clinical pptx
Clinical Pharmacy Introduction to Clinical Pharmacy, Concept of clinical pptx
 
5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...5 charts on South Africa as a source country for international student recrui...
5 charts on South Africa as a source country for international student recrui...
 
Diploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdfDiploma in Nursing Admission Test Question Solution 2023.pdf
Diploma in Nursing Admission Test Question Solution 2023.pdf
 
Unveiling the Intricacies of Leishmania donovani: Structure, Life Cycle, Path...
Unveiling the Intricacies of Leishmania donovani: Structure, Life Cycle, Path...Unveiling the Intricacies of Leishmania donovani: Structure, Life Cycle, Path...
Unveiling the Intricacies of Leishmania donovani: Structure, Life Cycle, Path...
 
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdfP4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
P4C x ELT = P4ELT: Its Theoretical Background (Kanazawa, 2024 March).pdf
 
Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.Drug Information Services- DIC and Sources.
Drug Information Services- DIC and Sources.
 
Department of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdfDepartment of Health Compounder Question ‍Solution 2022.pdf
Department of Health Compounder Question ‍Solution 2022.pdf
 
The Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George WellsThe Stolen Bacillus by Herbert George Wells
The Stolen Bacillus by Herbert George Wells
 
10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf10 Topics For MBA Project Report [HR].pdf
10 Topics For MBA Project Report [HR].pdf
 
Protein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptxProtein Structure - threading Protein modelling pptx
Protein Structure - threading Protein modelling pptx
 
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINT
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINTARTICULAR DISC OF TEMPOROMANDIBULAR JOINT
ARTICULAR DISC OF TEMPOROMANDIBULAR JOINT
 
Vani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
Vani Magazine - Quarterly Magazine of Seshadripuram Educational TrustVani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
Vani Magazine - Quarterly Magazine of Seshadripuram Educational Trust
 
Ultra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptxUltra structure and life cycle of Plasmodium.pptx
Ultra structure and life cycle of Plasmodium.pptx
 
Prescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptxPrescribed medication order and communication skills.pptx
Prescribed medication order and communication skills.pptx
 
In - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptxIn - Vivo and In - Vitro Correlation.pptx
In - Vivo and In - Vitro Correlation.pptx
 
How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17How to Add Existing Field in One2Many Tree View in Odoo 17
How to Add Existing Field in One2Many Tree View in Odoo 17
 
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRADUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
DUST OF SNOW_BY ROBERT FROST_EDITED BY_ TANMOY MISHRA
 
3.21.24 The Origins of Black Power.pptx
3.21.24  The Origins of Black Power.pptx3.21.24  The Origins of Black Power.pptx
3.21.24 The Origins of Black Power.pptx
 

5.3 mining sequential patterns

  • 2. Sequential Pattern Mining  Sequence database – consists of sequences of ordered elements or events (with or without time)  Sequential Pattern Mining is the mining of frequently occurring ordered events or subsequences as patterns  Example:  Customer shopping sequences:  First buy computer, then CD-ROM, and then digital camera, within 3 months.  Web access patterns, Weather prediction  Usually categorical or symbolic data  Numeric data analysis – Time Series Analysis 2
  • 3. Sequential Pattern Mining  I = {I1, I2, …Ip} – Set of items  Sequence s = <e1 e2 e3… el>  Ordered list of events  Each event is an element of the sequence  In Shopping databases, event – shopping trip in which customers bought items (x1 x2 …xq)  All trips by customer form a sequence  Item can occur at most once in an event, but several times in a sequence  Sequence with length l : l-sequence  A sequence α = <a1a2…an> is a subsequence of β = <b1b2..bm> denoted as α ⊆ β if there exists integers j1, j2, … jn between 1 and m such that a1 ⊆ bj1, a2 ⊆ bj2,… an ⊆ bjn  α = <(ab),d> and β = <(abc),(de)> α is a sub-sequence of β 3
  • 4. Sequential Pattern Mining  A sequence database, S, is a set of tuples, <SID,s>, where SID is a sequence ID and s is a sequence  A tuple <SID, s> is said to contain a sequence a, if a is a subsequence of s  The support of a sequence α in a sequence database S is the number of tuples in the database containing α  Given the minimum support threshold, a sequence a is frequent in sequence database S if support S(a) >= min sup  A frequent sequence is called a sequential pattern  A sequential pattern with length l is called an l-pattern 4
  • 5. Sequential Patterns  Length of Sequence 1 – 9  It contains ‘a’ multiple times but sequence 1 will contribute only one to the support of <a>  Sequence <a(bc)df> - Subsequence of Sequence 1  Support for <(ab)c> is 2 (Present in 1 and 3)  Frequent as it satisfies minimum support of 2 5 SID Sequence 1 <a(abc)(ac)d(cf)> 2 <(ad)c(bc)(ae)> 3 <(ef)(ab)(df)cb> 4 <eg(af)cbc>
  • 6. Scalable Methods for Mining Sequential Patterns  Full Set Vs Closed Set  A sequential patterns s is closed if there exists no s’ where s’ is a proper super-sequence of s and s’ has same support as s  Subsequences of a frequent sequence are also ferquent  Mining closed sequential patterns avoids generation of unnecessary sub-sequences  GSP – Candidate generate and test approach on horizontal data format  SPADE – Candidate generate and test approach on vertical data format  PrefixScan – does not require candidate generation  All approaches exploit Apriori property – every non empty subsequence of a sequential pattern is a sequential pattern 6
  • 7. GSP: Sequential Pattern Mining  GSP (Generalized Sequential Patterns)  Multi-pass, Candidate generate and test approach proposed by Agrawal and Srikant  Outline of the method  Initially, every item in DB is a candidate of length-1  for each level (i.e., sequences of length-k) do  Scan database to collect support count for each candidate sequence  Generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori  A (k-1)-sequence w1 is merged with another (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2  repeat until no frequent sequence or no candidate can be found  Major strength: Candidate pruning by Apriori  Weakness : Generates large number of candidates 7
  • 8. Generalized Sequential Patterns 8  Initial candidates: all singleton sequences  <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>  Scan database once, count support for candidates Seq. ID Sequence 1 <(cd)(abc)(abf)(acdf)> 2 <(abf)(e)> 3 <(abf)> 4 <(dgh)(bf)(agh)> Cand Sup <a> 4 <b> 4 <c> 1 <d> 2 <e> 1 <f> 4 <g> 1 <h> 1
  • 9. Generalized Sequential Patterns 9 Cand Sup <a> 4 <b> 4 <d> 2 <f> 4 Length 2 Candidates generated by joinLength 2 Candidates generated by join <aa> <ab> <ad> <af> <ba> <bb> <bd> <bf> <da> <db><dd> <df> <fa> <fb> <fd> <ff> <(ab)> <(ad)> <(af)> <(bd)> <(bf)> <(df)> Seq. ID Sequence 1 <(cd)(abc)(abf)(acdf)> 2 <(abf)(e)> 3 <(abf)> 4 <(dgh)(bf)(agh)> Length 2 Frequent SequencesLength 2 Frequent Sequences <ba> <da> <db> <df> <fa> <(ab)> <(af)> <(bf)>
  • 10. Generalized Sequential Patterns 10 Length 2 Frequent SequencesLength 2 Frequent Sequences <ba> <da> <db> <df> <fa> <(ab)> <(af)> <(bf)> Length 3 Candidates generated by joinLength 3 Candidates generated by join <ba> and <(ab)> - <b(ab)> {1} <ba> and <(af)> - <b(af)> {1} <da> and <(ab)> - <d(ab)> {1} <da> and <(af)> - <d(af)> {1} <db> and <(bf)> - <d(bf)> {1, 4} <db> and <ba> - <dba> {1, 4} <df> and <fa> - <dfa> {1, 4} <fa> and <(ab)> - <f(ab)> - <fa> and <(af)> - <f(af)> {1} <(ab)> and <(bf)> - <(abf)> {1,2,3} <(ab)> and <ba> - <(ab)a> {1} <(af)> and <fa> - <(af)a) {1} <(bf)> and <fa> - <(bf)a> {1, 4} Seq. ID Sequence 1 <(cd)(abc)(abf)(acdf)> 2 <(abf)(e)> 3 <(abf)> 4 <(dgh)(bf)(agh)> Length 3 Frequent SequencesLength 3 Frequent Sequences <dba> <dfa> <(abf)> <(bf)a> <d(bf)> Length 4 Candidates generated by joinLength 4 Candidates generated by join <d(bf)> and <(bf)a> - <d(bf)a> {1, 4} <(abf)> and <(bf)a> - <(abf)a> {1}
  • 11. SPADE: Sequential Pattern Mining on Vertical data format  SPADE – Sequential PAttern Discovery using Equivalent classes  Vertical data format <itemset: (sequence_ID, event_ID)>  ID_list : Set of (sequence_ID, event_ID) pairs for a given itemset  Mapping from horizontal to vertical format requires one scan  Support of k-sequences can be determined by joining the ID lists of (k-1) sequences  To find candidate 2-sequences, all single items are joined if they are frequent, if they share the same sequence identifier and if their event identifier follows a sequential ordering  Patterns are grown similarly 11
  • 13. SPADE - Example 13 Seq. ID Sequence 1 <(cd)(abc)(abf)(acdf)> 2 <(abf)(e)> 3 <(abf)> 4 <(dgh)(bf)(agh)> SID EID Itemset 1 1 cd 1 2 abc 1 3 abf 1 4 acdf 2 1 abf 2 2 e 3 1 abf 4 1 dgh 4 2 bf 4 3 agh Vertical Data Format
  • 14. SPADE - Example 14 SID EID Itemset 1 1 cd 1 2 abc 1 3 abf 1 4 acdf 2 1 abf 2 2 e 3 1 abf 4 1 dgh 4 2 bf 4 3 agh Vertical Data Format SID EID 1 2 1 3 1 4 2 1 3 1 4 3 a SID EID 1 2 1 3 2 1 3 1 4 2 b SID EID 1 1 1 2 1 4 c SID EID 1 1 1 4 4 1 d SID EID 2 2 e SID EID 1 3 1 4 2 1 3 1 4 2 f SID EID 4 1 4 3 g SID EID 4 1 4 3 h (ab) – 1,2; 1,3; 2,1; 3,1 (ad) – 1,4 (af) – 1,4; 2,1;3,1 (bd) – NIL (bf) – 1,3; 2,1; 3,1;4,2 (df) – 1,4
  • 15. SPADE - Example 15 SID EID 1 2 1 3 1 4 2 1 3 1 4 3 a SID EID 1 2 1 3 2 1 3 1 4 2 b SID EID 1 1 1 4 4 1 d SID EID 1 3 1 4 2 1 3 1 4 2 f SID EID(a) EID(a) 1 2 3 1 3 4 1 2 4 aa SID EID(a) EID(b) 1 2 3 ab SID EID(a) EID(d) 1 2 4 1 3 4 ad SID EID(a) EID(b) 1 2 3 1 3 4 af SID EID(b) EID(a) 1 2 3 1 3 4 4 2 3 ba SID EID(b) EID(b) 1 2 3 bb SID EID(b) EID(d) 1 2 4 1 3 4 bd SID EID(b) EID(f) 1 2 3 1 3 4 bf
  • 16. SPADE - Example 16 SID EID 1 2 1 3 1 4 2 1 3 1 4 3 a SID EID 1 2 1 3 2 1 3 1 4 2 b SID EID 1 1 1 4 4 1 d SID EID 1 3 1 4 2 1 3 1 4 2 f SID EID(d) EID(a) 1 1 2 1 1 3 1 1 4 4 1 3 da SID EID(d) EID(b) 1 1 2 1 1 3 4 1 2 db SID EID(d) EID(d) 1 1 4 dd SID EID(d) EID(f) 1 1 3 1 1 4 4 1 2 df SID EID(f) EID(a) 1 3 4 4 2 3 fa SID EID(f) EID(b) - - - fb SID EID(f) EID(d) 1 3 4 fd SID EID(f) EID(f) 1 3 4 ff
  • 17. SPADE  Reduces scans of the sequence database  As the length of the frequent sequence increases, the size of ID_list decreases – results in fast joins  But large set of candidates are still generated 17
  • 18. PrefixSpan: Prefix-Projected Sequential Pattern Growth  Pattern Growth – does not require candidate generation  Finds frequent single items  Constructs FP-tree  Projected databases associated with each frequent item are generated from FP-tree  Builds prefix patterns which it concatenates with suffix patterns to find frequent patterns 18
  • 19. PrefixSpan  Given a sequence α = <e1e2…en> (where each ei corresponds to a frequent event), a sequence β = <e’1e’2…e’m> (m<=n) is called a prefix of α iff  e'i = ei for i<=m-1  e'm ⊆ em  All frequent items in (em– e’m) are alphabetically after those in e’m  Sequence γ = <e”m em+1…en> is called the suffix of α wrt prefix β denoted as γ = α/β where e”m = (em – e’m ) 19 <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)> Prefix Suffix (Prefix-Based Projection) <a> <(abc)(ac)d(cf)> <aa> <(_bc)(ac)d(cf)> <a(ab)> <(_c)(ac)d(cf)>
  • 20. PrefixSpan  Mining Sequential patterns:  {<x1>,<x2>,…<xn>} – complete set of length-1 sequential patterns.  The complete set of Sequential patterns in S can be partitioned into n disjoint subsets.  ith subset is the set of sequential patterns with prefix <xi>  α : length-l sequential pattern and {β1, β2,… βm} – set of all length (l+1) sequential patterns with prefix α.  The complete set of sequential patterns with α as a prefix can be partitioned into m disjoint subsets  jth subset is the set of patterns prefixed with βj 20
  • 21. PrefixSpan Example  Step 1: find length-1 sequential patterns  <a>, <b>, <c>, <d>, <e>, <f>  Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets:  The ones having prefix <a>;  The ones having prefix <b>;  …  The ones having prefix <f> 21 SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc>
  • 22. PrefixSpan : Example  Only need to consider projections w.r.t. <a>  <a>-projected database: (only first occurrence of a is considered) <(abc) (ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>  In <a> projected database – frequent items are a:2, b:4, _b:2, c:4, d:2 and f:2  Find all the length-2 seq. pat. Having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>  Further partition into 6 subsets  Having prefix <aa> - {< (_bc)(ac)d(cf)>, <(_e)>} – No frequent subsequences  Having prefix <(ab)> - <(_c)(ac)d(cf)> and <(df)cb>  Frequent patterns: <c> <d> <f> <dc> : <(ab)c> <(ab)d> <(ab)f> <(ab)dc>  Having prefix <ac> - <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>  Frequent patterns: <a> <b> <c> : <aca> <acb> <acc> …. 22 SID sequence 10 <a(abc)(ac)d(cf)> 20 <(ad)c(bc)(ae)> 30 <(ef)(ab)(df)cb> 40 <eg(af)cbc>
  • 23. PrefixSpan  No candidate sequence needs to be generated  Projected databases keep shrinking  Major cost of PrefixSpan: constructing projected databases  Can be improved by pseudo-projections  Pseudo-projection  Registers the index of the corresponding sequence and the starting position of the projected suffix in the sequence instead of physical projection  Avoids physically copying postfixes  Efficient in running time and space when database can be held in main memory  For large data combination of physical and pseudo projection 23
  • 24. Sequential Pattern Mining  Performance rating : PrefixSpan, SPADE, GSP  All three are slow when there is a large number of frequent subsequences  Closed Sequential Patterns  Closed Subsequences – contain no super sequence with the same support  Reduces number of sequences considered  CloSpan  Based on equivalence of projected databases  Two projected sequence databases are equivalent iff the total number of items match  CloSpan prunes non-closed sequences whenever two projected databases are exactly the same by stopping the growth of one  Requires Post-processing to eliminate any remaining non-closed sequential patterns  BIDE (Bidirectional Search) algorithm avoids additional checking 24
  • 25. Sequential Pattern Mining  Multi-dimensional, multi-level Sequential patterns  Additional information maybe associated with Sequence ID – Customer age, address, group and profession  Additional information associated with items – item category, brand, model type, model number, place…  Example: “Retired customers who purchase a digital camera are likely to purchase a color printer within a month”  Additional information can be attached with Sequence ID / Item ID 25
  • 26. Constraint based Mining of Sequential Patterns  Un-focused mining reduces the efficiency and usability of frequent-pattern mining  Constraint based mining incorporates user-specified constraints to reduce the search space  Regular expressions can be used to specify pattern templates  Helps to improve efficiency of mining and interestingness of patterns 26
  • 27. Constraint based Mining  Constraints related to duration T  Constraints related to maximal or minimal length – anti- monotonic / monotonic constraints  Anti-monotonic constraint: T<= 10  Monotonic : T > 10  Succinct : T = 2005  Periodic patterns – related to sets of partitioned sequences such as every two weeks before and after an earthquake 27
  • 28. Constraint based Mining  Event folding window w  specifies the periodicity for treating events as occurring together  w=0 – No event sequence folding  w=T – time-insensitive frequent patterns  Gap between events  Gap=0 – Strictly consecutive sequential patterns  min_gap and max_gap  Exact gaps and approximate gaps  Serial Episodes  Set of events occurring in total order  Parallel Episodes  Occurrence order is trivial  Examples  (A|B)C*(D|E) – A and B first occur (relative order is unimportant) followed by any number of C events followed by D and E (in any order)  C = <a*{bb|(bc)d|dd}> a-projected databases followed by SuffixSpan 28
  • 29. Periodicity Analysis for Time-Related Sequence data  Periodicity analysis – mining of periodic patterns – searching for recurring patterns in time-related sequence data  Seasons, tides, traffic patterns, power consumption  Often performed over time-series data  Full periodic pattern  Every point in time contributes (precisely or approximately) to the periodicity  Example: All days in a year – contribute to season cycle  Partial periodic pattern  Specifies periodic behavior of a time-related sequence at some but not all points of time  Example: ABC reads the paper between 7:00-7:30 am every week day 29
  • 30. Periodicity Analysis  Synchronous periodicity – event occurs at a relatively fixed offset in each “stable” period  3 pm every day  Asynchronous periodicity – event fluctuates in loosely defined period  Precise or approximate – depending on data value or offset within a period  Mining partial periodicity – leads to the discovery of cyclic or periodic association rules (rules that associate a set of events that occur periodically)  Example: If tea sells well between 3 – 5 pm dinner will also sell well between 7 – 9 pm on weekends 30